6.4 Analytic Technique Tutorial


Historically, linear regression has been one of the most
commonly applied techniques for forecasting computer
requirements (ART79).  Simple Regression Analysis uses linear
regression to analyze a series of observations of user-selected
variables from any of the files in the CA MICS Database, or of
user-defined variables that are computed from data elements
contained in the CA MICS Database.  This series of historical
observations can be developed from any of the WEEKS, MONTHS,
or DAYS timespans.  The historical series is
analyzed using SAS PROC REG to develop a linear model that
relates the user-specified variable to the independent
variable time.  This section provides an introduction to the
regression analysis technique that the facility uses.

The fundamental concept of linear regression (MIL65) is that
the mean of some dependent variable y is a linear function of
an independent variable x.  That is, for any given x, the mean
of the distribution of y can be computed as:

                   y = b + m * x                      (Eqn 1)

     where b is a constant that defines the mean value of y
               when x is zero

            m  is the slope of the line

The estimated y values exhibit differences from the actual y
values that are used to produce the linear model.  These
differences are denoted by an error term, e.  Therefore, we
must expand Equation 1 to include an error term as shown in
Equation 2:

                   y = b + m * x + e                  (Eqn 2)

The error term represents small random variations about the
expected mean value of y.  The value of e for any given
observation depends on possible measurement errors and
variables other than x which may influence y.

To give an example of linear regression, consider the
following data which describes CPU utilization (% CPU) as a
function of the number of active address spaces (ASIDs):

obs          1      2      3      4      5      6      7
------ ------------------------------------------------------
ASIDs       10     15     20     25     30     35     40
------ ------------------------------------------------------
% CPU       50     59     62     69     75     78     83

As the SAS plot in Figure 6-2 indicates, an approximately
linear relationship exists between percent CPU busy and the
number of active address spaces for the data presented in the
previous table.

Thus, if you hypothesize that a linear relationship exists
between the number of active address spaces and percent CPU
busy, you are faced with the problem of how to estimate the
parameters m and b that are introduced in Equations 1 and 2.
The statistical technique used to estimate these parameters
is called the method of least squares.  The method of least
squares produces estimates of the parameters m and b such
that the sum of the squares of the error terms, e, over all
of the observations is as small as possible.
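
As an illustration of the criterion, least squares is
conventionally defined as choosing m and b to minimize the sum
of the squared error terms over all observations.  The Python
fragment below is a minimal sketch and is not part of the
product.  It evaluates that sum for two candidate lines drawn
through the sample data.  The function name sum_squared_errors
and both candidate lines are assumptions chosen only for this
illustration.

```python
# Sample observations from the ASIDs / % CPU table.
asids = [10, 15, 20, 25, 30, 35, 40]
cpu_busy = [50, 59, 62, 69, 75, 78, 83]


def sum_squared_errors(b, m, xs, ys):
    """Return the sum of squared differences between y and b + m * x."""
    return sum((y - (b + m * x)) ** 2 for x, y in zip(xs, ys))


# Candidate 1 (hypothetical): the line through the first and last
# observations, which has slope 1.1 and intercept 39.
sse_endpoints = sum_squared_errors(39.0, 1.1, asids, cpu_busy)

# Candidate 2 (hypothetical): the same slope with the intercept raised by 2.
sse_shifted = sum_squared_errors(41.0, 1.1, asids, cpu_busy)

# Under this criterion, the candidate with the smaller sum fits better.
print(sse_endpoints, sse_shifted)
```

The least squares estimates are the values of b and m for
which this sum is smaller than for any other line.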

Since SAS PROC REG is used to estimate the values of the
parameters m and b, you need not understand the details of
the least squares technique to use the software.  However, a
brief description of the least squares method is presented
below to provide you with an overview of the technique.

The technique for estimating the values of m and b initially
requires the computation of five parameters:

    SX   The sum of the x observations

    SY   The sum of the y observations

    SSX  The sum of the squares of the x observations

    SXY  The sum of x times y for all of the observations

    n    The number of observations

These parameters are used in the following equations to
determine the values of m and b. The equations are called the
"normal equations."

    SY  =  b * n  +  m * SX                           (Eqn 3)

    SXY =  b * SX +  m * SSX                          (Eqn 4)

The normal equations are a set of two linear equations which
can be solved simultaneously to determine the values of the
parameters m and b.
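
Because Equations 3 and 4 form a two-by-two linear system, they
can be solved in closed form.  The following Python sketch
applies Cramer's rule to the five quantities; the function name
solve_normal_equations is an assumption made for this
illustration and is not part of the product or of SAS.  The
check uses three points that lie exactly on a known line, so
the expected intercept and slope are known in advance.

```python
def solve_normal_equations(n, sx, sy, ssx, sxy):
    """Solve Eqn 3 and Eqn 4 for (b, m) using Cramer's rule."""
    det = n * ssx - sx * sx
    if det == 0:
        # Happens when every x observation is the same (or there are none).
        raise ValueError("x observations do not vary; slope is undefined")
    m = (n * sxy - sx * sy) / det
    b = (sy * ssx - sx * sxy) / det
    return b, m


# Three points on the line y = 2 + 3x: (0, 2), (1, 5), (2, 8).
# n = 3, SX = 3, SY = 15, SSX = 5, SXY = 21.
b, m = solve_normal_equations(n=3, sx=3, sy=15, ssx=5, sxy=21)
print(b, m)
```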

For the sample problem previously discussed, the values of
SX, SY, SSX, SXY, and n are:

    SX   =     175             SXY  =  12,650
    SY   =     476             n    =       7
    SSX  =   5,075
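
These sums can be checked directly from the table of
observations.  The short Python fragment below is only an
illustrative sketch; the variable names are assumptions and do
not correspond to CA MICS data elements.

```python
asids = [10, 15, 20, 25, 30, 35, 40]      # x observations
cpu_busy = [50, 59, 62, 69, 75, 78, 83]   # y observations

n = len(asids)                                      # number of observations
sx = sum(asids)                                     # sum of x
sy = sum(cpu_busy)                                  # sum of y
ssx = sum(x * x for x in asids)                     # sum of x squared
sxy = sum(x * y for x, y in zip(asids, cpu_busy))   # sum of x times y

print(f"SX={sx} SY={sy} SSX={ssx} SXY={sxy} n={n}")
```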

Substituting these values into the normal equations,
Equations 3 and 4, we produce the following:

                 476  =   7b   +    175m

              12,650  = 175b   +  5,075m

In solving these equations, you will find that b = 41.21 and
m = 1.07.  Using these values, you can expand the table of
data that presents the actual CPU utilization and ASID count
observations to include the estimated value of percent CPU
busy (denoted by y est) and the value of the error term, e.


obs          1      2      3      4      5      6      7
------ ------------------------------------------------------
ASIDs       10     15     20     25     30     35     40
------ ------------------------------------------------------
% CPU       50     59     62     69     75     78     83
------ ------------------------------------------------------
y est       51.9   57.3   62.6   68.0   73.4   78.7   84.1
------ ------------------------------------------------------
  e         -1.9   +1.7   -0.6   +1.0   +1.6   -0.7   -1.1
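
The estimated values and error terms can be reproduced with a
few lines of Python.  This is a sketch for illustration only
and is not how the facility itself performs the calculation,
which uses SAS PROC REG.  It solves the normal equations from
the sums and then evaluates y est = b + m * x and
e = y - y est for each observation.  Because it carries the
unrounded coefficients through the calculation, its output can
differ in the last printed digit from figures worked by hand
with rounded coefficients.

```python
asids = [10, 15, 20, 25, 30, 35, 40]
cpu_busy = [50, 59, 62, 69, 75, 78, 83]

n = len(asids)
sx, sy = sum(asids), sum(cpu_busy)
ssx = sum(x * x for x in asids)
sxy = sum(x * y for x, y in zip(asids, cpu_busy))

# Solve the two normal equations for the slope m and intercept b.
det = n * ssx - sx * sx
m = (n * sxy - sx * sy) / det
b = (sy - m * sx) / n          # rearrangement of Eqn 3

# Estimated y and error term for every observation.
y_est = [b + m * x for x in asids]
errors = [y - est for y, est in zip(cpu_busy, y_est)]

print(f"b = {b:.2f}, m = {m:.2f}")
for x, y, est, e in zip(asids, cpu_busy, y_est, errors):
    print(f"{x:5d} {y:5d} {est:7.1f} {e:+6.1f}")
```

When the model includes the constant b, the error terms of a
least squares fit sum to zero, which is a quick check on a
hand calculation.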

Although the regression of percent CPU busy as a function of
number of active address spaces achieved good results, the
same analysis repeated with data from a different system
could produce a much poorer fit.  We alluded to the
reason for this potential variation when we introduced the
concept of the error term in an earlier paragraph.  The error
term represents potential measurement errors and/or
influences on the y variable from independent variables other
than x.

In the case of percent CPU busy, this analysis is also
influenced by the CPU absorption rate of each of the address
spaces.  For the example presented above, the rates are
apparently similar.  However, for systems processing a wide
variety of workload types, differences in CPU absorption
rates for the individual address spaces could result in
significant differences between the actual and estimated
observations.  Statistical estimators for determining the
goodness of fit of regression models are discussed in Chapter
7, Univariate Model Forecasting.

       PLOT OF CPU BUSY VERSUS NUMBER OF ACTIVE ASIDS

       PLOT OF CPUBUSY*ASIDS     SYMBOL USED IS *

 [Plot: PERCENT CPU BUSY on the vertical axis (50 to 85) against
  NUMBER OF ACTIVE ASIDS on the horizontal axis (10 to 40)]


 Figure 6-2.  Percent CPU Busy vs. ASIDs