

7.4 Analytic Technique Tutorial


Regression is one of the most popular techniques used to
forecast computer requirements (ART79).  Univariate Model
Forecasting uses regression to analyze historical
observations from either of two sources:

o  Capacity planning resource element files based on data
   from any of the files in the CA MICS database

o  Capacity planning business element files

You can develop a series of historical observations in a
resource element file from the DAYS, WEEKS, or MONTHS
timespans in the CA MICS database. You can extract each data
element in such a resource file directly from a CA MICS file
or calculate it from one or more CA MICS data elements
obtained from one or more CA MICS files.

The historical series in a business element file can
represent various volume indicators related to your
organization's business or any other non-CA MICS data source.

Either type of historical series is analyzed during
Univariate Model Forecasting by using SAS PROC REG (SAS81) to
develop models (linear, quadratic, or cubic) that relate
user-specified variables to time as the independent variable.
Thus, Univariate Model Forecasting is based upon the
hypothesis that there is a trend in the historical series of
data that can aid in predicting the future.
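
For illustration only, the following Python sketch (not part
of CA MICS; the data values are hypothetical) fits linear,
quadratic, and cubic trend models to a short historical
series, using the observation number as the time variable,
much as Univariate Model Forecasting does with SAS PROC REG.

    import numpy as np

    # Hypothetical historical series (for example, weekly average
    # percent CPU busy observations)
    y = np.array([48.0, 51.0, 55.0, 54.0, 59.0, 63.0, 62.0, 68.0])
    t = np.arange(1, len(y) + 1)       # time is the independent variable

    # Fit trend models of degree 1 (linear), 2 (quadratic), 3 (cubic)
    for degree in (1, 2, 3):
        coeffs = np.polyfit(t, y, degree)   # least-squares fit
        trend = np.poly1d(coeffs)
        print(degree, trend(len(y) + 4))    # forecast 4 periods ahead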

SIMPLE REGRESSION

To review the concepts of simple regression (MIL65),
hypothesize that some dependent variable y is a linear
function of x.  That is, for any given x, the mean of the
distribution of y values can be computed as

                   y = b + m * x                      (Eqn 3)

    where b is a constant that defines the mean value of y
             when x is zero

    and m is the slope of the line

The estimated y values differ from the actual y values that
are used to produce the linear model.  These
differences are denoted by an error term, e.  Therefore, we
can expand Equation 3 to include an error term as shown in
Equation 4:

                   y = b + m * x + e                  (Eqn 4)

Basically, the error term represents small random variations
about the expected mean value of y.  The value of e for any
given observation depends on possible measurement errors and
variables other than x that may influence y.
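
As a rough illustration of Equation 4, the following Python
sketch (the values of b, m, and the noise level are assumed
for illustration) generates y observations that vary randomly
about the line b + m * x.

    import numpy as np

    rng = np.random.default_rng(1)

    b, m = 40.0, 1.0                          # assumed parameters
    x = np.arange(10.0, 45.0, 5.0)            # 10, 15, ..., 40

    e = rng.normal(0.0, 1.5, size=x.size)     # small random variations
    y = b + m * x + e                         # Eqn 4
    print(np.column_stack((x, b + m * x, y)))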

To see an example of linear regression, consider the
following data, which describes CPU utilization (% CPU) as a
function of the number of active address spaces (ASIDs):

obs   |      1      2      3      4      5      6      7
------|------------------------------------------------------
ASIDs |     10     15     20     25     30     35     40
------|------------------------------------------------------
% CPU |     50     59     62     69     75     78     83


As the SAS plot shown in Figure 6-2 indicates, a linear
relationship exists between percent CPU busy and the number
of active address spaces for the data presented in the
previous table.
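
Figure 6-2 itself is a SAS plot; as a stand-in, a quick
Python/matplotlib sketch of the same scatter (using the
table's values) would look like this:

    import matplotlib.pyplot as plt

    asids = [10, 15, 20, 25, 30, 35, 40]
    cpu   = [50, 59, 62, 69, 75, 78, 83]

    plt.scatter(asids, cpu)
    plt.xlabel("Active address spaces (ASIDs)")
    plt.ylabel("Percent CPU busy")
    plt.title("CPU utilization vs. active address spaces")
    plt.show()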


METHOD OF LEAST SQUARES

Thus, if you hypothesize that a linear relationship exists
between the number of active address spaces and % CPU busy,
you have the problem of how to estimate the parameters m and
b, which we introduced previously in Equations 3 and 4.  The
statistical technique used to estimate these parameters is
called the method of least squares.  The method of least
squares produces estimates of the parameters m and b such
that the sum of the squared error terms, e, over all of the
observations is as small as possible.

Since you can use SAS PROC REG to estimate the values of the
parameters m and b, you need not understand the details of
the least squares technique to use the facility.  However, a
brief description of the least squares method is presented
below to provide you with an overview of the technique.  The
technique for estimating the values of m and b initially
requires computing five parameters:

    SX:      The sum of the x observations

    SY:      The sum of the y observations

    SSX:     The sum of the squares of the x observations

    SXY:     The sum of x times y for all of the observations

    n:       The number of observations

The next two parameters are not required to calculate the
values of m and b; however, they are used to calculate two
other values that will be used later to evaluate the fit of
the model.  These values are the following:

    SSY      The sum of the squares of the y observations
    _
    Y        The average of the y observations

The first set of parameters is used in the following
equations to determine the values of m and b.  These
equations are called the "normal equations":

    SY  =  b * n  +  m * SX                           (Eqn 5)

    SXY =  b * SX +  m * SSX                          (Eqn 6)

The normal equations are a set of two linear equations that
can be solved simultaneously to determine the values of the
parameters m and b.
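
A minimal Python sketch of this procedure (a hypothetical
helper, not part of CA MICS or SAS) accumulates the sums
described above and solves the normal equations for b and m:

    def least_squares_line(x, y):
        """Solve the normal equations (Eqns 5 and 6) for b and m."""
        n   = len(x)
        sx  = sum(x)                                  # SX
        sy  = sum(y)                                  # SY
        ssx = sum(xi * xi for xi in x)                # SSX
        sxy = sum(xi * yi for xi, yi in zip(x, y))    # SXY

        # Eliminating b from the two normal equations gives:
        m = (n * sxy - sx * sy) / (n * ssx - sx * sx)
        b = (sy - m * sx) / n
        return b, m

    # Sample problem: ASIDs versus percent CPU busy
    asids = [10, 15, 20, 25, 30, 35, 40]
    cpu   = [50, 59, 62, 69, 75, 78, 83]
    print(least_squares_line(asids, cpu))   # approximately (41.21, 1.07)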

For the sample problem previously discussed, we determined
the values of these parameters to be

    SX   =     175             SXY  =  12,650
                               _
    SY   =     476             Y    =      68

    SSX  =   5,075             n    =       7

    SSY  =  33,184

Substituting these values into the normal equations
(Equations 5 and 6) produces the following:

        476  =    7b  +    175m

     12,650  =  175b  +  5,075m

Solving these equations simultaneously (for example,
multiplying the first equation by 25 and subtracting the
result from the second eliminates b, leaving 700m = 750)
gives m = 1.07 and b = 41.21.  Using these values, you can
expand the table of data that presented the actual CPU
utilization and ASID count observations to include the
estimated value of percent CPU busy (denoted by y est) and
the value of the error term, e.

obs   |      1      2      3      4      5      6      7
------|------------------------------------------------------
ASIDs |     10     15     20     25     30     35     40
------|------------------------------------------------------
% CPU |     50     59     62     69     75     78     83
------|------------------------------------------------------
y est |     52.0   57.4   62.7   68.1   73.4   78.8   84.1
------|------------------------------------------------------
  e   |     -2.0   +1.6   -0.7   +0.9   +1.6   -0.8   -1.1
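
The y est and e rows can be checked with a few lines of
Python using the estimates from the text (small differences
are possible because b and m are rounded):

    b, m = 41.21, 1.07                       # estimates from the text
    asids = [10, 15, 20, 25, 30, 35, 40]
    cpu   = [50, 59, 62, 69, 75, 78, 83]

    for x, y in zip(asids, cpu):
        y_est = b + m * x                    # estimated percent CPU busy
        e = y - y_est                        # error term
        print(x, y, round(y_est, 1), round(e, 1))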


Although this regression of percent CPU busy as a function of
the number of active address spaces achieves good results,
the same analysis repeated with data from a different system
could produce much less reliable results.  We alluded to the
reason for this potential variation when we introduced the
concept of the error term in an earlier paragraph.  The error
term represents potential measurement errors and/or
influences on the y variable from independent variables other
than x.

In the case of percent CPU busy, this analysis is also
influenced by the CPU absorption rate of each of the address
spaces.  For the example presented above, the rates are
similar among the various address spaces.  However, for
systems processing a wide variety of workload types,
differences in CPU absorption rates for the individual
address spaces could result in significant differences
between the actual and estimated observations.

The following sections discuss statistical estimators for
evaluating the fit of regression models:

 1 - Linear, Quadratic, and Cubic Models
 2 - Processing Historical Observations