7.4.2.1 Deleting Errant Observations


The following table shows average CPU utilizations collected
for a 3090-200J processor, and illustrates the problems that
often occur with historical observations:


              Week     Observation     %  CPU
             Ending      Number         BUSY
            =======    ===========     ======
            31OCT97         1           71.0
            07NOV97         2           72.0
            14NOV97         3           72.2
            21NOV97         4           73.8
            28NOV97         5           62.5
            05DEC97         6           74.0
            12DEC97         7           75.2
            19DEC97         8           75.0
            26DEC97         9           53.7
            02JAN98        10           61.0
            09JAN98        11           76.4
            16JAN98        12           78.0


Figure 7-4 shows a scatter plot of the data.  A linear
regression model developed for this historical CPU
utilization data has the following parameters:

    n  =     12, the number of historical observations

    b  =  62.70, the y intercept

    m  =   1.17, the slope of the line

     2
    r  =   0.25, the coefficient of determination

    F  =   0.02, the F value

    p  =   0.90, the probability of an F value this large if
                 the true slope were zero

    s  =   6.68, the standard error
     e

The predicted and residual values for the historical data
series are shown in the following table:

      Week     Observation     %  CPU      Est      Residual
     Ending      Number         BUSY      % CPU     (error)
    =======    ===========     ======     =====     ========
    31OCT97         1           71.0       63.9        7.1
    07NOV97         2           72.0       65.1        6.9
    14NOV97         3           72.2       66.2        6.0
    21NOV97         4           73.8       67.4        6.4
    28NOV97         5           62.5       68.6       -6.1
    05DEC97         6           74.0       69.8        4.2
    12DEC97         7           75.2       70.9        4.3
    19DEC97         8           75.0       72.1        2.9
    26DEC97         9           53.7       73.3      -19.6
    02JAN98        10           61.0       74.5      -13.5
    09JAN98        11           76.4       75.6        0.8
    16JAN98        12           78.0       76.8        1.2
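
The Est % CPU and Residual columns follow directly from the
model parameters: each estimate is b + m times the observation
number, and each residual is the observed value minus the
estimate.  The arithmetic can be sketched in Python as shown
below.  This is only an illustration, assuming the rounded
values of b and m quoted above; the names estimate and
residuals are hypothetical and are not part of the product.
Because b and m are rounded to two decimals, the results can
differ from the table in the last digit.

```python
# Observed weekly % CPU busy, observation numbers 1..12 (from the table).
cpu_busy = [71.0, 72.0, 72.2, 73.8, 62.5, 74.0,
            75.2, 75.0, 53.7, 61.0, 76.4, 78.0]

B = 62.70  # y intercept, as quoted in the text (rounded)
M = 1.17   # slope, as quoted in the text (rounded)


def estimate(obs_number, b=B, m=M):
    """Estimated % CPU busy on the fitted line for one observation number."""
    return b + m * obs_number


def residuals(values, b=B, m=M):
    """Return (observation number, observed, estimate, residual) tuples.

    The residual is observed minus estimated, so a negative residual
    means the week ran below the fitted line.
    """
    rows = []
    for obs_number, observed in enumerate(values, start=1):
        est = estimate(obs_number, b, m)
        rows.append((obs_number, observed, est, observed - est))
    return rows


if __name__ == "__main__":
    for obs_number, observed, est, resid in residuals(cpu_busy):
        print(f"{obs_number:2d}  {observed:5.1f}  {est:5.1f}  {resid:6.1f}")
```

The largest negative residuals fall on observations 5, 9, and
10, which is how the errant weeks stand out from the rest of
the series.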


As you can see in the model parameters and residual values in
this table, the proposed model fits the historical data very
poorly.  In many cases, these problems are introduced by
poorly behaved historical data rather than by the type of
model selected by the analyst.  In this example, three
observations in the historical data (28NOV97, 26DEC97, and
02JAN98) are significantly different from the remainder of
the historical data points.  Investigation reveals that these
three weeks represent holidays, presenting two alternatives:

o  Compensating the historical data points.  For example, you
   could attempt to compensate for the reduced holiday
   workload by multiplying the observed values by some
   constant.  Unfortunately, such constants are guesses made
   by the analyst.  Therefore, we do not recommend that you
   compensate historical data.

o  Deleting the errant historical data points.  Although this
   reduces the number of points available for developing the
   model, it does not introduce any of the analyst's biases
   into the modeling process and is statistically defensible,
   since these weeks really do represent a different category
   of work for the processor.

Deleting the historical observations for the holiday weeks
results in a substantially better model, giving significantly
improved parameters.  The parameters of the model are shown
below:

    n  =      9, the number of observations

    b  =  70.78, the y intercept

    m  =   0.57, the slope of the line

     2
    r  =   0.93, the coefficient of determination

    F  =    162, the F value

    p  = 0.0001, the probability of an F value this large if
                 the true slope were zero

    s  =   0.64, the standard error
     e
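
The deletion and refit can be sketched in Python as shown
below.  This is a minimal illustration using ordinary least
squares, not the product's own procedure; the names fit_line
and HOLIDAY_WEEKS are hypothetical.  It covers only the
intercept, slope, and F value.

```python
# Weekly % CPU busy keyed by observation number (from the first table).
cpu_busy = {1: 71.0, 2: 72.0, 3: 72.2, 4: 73.8, 5: 62.5, 6: 74.0,
            7: 75.2, 8: 75.0, 9: 53.7, 10: 61.0, 11: 76.4, 12: 78.0}

HOLIDAY_WEEKS = {5, 9, 10}  # 28NOV97, 26DEC97, 02JAN98


def fit_line(points):
    """Ordinary least squares fit of y = b + m*x.

    Returns (b, m, f_value), where f_value is the regression mean
    square divided by the residual mean square (n - 2 degrees of
    freedom).
    """
    n = len(points)
    x_mean = sum(x for x, _ in points) / n
    y_mean = sum(y for _, y in points) / n
    sxx = sum((x - x_mean) ** 2 for x, _ in points)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    syy = sum((y - y_mean) ** 2 for _, y in points)
    m = sxy / sxx
    b = y_mean - m * x_mean
    ss_regression = m * sxy
    ss_residual = syy - ss_regression
    f_value = ss_regression / (ss_residual / (n - 2))
    return b, m, f_value


# Delete the holiday weeks; the remaining weeks keep their original
# observation numbers, so the time axis is not compressed.
kept = [(x, y) for x, y in sorted(cpu_busy.items())
        if x not in HOLIDAY_WEEKS]

b, m, f_value = fit_line(kept)
print(f"n = {len(kept)}, b = {b:.2f}, m = {m:.2f}, F = {f_value:.0f}")
```

Keeping the original observation numbers for the remaining
weeks matters: renumbering the nine points 1 through 9 would
change the slope, because the gaps left by the deleted weeks
still represent elapsed time.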

The predicted and residual values for the model that is
developed from the historical series with the three holiday
weeks deleted are shown in the following table.


      Week     Observation     %  CPU      Est      Residual
     Ending      Number         BUSY      % CPU     (error)
    =======    ===========     ======     =====     ========
    31OCT97         1           71.0       71.4       -0.4
    07NOV97         2           72.0       71.9        0.1
    14NOV97         3           72.2       72.5       -0.3
    21NOV97         4           73.8       73.0        0.8
    28NOV97         5            .         73.6        .
    05DEC97         6           74.0       74.2       -0.2
    12DEC97         7           75.2       74.7        0.5
    19DEC97         8           75.0       75.3       -0.3
    26DEC97         9            .         75.9        .
    02JAN98        10            .         76.4        .
    09JAN98        11           76.4       77.0       -0.6
    16JAN98        12           78.0       77.6        0.4

The model developed from the historical data series after the
three holiday weeks were deleted is significantly better than
the model developed before this deletion.  This example shows
the value of deleting errant historical data points.  Note
that the WEEKS timespan is probably more attractive for
building models, since there are often too few monthly
observations for deletion to be practical when the MONTHS
timespan is used.

 [Scatter plot titled HOLIDAY CPU DATA: % CPU BUSY on the
 vertical axis (54 to 81) plotted against OBSERVATION NUMBER
 on the horizontal axis (1 to 12)]

 
 Figure 7-4.  Weekly CPU Utilizations