

9.2.3 Evaluating the Model


The second step in developing a Multivariate Regression
Forecasting model is evaluating how well the model fits the
data.  For this purpose, you can use the same basic
statistics in multivariate regression analysis that you use
in univariate linear regression.  The most commonly used
statistics are the r-squared, F, and p statistics.  These
statistics are interpreted the same way for Multivariate
Regression Forecasting as for univariate linear regression,
although the formulas for calculating them in the
multivariate case are more involved.

As we discussed in Chapter 7, r-squared, F, and p are
interpreted as follows (a short computational sketch follows
the list):

     o  r-squared is the coefficient of determination (also
        referred to as the coefficient of multiple
        determination).  This value indicates what percentage
        of the variability in the historical data is
        explained by the proposed model.  You would typically
        be seeking r-squared values greater than or equal to
        0.7.

     o  F is a statistic calculated from the ratio of the
        sums-of-squares terms for the model.  In general,
        larger F values indicate more reliable models.
        However, the F value has very little significance
        when fewer than 30 historical data points are used to
        develop the model.

     o  p is the probability of obtaining an F value this
        large by chance alone.  A value less than or equal to
        0.01 has been recommended for computer measurement
        applications (SAR79).  You can discount the p value
        if you use a small number of historical observations
        to build the model.
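
If your modeling tool does not report these statistics
directly, they are straightforward to compute.  Below is a
minimal sketch in Python; the NumPy implementation and the
fabricated 40-week sample are our assumptions for
illustration, not part of any particular measurement tool.

   import numpy as np
   from scipy import stats

   # Illustrative historical data: y is the dependent variable,
   # X holds one column per independent variable.
   rng = np.random.default_rng(0)
   X = rng.uniform(10, 100, size=(40, 3))        # 40 weeks, 3 predictors
   y = X @ np.array([1.2, 0.8, 0.5]) + rng.normal(0, 5, 40)

   n, k = X.shape                                # observations, predictors
   Xd = np.column_stack([np.ones(n), X])         # add intercept column
   beta, *_ = np.linalg.lstsq(Xd, y, rcond=None) # least-squares fit

   resid = y - Xd @ beta
   ss_res = np.sum(resid ** 2)                   # residual sum of squares
   ss_tot = np.sum((y - y.mean()) ** 2)          # total sum of squares

   r_squared = 1.0 - ss_res / ss_tot
   f_value = ((ss_tot - ss_res) / k) / (ss_res / (n - k - 1))
   p_value = stats.f.sf(f_value, k, n - k - 1)   # upper-tail probability

   print(f"r-squared = {r_squared:.3f}  F = {f_value:.2f}  p = {p_value:.4g}")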

There are countless tests and statistics that aid the
sophisticated statistician in the evaluation of models.
However, there are three basic tests that you can employ.
Used in concert, these tests should enable you to evaluate
how well the models fit the data relatively quickly and with
little prerequisite mathematical knowledge.


TEST 1 - Examine the r-squared, F, and p statistics.

      We recommend an r-squared value greater than 0.7.  The
      p statistic should be small (we recommend a value of
      0.01 or less).  Like most rules of thumb, these cannot
      be considered inviolate.

      The F value should also be large, although its size is
      relative and depends upon, among other things, the
      number of independent variables.  Because the F value
      is used to calculate the p statistic, it is probably
      better to rely upon the p statistic until you gain
      familiarity with models and their statistics.

      The r-squared, F, and p values are printed on the Model
      Analysis Report.


TEST 2 - Visually examine the data.

      Visual examination of the actual and predicted data is
      very important.  It can sometimes show that a model
      whose r-squared, F, and p statistics look bad may be
      significantly improved by excluding a small number of
      outliers from the data.

      The Multivariate Residual Analysis Report can help you
      visually examine the actual, predicted, and residual
      values.  Ideally the residuals should appear totally
      random.  If you notice a pattern in the residuals, you
      may want to reconsider your choice of independent
      variables.
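
      As a quick illustration, the sketch below plots
      residuals against predicted values.  The data here are
      fabricated; in practice you would use the values from
      the Multivariate Residual Analysis Report or from your
      own fit.

         import numpy as np
         import matplotlib.pyplot as plt

         # Illustrative predicted values and residuals; ideally the
         # residuals show no visible pattern.  A curve, funnel, or
         # trend suggests a poor choice of independent variables.
         rng = np.random.default_rng(1)
         predicted = np.linspace(20, 120, 40)
         resid = rng.normal(0, 5, 40)

         plt.scatter(predicted, resid)
         plt.axhline(0.0, linestyle="--")
         plt.xlabel("Predicted value")
         plt.ylabel("Residual (actual - predicted)")
         plt.title("Residuals should scatter randomly about zero")
         plt.show()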


TEST 3 - Apply the common sense test.

      The common sense test asks, "Does this model make any
      sense?"  To illustrate this point, consider the
      following example.  You decide to analyze a model whose
      dependent variable is CPU utilization, measured in CPU
      hours per zone per week.  The three independent
      variables are the average number of started tasks
      (STC), the average number of batch jobs (BAT), and the
      average number of TSO users (TSO) during the week.

      Suppose that a stepwise regression shows that the best
      r-squared model is

         CPU_HRS = (1.5*TSO) + (0.89*BAT) + (-0.23*STC) + 2.45

      For this model, the r-squared value is 0.82, the
      F value is 323.26, and the p statistic is 0.0012.

      Based on these values, the model passes TEST 1 with
      flying colors.  Furthermore, since visual analysis of
      the actual and predicted data indicates that there
      appear to be no serious outliers, you can consider TEST
      2 a success.

      However, this model fails TEST 3 because the
      coefficient for the STC variable is negative.  If this
      model were truly valid, you would want to run as many
      started tasks as possible on this processor, since each
      additional started task would reduce CPU utilization by
      0.23 CPU hours per week.  Such a notion violates common
      sense, so we must discard this model.

      Situations like the above (in which TESTs 1 and 2 pass
      and TEST 3 fails miserably) most often occur when there
      is insufficient historical data included in the
      evaluation of the model.  This situation can also occur
      if there is duplication among the variables you use in
      the analysis.  For example, predicting total CPU
      utilization based on batch job, TSO session, and IMS
      transaction CPU times can yield negative coefficients
      if the IMS message region batch jobs are not excluded
      from the historical data.  In summary, always apply
      TEST 3.
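
      The sign check itself is easy to automate.  The sketch
      below fits the example model by least squares and flags
      any negative workload coefficient; the variable names
      follow the example above, but the data are synthetic
      and purely illustrative.

         import numpy as np

         # Synthetic weekly observations: TSO users, batch jobs, and
         # started tasks, plus a CPU-hours response built from them.
         rng = np.random.default_rng(2)
         tso = rng.uniform(50, 150, 52)
         bat = 0.6 * tso + rng.normal(0, 5, 52)   # batch tracks TSO
         stc = rng.uniform(20, 40, 52)
         cpu = 1.5 * tso + 0.9 * bat + 0.3 * stc + rng.normal(0, 10, 52)

         X = np.column_stack([np.ones(52), tso, bat, stc])
         beta, *_ = np.linalg.lstsq(X, cpu, rcond=None)

         # TEST 3: every workload should add CPU time, so a negative
         # coefficient is a red flag worth investigating.
         for name, coef in zip(["TSO", "BAT", "STC"], beta[1:]):
             flag = "  <-- fails common sense" if coef < 0 else ""
             print(f"{name}: {coef:+.3f}{flag}")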


ATYPICAL BEHAVIOR

When TEST 3 fails, look for atypical behavior on the system
during the measurement interval.  For example, the model
above shows a negative relationship between started tasks
and overall CPU time.  Perhaps the system programmers were
testing some new online systems, and they were allowed to
test only during non-peak periods.  Because of this, they
started several online systems whenever the load on the
system dropped off.  In this case, you could isolate the
test online systems as a separate independent variable, as
in the sketch below, and repeat the analysis.
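
A minimal sketch of that split, assuming you can identify
the test systems' started-task counts in your measurement
data (the numbers here are invented for illustration):

   import numpy as np

   # Weekly average started tasks, total and for the identified
   # test online systems; both arrays are illustrative.
   stc_total = np.array([32.0, 30.5, 35.1, 28.9, 33.4])
   stc_test  = np.array([ 4.0,  6.5,  1.1,  8.9,  2.2])

   # Production work and test work become separate independent
   # variables, so the test activity no longer distorts the
   # coefficient for normal started-task load.
   stc_prod = stc_total - stc_test
   print(stc_prod)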

Not all problems with data are the result of atypical
workloads.  A more serious problem that causes TEST 3 to
fail is a violation of the basic assumptions of regression.
In the theoretical formulation of a regression model, we
assume that the independent variables are truly independent
of each other.  This assumption is often violated by
computer system data.  In our example, we use the number of
TSO users, batch jobs, and started tasks.  If TSO is used
for development and test, you would expect increased TSO use
to lead to increased batch usage as the programmers submit
more jobs.

This violation of the regression assumptions, where the
independent variables are correlated with each other, is
known as multicollinearity.  The only way to recover from
multicollinear data is to restructure the data to remove the
correlation.  If the TSO and batch activity are correlated,
you must remove the correlation by creating a new variable
that represents batch jobs not due to TSO.  This variable is
independent of TSO, so you may repeat the analysis with your
basic independence assumption satisfied.
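
Below is a minimal sketch of detecting the correlation and
building such a variable, using synthetic data.  One common
approach (an assumption here, not the only possibility) is
to regress BAT on TSO and keep the residuals, which are
uncorrelated with TSO by construction:

   import numpy as np

   # Synthetic weekly data in which batch work is partly driven
   # by TSO use, producing correlated independent variables.
   rng = np.random.default_rng(3)
   tso = rng.uniform(50, 150, 52)
   bat = 0.6 * tso + rng.normal(0, 5, 52)

   print(f"corr(TSO, BAT) = {np.corrcoef(tso, bat)[0, 1]:.2f}")

   # Fit BAT = a + b*TSO, then take the residuals as the new
   # variable: batch jobs not accounted for by TSO activity.
   A = np.column_stack([np.ones(tso.size), tso])
   coef, *_ = np.linalg.lstsq(A, bat, rcond=None)
   bat_indep = bat - A @ coef

   print(f"corr(TSO, BAT_indep) = "
         f"{np.corrcoef(tso, bat_indep)[0, 1]:.2f}")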