

9.6.1 Capture Ratio Case Study


This case study presents one of several methods for examining
CPU capture ratios.  The method presented here will probably
not produce exactly the same results as other methods,
largely because capture ratio determination is an inexact
science.

In performance analysis and capacity planning, the basis of
any study is the examination of measured data.  In this
discussion, assume that the system being measured is a
processor running the MVS operating system (although the
underlying theory is independent of the type of operating
system).  In an MVS system, basic measurement data is usually
derived from SMF and RMF data, which you store in the CA MICS
Batch and Operation Analyzer (SMF) and Hardware and SCP
Analyzer (RMF) areas respectively.  For this case study we
will narrow our discussion to RMF data and consider the
differences between data stored in the MVS Performance Group
Activity (SCPPGA) File and the CPU Processor Activity
(HARCPU) File.


DEVELOPMENT OF CONCEPTUAL MODEL

If you sum all the processor busy time elements (as stored in
the CPUxBTM variables of the CA MICS CPU Processor Activity
(HARCPU) File) for a specific processor over a specific time
interval, you would arrive at a value of CPU seconds which
you could call T - for total CPU time.  Similarly, if you sum
all the CPU time (as stored in the CA MICS data element
PGACPUTM in the CA MICS MVS Performance Group Activity
(SCPPGA) File) for all performance groups over the same
interval on the same processor, you arrive at a second number
of CPU seconds, which you could call A - for accountable CPU
time.

In capacity planning and performance analysis, the value T is
desirable, for it represents every second of the interval
during which the processor is busy performing some function.
Therefore, this number represents the utilized capacity of
the processor.  Unfortunately, the number T is a very gross
value in that it simply measures how often the processor is
busy, but not what or whose work is being processed.  This is
a function of the input data for the HARCPU File (the RMF
Type 70 record), which contains no detailed analysis of the
workload.

On the other hand, the value A, which is derived from the
SCPPGA File, contains a substantial amount of detail
concerning what workload is being processed and how much
resource is required to process it.  This detail is provided
by performance group, which is one of the principal methods
for subdividing workloads in MVS systems, and which is
provided as input to the CA MICS system through the RMF Type
72 records.  The value A is also very desirable in capacity
planning and performance analysis studies, for this data can
pinpoint who is doing the work, not just how much work is
being done.

If the values of A and T were equal over the exact same time
intervals, there would be no problem or need for further
analysis.  The numbers are, in fact, not equal.  This
inequality is the basis for all the mystery surrounding the
analysis of capture ratios.  It is always the case that A is
less than T, and usually by a substantial margin.  You can
formulate this situation using the equation below:

               T  =  A  +  U                          (Eqn 6)

In Equation 6, you can think of the variable U as the
uncaptured CPU time.  That is, U is the number of CPU seconds
during the interval that were not accounted for by any
performance group active on the system.  Nevertheless, these
CPU seconds are expended performing services for active
tasks.  Examples of such unaccounted services include some
paging and swapping operations, some SVC processing, some I/O
operations, and so on.  See the data element $SMFCPU for a
more detailed discussion of accountable CPU time.
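Equation 6 can be exercised with hypothetical interval totals
to show how U and the overall capture ratio fall out.  Python
is used here purely as an illustrative tool, and the numbers
are invented:

```python
# Equation 6 with hypothetical interval totals (CPU seconds).
# The numbers are invented for illustration only.
T = 3600.0   # total CPU busy time, summed from HARCPU CPUxBTM values
A = 3060.0   # accountable CPU time, summed from SCPPGA PGACPUTM values

U = T - A                 # uncaptured CPU time (Eqn 6 rearranged)
overall_capture = A / T   # fraction of busy time accounted for
```

Here 540 of the 3600 busy seconds are uncaptured, for an
overall capture ratio of 0.85.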


APPLICATION OF CONCEPTUAL MODEL

The analysis is more complicated than Equation 6 leads you to
believe.  The accountable time (A) is not viewed as a single
number, but as separate accountable CPU times for a certain
number of workloads that subdivide the entire processor
workload.  Consider the following example.

The system being examined is SYS1 (this is the CA MICS SYSID
value).  This system processes a varied workload, which can
be divided by performance group.  The workloads are a TSO
subsystem, an IMS subsystem, a CICS subsystem, and a BATCH
subsystem.  There are also system-related tasks that must
execute to provide an array of common services, such as GRS,
JES2, VTAM, TCAS, and RMF.

The following performance group assignments are made in the
IEAIPS01 member of SYS1.PARMLIB:

    TSO: 2, 40

    BATCH: 1, 3, 16, 23

    CICS: 9

    IMS: 6, 11, 14, 18

    SYSTEM TASKS: 4, 5, 7, 8, 10, 12

(The fact that these performance groups do not define a
contiguous group is of no concern.  The performance groups
listed above constitute ALL performance groups currently
running on SYS1--not including performance group 0.)

Previously, the capacity planner established two resource
element files to contain CPU time information concerning
these groups and processor use as a whole.  These files are
called:

    TOT - Total Usage File
    WKL - Workload Usage File

In our example, the total CPU time (which we called T above)
is represented in the TOT file by the data element TOTCPUTM.
The accountable CPU time is broken into several subsystem CPU
times.  These are represented in the WKL file UVWKLC01 by the
elements TSOCPUTM, BATCPUTM, IMSCPUTM, CICCPUTM, and
SYSCPUTM.  The sum of these five elements is the accountable
CPU time (which we called A above).

The formula is not the simple one of Equation 6, but the
slightly more complex one of Equation 7:


TOTCPUTM  =  TSOCPUTM  +  BATCPUTM  +  IMSCPUTM  + CICCPUTM +

                      SYSCPUTM + U                    (Eqn 7)


IMPACT OF UNCAPTURED SYSTEM TIME

As we noted previously, U, the uncaptured CPU time, is due to
system functions such as paging, swapping, SVC processing,
etc.  Such functions are not used in equal proportions by all
MVS subsystems.  For example, TSO tends to use
paging/swapping services more heavily than do batch or IMS.
And TSO, IMS, and CICS tend to be heavier SVC users than
batch.  On the other hand, batch tends to use I/O more
heavily than online systems.

Consequently, the proportion of the value U that is
contributed by the TSO subsystem may be much different from
that contributed by the IMS system and this is similarly true
of all the subsystems.  Furthermore, the amounts of U
contributed by each subsystem are also completely dependent
upon the configuration of the individual MVS system.  For
example, subsystems that are heavily storage-fenced
contribute much less to the amount U than unfenced subsystems
do; processors with expanded storage contribute different
amounts to the U value than the same subsystem on a
comparable processor using DASD paging/swapping systems; and
the MVS release can cause differences in the apportionment.

Thus, even Equation 7 does not adequately describe the
situation, and so we need to formulate Equation 8:


   TOTCPUTM = m1 * TSOCPUTM + m2 * BATCPUTM + m3 * IMSCPUTM

        + m4 * CICCPUTM + m5 * SYSCPUTM + b           (Eqn 8)


The values m1 ...  m5 are multipliers that are applied to the
various CPU times.  The difference (for example, (m1 *
TSOCPUTM) - TSOCPUTM) is the amount of uncaptured CPU time
that has been contributed by the subsystem.  One can see that
this capture ratio problem has now been stated in the form of
a multilinear equation, which will therefore lend itself to
solution by multilinear regression.
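The regression fit can be sketched in Python (an illustrative
choice; a study like this would typically be run with a
statistical package).  The multipliers and intercept below
are assumed, synthetic values, used only to show that
ordinary least squares recovers the coefficients of
Equation 8 when the model holds:

```python
import numpy as np

# Sketch of fitting Equation 8 by least squares.  The data is
# synthetic: the multipliers and intercept are assumed values.
# A real study would read the subsystem CPU times from the
# WKL file instead of generating them.
rng = np.random.default_rng(0)
n = 200

# Columns: TSOCPUTM, BATCPUTM, IMSCPUTM, CICCPUTM, SYSCPUTM
X = rng.uniform(100.0, 1000.0, size=(n, 5))

true_m = np.array([1.40, 1.10, 1.25, 1.30, 1.05])  # assumed m1..m5
true_b = 50.0                                      # assumed intercept b
y = X @ true_m + true_b                            # TOTCPUTM per interval

# Append a column of ones so the intercept b is fitted along
# with m1..m5 in a single least-squares solve.
design = np.column_stack([X, np.ones(n)])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
m, b = coef[:5], coef[5]

# The inverses of the multipliers are the capture ratios.
capture_ratios = 1.0 / m
```

With noiseless synthetic data the solver recovers the assumed
coefficients exactly; with real measurement data the fit is
only an estimate and must pass the tests discussed below.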

The values m1 through m5 are sometimes called the
multipliers.  Their inverses are called capture ratios; that
is, the value 1/m1 is the capture ratio for the TSO
subsystem, while 1/m4 is the capture ratio for the CICS
subsystem, and so on.  The
reason that 1/m1 (for example) is referred to as the capture
ratio is that it can be interpreted as the proportion of
total TSO CPU utilization accounted for in the CA MICS MVS
Performance Group Activity (SCPPGA) File.

From the above discussion, you can draw some conclusions
about the values m1, ..., m5 and b.  First, the values m1,
..., m5 should all be positive values.  Furthermore, from our
discussion in the previous two paragraphs, you can see that
the values m1, ..., m5 must be greater than 1.  These are
important observations, for if a model that we are evaluating
results in m values that are less than 1, this model will
fail our third test - the common sense test.
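The common sense test on the multipliers can be reduced to a
one-line check (the function name is our own, not part of CA
MICS):

```python
# A minimal sketch of the "common sense" test described above.
def passes_common_sense_test(multipliers):
    # Every multiplier must exceed 1: each subsystem's true CPU
    # cost is at least the CPU time accounted to it.
    return all(m > 1.0 for m in multipliers)

ok = passes_common_sense_test([1.40, 1.10, 1.25, 1.30, 1.05])
bad = passes_common_sense_test([1.40, 0.90, 1.25, 1.30, 1.05])
```

A model whose fitted multipliers fail this check should be
discarded or reformulated, regardless of how well it fits.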


INTERPRETATION OF MODEL

Sections 9.6.1.1 through 9.6.1.3 explain each of the reports
in our case study.  These reports represent the final model
that is evaluated after several preliminary attempts are
discarded.  In the preliminary models, the variable SYSCPUTM
was retained as an independent variable, which always
resulted in an m value for SYSCPUTM less than 1.  Finally,
SYSCPUTM was eliminated as an independent variable, which
resolved the problem.

Why did this situation happen?  SYSCPUTM represents the CPU
utilization of all the common system services on the
processor.  As it turns out, the values of SYSCPUTM are
related to the values of the other independent variables in a
linear fashion.  This situation, in which one or more of the
independent variables are related to each other by a
basically linear relationship, is termed collinearity, and
invariably has a bad effect on multilinear regression models.
It usually reveals itself as a model that fails test 3 - the
common sense test.  You must be cautious of hidden linear
relationships between the independent variables.  To
eliminate such relationships, either combine the related
workloads into a single "lump" workload, or drop one of the
workloads altogether, which is the method we use in our
example.
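Hidden linear relationships can be screened for before
fitting.  In the synthetic sketch below, SYSCPUTM is
deliberately constructed as a linear combination of the other
workloads (the coefficients are invented), mimicking the
situation described above; the rank of the design matrix then
exposes the dependence:

```python
import numpy as np

# Synthetic screening example: SYSCPUTM is built as a linear
# combination of the other workloads (invented coefficients).
rng = np.random.default_rng(1)
n = 200
tso = rng.uniform(100.0, 1000.0, n)
bat = rng.uniform(100.0, 1000.0, n)
ims = rng.uniform(100.0, 1000.0, n)
cic = rng.uniform(100.0, 1000.0, n)
sys_tasks = 0.10 * tso + 0.05 * bat + 0.08 * ims + 0.07 * cic

design = np.column_stack([tso, bat, ims, cic, sys_tasks, np.ones(n)])

# A rank-deficient (or nearly rank-deficient) design matrix
# signals a hidden linear relationship among the independent
# variables: here the rank is 5, not the full 6 columns.
rank = np.linalg.matrix_rank(design)
rank_deficient = rank < design.shape[1]
```

When such a deficiency appears, drop or combine one of the
involved workloads before fitting, as described above.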


PROBLEMS OF INTERPRETATION

Think of the value b as a "noise" factor (that is, the CPU
time the operating system consumes within the interval even
when it has no work to run).  The situation in which the
value of b is rather large for certain operating system
releases is sometimes referred to as the low utilization
effect.  This means that at low CPU utilization rates, the
operating system tends to consume a large number of CPU
cycles even though its only task is to search for more work
to do.

Because the value of b represents the amount of CPU busy time
consumed in a system with no load upon it, this value can
sometimes be unreliable when determined by multilinear
regression techniques.  This is because most production
systems rarely, if ever, experience periods with no load upon
them (systems with no load for substantial periods of time
are usually halted for economic reasons).  Therefore, it is
unlikely that there will be many, if any, historical data
points representing a no-load period.  At low utilization,
total CPU busy time tends to be a non-linear function of the
subsystem CPU times (for example, BATCPUTM, TSOCPUTM), and
therefore a linear model may badly estimate the value b while
providing very good predicted values for total CPU use at
higher utilization levels.  In short, do not blindly trust
the value b determined in a multilinear regression model
without more investigation.
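The unreliability of the fitted intercept can be demonstrated
with an invented "true" relationship: linear at high load,
with extra operating system overhead that appears only at low
load.  Because the historical data covers only the
high-utilization range, the straight-line fit misplaces b:

```python
import numpy as np

# Invented true relationship: linear at high load, plus
# overhead that grows as utilization drops toward zero.
def true_busy(a):
    return 10.0 + 1.2 * a + 15.0 * np.exp(-a / 10.0)

a_hist = np.linspace(40.0, 100.0, 50)   # measured points: high load only
t_hist = true_busy(a_hist)

# Ordinary straight-line fit over the measured range.
slope, fitted_b = np.polyfit(a_hist, t_hist, 1)

true_b = true_busy(0.0)   # where the real curve meets the Y axis
# fitted_b lands near 10, well below the true intercept of 25,
# just as b sits below "true b" in Figure 9-8.
```

The fitted line predicts total busy time well inside the
measured range while badly missing the intercept, which is
exactly the hazard described above.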

Figure 9-8 illustrates a typical case of low utilization
effects.  The asterisks (*) illustrate the placement of
historical data -- mostly in the upper utilization areas --
while the plus signs (+) illustrate the points of the linear
regression model.  The point labeled b shows where the
regression predicts the intercept to be, while the point
labeled true b shows where the real utilization curve
intersects the Y axis.  The true utilization curve departs
from the predicted linear model at around 20% utilization in
our illustration, and runs along the curve depicted with
minus signs (-).



   100 +                              *     +
       |                          *      +  *    *
       |                    *   *   *+  *  *   *
       |                          +    *  *
    T  |                    *  +   *    *   *  *
       |             *  *   +  *      *
       |                 +  *
       |          ----+  *
 true  |      ---- +
  b->  -------  +
       |     +
       |  +
  b -> +
       |
     0 +------------------------+-----------------------+
       0                       50                      100
                   A

 Figure 9-8.  The Low Utilization Effect


PREDICTIONS OUTSIDE THE RANGE OF MEASUREMENT DATA

The problem introduced by low utilization effects is really a
specific instance of a more widespread problem in
multivariate regression analysis.  This is the problem of
predicting values of the dependent variable outside the range
of the independent variables.  The most basic underlying
assumption in the use of multilinear regression analysis is
that the relationship exhibited between the dependent and
independent variables is linear.  While this may be
essentially true in the range of the independent variables,
it may not be true outside of that range.  The technique of
multivariate regression has no way of predicting, or even
warning, of potential non-linearity outside of the range of
the variables.  It is your own experience and knowledge that
must provide the guidelines in these situations.

Now that we have expanded on some of the theory of capture
ratios, and have defined the case study, the next sections
explore the results of the reports that are generated from
the multilinear regression model we developed for our capture
ratio study:

     1 - Control Parameters
     2 - Model Analysis Report
     3 - Residual Analysis Report