2.2.2 Performance Analysis


The VM control program views the real machine as consisting
of four subsystems:  CPU, storage, paging, and I/O. The
performance of the total system and of individual virtual
machines is determined by the utilization of these
subsystems.  Understanding significant measurements of these
subsystems and the relationships among them is the first step
in analyzing VM performance.

VM controls the subsystems of the real machine, distributing
them to the workload as required.  The workload in a VM
system consists of the applications and operating systems of
the virtual machines that are logged on to real machines at
any given moment.  To evaluate VM performance, it is first
important to characterize the workload of the virtual
machines and to understand the basic elements of CPU
utilization.


WORKLOAD DISTRIBUTION

A workload profile can be obtained by looking at a count of
active users by hour (Figure 2-1).  An active user is a
virtual machine that has consumed some CPU time during the
measurement interval.  (The default interval is 60 seconds,
and the intervals are summarized to show the count by hour.)
The active user count shows the workload that existed during
the measurement interval.
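
As a minimal sketch of this summarization, the following
Python fragment groups per-interval active user counts into
hourly figures.  The (seconds, count) sample layout, the
function name hourly_active_users, and the use of the hourly
average are illustrative assumptions; the actual record
format and summarization rule may differ.

```python
from collections import defaultdict
from statistics import mean


def hourly_active_users(samples):
    """Average the per-interval active user counts within each hour.

    samples: iterable of (seconds_since_midnight, active_user_count),
    one entry per measurement interval (an assumed layout, not an
    actual record format).
    Returns {hour: average active users}, ordered by hour.
    """
    by_hour = defaultdict(list)
    for seconds, count in samples:
        by_hour[seconds // 3600].append(count)
    return {hour: mean(counts) for hour, counts in sorted(by_hour.items())}


# Three 60-second intervals in hour 9 and two in hour 10 (made-up data).
samples = [(32400, 40), (32460, 50), (32520, 60), (36000, 90), (36060, 110)]
profile = hourly_active_users(samples)
```

Using the hourly maximum instead of the average would
emphasize short peaks; either choice yields a profile of the
kind shown in Figure 2-1.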



          A    200  ______________________________
          C        |-                             |
          T        |-                             |
          I        |-                /|           |
          V        |-               /  \          |
          E    100 |-   - - - - - - - - - - - - - |
                   |-        / \  /      |        |
                   |-       /   \/       \        |
          U        |-      /              |       |
          S        |-     /               \       |
          E      0 |-____/                 \___   |
          R        |_|____|____|____|____|________|
          S          0   5   10    15    20

                               HOUR



 Figure 2-1.  Active User Count by Hour

Typically, there will be periods of very few active users
(where the system resource consumption is very low).
Alternatively, there may be periods of high demand, with many
active users.  As the workload increases (periods of high
demand begin to rise), the count of active users can be
plotted against other variables to determine where
bottlenecks occur, as we will see in the following sections.


CPU UTILIZATION

The utilization of the CPU is quantified by how busy the CPU
was during the measurement interval.  In a VM system, the CPU
utilization measurement known as "total CPU" is made up of
two elements:  virtual time and overhead time. Virtual time
is that portion of the total time that is directly consumed
by the virtual machine(s).  Overhead time consists of the
time that is not attributable to any virtual machine (such as
I/O simulation, interrupt processing, and time consumed on
behalf of a virtual machine).  The portion of the interval
that the CPU was not performing any work is known as a "wait"
state.
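
The relationship among these elements can be sketched as
simple arithmetic.  The fragment below assumes a single real
processor and CPU times given in seconds for one measurement
interval; the function and key names are illustrative only.

```python
def cpu_breakdown(virtual_sec, overhead_sec, interval_sec):
    """Express virtual, overhead, total, and wait time as percentages
    of one measurement interval (single processor assumed)."""
    total_sec = virtual_sec + overhead_sec  # "total CPU" = virtual + overhead
    if total_sec > interval_sec:
        raise ValueError("CPU time exceeds the interval length")

    def pct(seconds):
        return 100.0 * seconds / interval_sec

    return {
        "virtual": pct(virtual_sec),
        "overhead": pct(overhead_sec),
        "total": pct(total_sec),
        "wait": pct(interval_sec - total_sec),  # CPU performed no work
    }


# Made-up values for a 60-second interval.
breakdown = cpu_breakdown(virtual_sec=30, overhead_sec=12, interval_sec=60)
```

Here total CPU is 70 percent of the interval (50 percent
virtual plus 20 percent overhead), leaving 30 percent in the
wait state.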

Figure 2-2 shows an example of total and virtual CPU
utilization plotted against active users.  The figure shows
the load capacity of the system.  Saturation occurs at the
point at which the workload is increasing but virtual CPU
utilization is level or decreasing.  This occurs because the
overhead that is needed to support the workload consumes
those CPU cycles that otherwise would have been consumed by
the virtual machine(s).

          100 ____________________________
              |-                         |
    %      90 |-  +----------+         * |
              |-  | LEGEND   |        *  |
    C      80 |-  | * Total  |       *   |
    P         |-  | o Virtual|   * *     |
    U      70 |-  +----------+ *         |
              |-              *          |
    U      60 |-             *           |
    T         |-           *             |
    I      50 |-         *               |
    L         |-                         |
    I      40 |-       *        o        |
    Z         |-             o     o     |
    A      30 |-      *    o             |
    T         |-          o          o   |
    I      20 |-    *   o              o |
    O         |-     o                   |
    N      10 |- * o                     |
              |- o                       |
            0 |__________________________|

                 0         50         100

                   ACTIVE USERS



 Figure 2-2.  Total and Virtual CPU Utilization
                     vs. Active Users
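
One way to locate a saturation point in data of this kind is
to find the workload level at which the plotted measurement
stops rising.  The sketch below is an assumption about how
that might be done, not a description of any product
function; the name saturation_point and the sample values
are hypothetical.

```python
def saturation_point(points, tolerance=0.0):
    """Return the first (active_users, value) pair after which the value
    no longer rises by more than `tolerance`, or None if it always rises.

    points: iterable of (active_users, value) pairs, for example
    virtual CPU utilization observed at each workload level.
    """
    ordered = sorted(points)  # ascending workload
    for current, following in zip(ordered, ordered[1:]):
        if following[1] - current[1] <= tolerance:
            return current
    return None


# Made-up virtual CPU percentages at increasing active user counts.
virtual_cpu = [(10, 8), (20, 17), (40, 30), (60, 38), (80, 36), (100, 25)]
knee = saturation_point(virtual_cpu)
```

With real measurements, which are noisy, the points would
normally be averaged within ranges of active users before a
test like this is applied.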



Once you have determined your VM system's workload and its
load capacity, you can go on to pinpoint specific subsystems
that are bottlenecked.  You can do this by producing plots of
contention indicators for the four subsystems (CPU, storage,
paging, and I/O) against workload.  These plots not only show
which resource bottlenecks first, but also can help you
determine where you can obtain the most improvement for the
cost.
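
The comparison of contention indicators can be sketched as
follows.  The observation layout, the subsystem names used
as keys, and the 10 percent threshold are assumptions chosen
for illustration; an acceptable threshold depends on the
installation.

```python
def first_bottleneck(observations, threshold_pct=10.0):
    """Find which subsystem first reaches the wait threshold as workload grows.

    observations: list of (active_users, {subsystem: pct_of_users_waiting}).
    Returns (subsystem, active_users) for the lowest workload at which any
    subsystem reaches threshold_pct, or None if none does. If several
    subsystems reach it at the same workload, the one with the highest
    wait percentage is reported.
    """
    for users, waits in sorted(observations, key=lambda obs: obs[0]):
        over = {name: pct for name, pct in waits.items() if pct >= threshold_pct}
        if over:
            return max(over, key=over.get), users
    return None


# Made-up percentages of active users waiting for each subsystem.
observations = [
    (50, {"cpu": 2, "storage": 1, "paging": 3, "io": 4}),
    (100, {"cpu": 4, "storage": 6, "paging": 12, "io": 7}),
    (150, {"cpu": 9, "storage": 15, "paging": 25, "io": 9}),
]
result = first_bottleneck(observations)
```

In this made-up data the paging subsystem is the first to
reach the threshold, at 100 active users.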

This section includes sample plots for two of the subsystems.
Plots for the other subsystems can be generated using the
guidelines discussed here.  Note that the detection of
subsystem bottlenecks should be based both on factual
analysis (as we will show in the following sections) and on
familiarity with normally acceptable conditions.

CPU CONTENTION

Contention for a system resource such as the CPU can be
quantified as the percentage of active users who are waiting
for the processor plotted against total CPU utilization
(Figure 2-3).  This figure indicates that throughout most of
the range of CPU utilization, the percentage of active users
waiting for the CPU was relatively low (less than 5 percent)
and would generally be considered acceptable.



    %          ______________________________
           20 |-                             |
    A         |-                             |
    C      18 |-                             |
    T         |-                             |
    I      16 |-                             |
    V         |-                             |
    E      14 |-                             |
              |-                         *   |
    U      12 |-                             |
    S         |-                             |
    E      10 |- - - - - - - - - - - - -*- - |
    R         |-                             |
    S       8 |-                             |
              |-                             |
    I       6 |-                       *     |
    N         |-                      *      |
            4 |-                     *       |
    C         |-                    *        |
    P       2 |-         * * * * * *         |
    U         |-   * * *                     |
            0 |- *                           |
    W         |__|_____|_____|_____|_____|___|
    A            0     25    50    75   100
    I
    T



 Figure 2-3.  Percent Active Users Waiting for CPU
                  vs. Total CPU Utilization


At the high end of utilization, as greater proportions of
active users are waiting for this resource, it is clearly
evident that the CPU has become a bottleneck.  As the load
increases in a CPU-bound environment, most users will be
found waiting for the CPU as opposed to the other resources.
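
The contention measurement itself is a simple ratio.  The
sketch below computes it per interval and marks intervals
that fall outside the 5 percent level described above as
generally acceptable; the sample layout and function name
are hypothetical.

```python
def cpu_contention(samples, acceptable_pct=5.0):
    """Compute the percentage of active users waiting for the CPU.

    samples: (total_cpu_pct, users_waiting_for_cpu, active_users) tuples,
    one per measurement interval.
    Returns (total_cpu_pct, pct_waiting, acceptable) tuples in ascending
    order of total CPU utilization.
    """
    rows = []
    for total_cpu, waiting, active in sorted(samples):
        pct_waiting = 100.0 * waiting / active if active else 0.0
        rows.append((total_cpu, pct_waiting, pct_waiting < acceptable_pct))
    return rows


# Made-up intervals: contention stays low until the CPU nears saturation.
samples = [(25, 1, 50), (50, 2, 80), (75, 2, 100), (95, 13, 100)]
rows = cpu_contention(samples)
```

The same ratio, computed with users in storage wait, page
wait, or I/O wait, gives the contention indicators for the
other subsystems.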

VM favors the interactive users and will continue to deliver
acceptable response.  Therefore, when the CPU cannot satisfy
the demand, the performance degradation will be felt most by
non-interactive users.


STORAGE CONTENTION

At low system utilization levels, there is usually an
adequate quantity of main storage to satisfy the workload.
As the workload increases, however, storage contention
occurs.

Sufficient main storage is essential in a VM system so that
good and consistent interactive response can be delivered.
If the available main storage is insufficient to satisfy the
demand of the workload, paging results and virtual machines
will incur a wait before being dispatched.

Virtual machines are classified as eligible when they are
waiting for main storage.  The percentage of active users in
storage wait is
the measurement that quantifies the contention for main
storage.  Plotting this measurement against the total number
of active users can aid in determining whether a storage
bottleneck exists.


PAGING CONTENTION

The paging resource in VM is treated as an extension of
storage.  When sufficient storage to handle the workload
exists, paging is typically inactive.  However, as the
workload increases, paging activity also increases.  The
speed of the primary paging devices, the number of paging
slots available, and the "robustness" of the paging resource
can have a tremendous impact on virtual machine performance.

A paging bottleneck is likely to be revealed by the number of
active users who are in page wait.  This can be observed by
plotting the paging rate against the users who are waiting for
pages.  Figure 2-4 illustrates this.  Again, the saturation
point occurs where the workload (number of active users) is
increasing and the paging rate is level or decreasing.

             300  ______________________________
                 |-                             |
       P         |-                   * *       |
       A         |-                 *     *     |
       G     200 |- - - - - - - - -*- - - - * - |
       E         |-              *              |
                 |-            *                |
                 |-          *                  |
       R     100 |- - - - - * - - - - - - - - - |
       A         |-      *                      |
       T         |-    *                        |
       E       0 |-*                            |
                 |_|_____|_____|______|______|__|
                   0    50    100    150    200

                            ACTIVE USERS



 Figure 2-4.  Paging Rate vs. Active Users Waiting for Pages


In addition, systems with mixed paging devices should be
examined to determine when workload levels cause the paging
to overflow to slower devices.  When the paging overflows
from fixed-head devices to moving-head devices, there may be
interference with the normal disk I/O speed that users
experience.  A paging bottleneck in a VM system is typically
accompanied by inconsistent and painfully slow response
times.


I/O CONTENTION

The I/O resource processes virtual-machine-initiated I/O.  As
a result of this, the I/O rate is sometimes used as a
measurement of virtual machine throughput.  As the workload
increases, I/O bottlenecks can occur due to channel, control
unit, and device saturation.  In addition, configuration
imbalances also can cause I/O contention.

Plotting the I/O rate against workload can show the
saturation point at which the I/O rate decreases as the
workload increases.  Plotting the percentage of active users
in I/O wait against the active user count also can aid in
detecting I/O contention.

RELATIONSHIPS AMONG RESOURCES

The supply of, or demand for, one subsystem can affect the
utilization of others.  Removing the bottleneck for one
subsystem by eliminating contention or decreasing the workload
can lead to an unanticipated bottleneck somewhere else.
Latent demand that is suddenly manifested can cause subsystem
resource consumption at unexpected rates (for example, the removal
of a storage bottleneck can lead to a significant increase in
CPU utilization).


RESPONSE INDICATOR

Collecting and analyzing true response time measurements of a
VM system requires considerable overhead.  As a less costly
alternative, the time virtual machines spend in queue and in
the eligible list waiting to get into queue provides a good
measurement of the responsiveness of the system as a whole.

The measurements discussed here are normal measurements of a
VM system.  However, it must be understood that these
measurements are based on internal scheduler variables and
algorithms, which means that a transaction as viewed by the
scheduler will probably not be the same as a transaction as
viewed by an individual sitting at a terminal.  The scheduler
views a transaction as the work required for a virtual
machine to voluntarily drop from queue (complete the task at
hand).  The end user views a transaction as the time from the
pressing of the ENTER key to the completed response at the
terminal.  The scheduler may be required to perform several
of its transactions to complete one end user transaction.

It is important to note that scheduler algorithms change from
release to release of VM; these changes are how VM
performance has been improved.  Each release has included
major revisions, which means that the measured system
responsiveness may change from release to release of VM and
should be re-measured as each new release is installed.

Finally, if your VM system uses VTAM or PVM to communicate
with terminals (local or remote), additional software is in
the response path between the end user and VM.  This
elongates the actual response time in a manner that cannot be
measured by the scheduler.


THE NEXT STEP

Performing basic performance analysis of a VM system can be
summarized as finding the answers to two questions:

1.  Is the system performing well based on past history?

2.  Which of the four subsystems -- CPU, storage, paging, or
    I/O -- seems to be bottlenecked?
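
The first question can be approached by comparing a current
measurement with the same measurement from earlier periods.
The sketch below is one possible test, assuming a simple
rule of two standard deviations from the historical mean;
the rule, the function name, and the sample values are
illustrative assumptions and not a recommended standard.

```python
from statistics import mean, stdev


def deviates_from_history(history, current, max_sigmas=2.0):
    """Return True if `current` lies more than `max_sigmas` sample
    standard deviations away from the mean of `history`."""
    if len(history) < 2:
        raise ValueError("need at least two historical values")
    center = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current != center
    return abs(current - center) > max_sigmas * spread


# Made-up history: percent of active users in CPU wait at the peak hour.
history = [4.0, 5.0, 6.0, 5.0, 5.0]
normal_day = deviates_from_history(history, 5.5)
unusual_day = deviates_from_history(history, 9.0)
```

A value flagged in this way indicates only that conditions
differ from past history; familiarity with normally
acceptable conditions is still needed to judge whether the
difference matters.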

Once you have identified a bottleneck, the next course of
action is to perform a more detailed analysis of the suspect
resource.