

11.4.2 Scaling


One problem that is introduced by the use of statistical
clustering algorithms is scaling.  In the clustering example
discussed in the previous section, we were careful to select
a similar numerical range for the X and Y variables.
Unfortunately, actual workload data presents a much wider
range of values.

Consider the problem of developing job classes.  The job
classes in this example are based on CPU minutes and print
lines.  The maximum value observed for CPU minutes would be
far less than 1,000 minutes, while the likelihood of a job
printing more than 1,000 lines is very high.  Because the
differences in these features are squared in the geometric
distance equation, differences in print lines would dominate
the result and the number of CPU minutes used by a job would
appear to be insignificant.  Therefore, the variables must be
scaled prior to the clustering process so that relative
differences between the features have a similar influence on
the geometric distance calculation.
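
The effect can be seen in a minimal Python sketch.  The three
jobs and their values are hypothetical and chosen only for
illustration; the distance function is the ordinary Euclidean
(geometric) distance.

```python
import math

# Hypothetical jobs: (CPU minutes, print lines). Illustrative values only.
job_a = (2.0, 1000.0)
job_b = (60.0, 1200.0)   # 30 times the CPU time of job_a
job_c = (2.0, 1300.0)    # same CPU time as job_a, more print lines


def euclidean(p, q):
    """Geometric distance between two unscaled resource vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


d_ab = euclidean(job_a, job_b)
d_ac = euclidean(job_a, job_c)

# Without scaling, job_a looks closer to job_b than to job_c even
# though job_b uses 30 times as much CPU.
print(d_ab < d_ac)
```

In this sketch the print-line difference alone decides which
jobs appear similar; the CPU difference barely registers.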

A number of different scaling techniques can be employed. Two
of the most popular are unit scaling and Z statistic scaling.


UNIT SCALING

Unit scaling is a simple technique for solving the problems
introduced by the different ranges of values associated with
each of the resource vector's features.  Most simply stated,
unit scaling maps the range of values associated with each
feature into the range 0 to 1.  Equation 3 details this
mapping:

                  x  - x
                   i    min
          s  = ---------------                        (Eqn 3)
           i     x    - x
                  max    min

          where: s     is the scaled value of x
                  i                            i

                 x     is the i-th observation of the x feature
                  i

                 x     is the minimum of all x
                  min                         i

                 x     is the maximum of all x
                  max                         i

Although this transformation solves the problems introduced
by the different ranges of values associated with the
features, it does not provide any assistance in estimating a
value for the objective radius.
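
Equation 3 can be sketched in Python as follows.  The function
name unit_scale and the sample values are assumptions made for
illustration, as is the choice to reject a feature whose
observations are all identical (Equation 3 is undefined in
that case).

```python
def unit_scale(values):
    """Map each observation into the range 0 to 1 (Eqn 3)."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # Assumption: a constant feature is treated as an error,
        # because the denominator of Eqn 3 would be zero.
        raise ValueError("cannot unit-scale a constant feature")
    return [(x - x_min) / (x_max - x_min) for x in values]


cpu_minutes = [2.0, 10.0, 60.0]      # hypothetical observations
scaled = unit_scale(cpu_minutes)     # smallest maps to 0, largest to 1
print(scaled)
```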


Z STATISTIC SCALING

Z statistic scaling is another technique that has been
employed in workload characterization studies.

Equation 4 details Z statistic scaling:


                  x  - x
                   i    bar
          z  = ---------------                        (Eqn 4)
           i        x
                     std

      where:  z      is the Z statistic for x .
               i                             i

              x      is the i-th observation of the x feature.
               i

              x      is the mean value of the x
               bar      distribution.

              x      is the standard deviation of the x
               std      distribution.


The Z statistic transforms each x observation into the number
of standard deviations that it lies from the mean value.  As
a result, the great majority of the scaled values fall in the
range -3.5 to +3.5.  Since a standard deviation can be
considered a measure of a "significant difference" within a
distribution, you can establish the value for the objective
radius in terms of standard deviations.  Since the Z
statistic solves both the problem of scaling the features and
selecting a meaningful value for the objective radius, it was
selected for use in the workload characterization approach
implemented in the CA MICS Capacity Planner.
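
A minimal Python sketch of Equation 4 follows.  The function
name z_scale and the sample values are hypothetical, and the
use of the population form of the standard deviation is an
assumption; this is not the Capacity Planner's own code.

```python
import statistics


def z_scale(values):
    """Return the Z statistic of each observation (Eqn 4)."""
    x_bar = statistics.mean(values)
    # Assumption: population standard deviation. The sample form
    # (statistics.stdev) would give slightly smaller Z values.
    x_std = statistics.pstdev(values)
    if x_std == 0:
        raise ValueError("cannot Z-scale a constant feature")
    return [(x - x_bar) / x_std for x in values]


# Hypothetical print-line counts: mean 500, standard deviation 200.
print_lines = [200.0, 400.0, 400.0, 400.0, 500.0, 500.0, 700.0, 900.0]
z = z_scale(print_lines)
print(z)
```

Each scaled value is now a count of standard deviations from
the mean, so an objective radius of, say, 1.0 would correspond
to one standard deviation in every feature.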

One other complication related to scaling is the presence of
outliers in the feature distributions.  With workload
measurement data, it is not uncommon to find that one or two
observations out of every 10,000 are an order of magnitude
different from any of the other observations in the
distribution.  Although the presence of these few
observations changes the mean value by only a few percent,
it has a profound effect on the standard deviation.  If such
an exaggerated standard deviation value is used in the
scaling process, then the body of the data would be
compressed into a narrow range so that the outliers could be
properly scaled.
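
The different sensitivity of the two statistics can be
checked with a short Python sketch.  The data is synthetic and
chosen only for illustration: 9,998 observations between 1 and
10, plus two outliers an order of magnitude larger.  The size
of the effect on real workload data depends on the shape of
the distribution.

```python
import statistics

# Synthetic feature: 9,998 ordinary observations in the range 1 to 10.
body = [float(1 + i % 10) for i in range(9998)]
# Two outliers, an order of magnitude above the largest ordinary value.
with_outliers = body + [100.0, 100.0]

mean_body = statistics.mean(body)
std_body = statistics.pstdev(body)
mean_all = statistics.mean(with_outliers)
std_all = statistics.pstdev(with_outliers)

# Relative change caused by adding only two observations.
rel_mean_change = (mean_all - mean_body) / mean_body
rel_std_change = (std_all - std_body) / std_body

print(rel_mean_change, rel_std_change)
```

In this sketch the two outliers move the standard deviation
proportionally much more than they move the mean.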

To correct the problem introduced by outliers, it is
necessary to exclude the outliers from the calculation of the
population statistics, that is, the mean and standard
deviation.  Previous investigators have trimmed from 1 to 5
percent of the observations from the right-hand tail of the
distribution.  By default, the workload characterization
program trims 2.5 percent of the observations from the
right-hand tail of the distribution prior to the calculation
of the population statistics.  It is important to note that
these observations are not excluded from the workload
characterization study, but are simply excluded from the
calculation of the population statistics.
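
The trimming step can be sketched in Python as shown below.
The function name trimmed_stats, the rounding-down of the
number of trimmed observations, and the sample data are all
assumptions made for illustration; the text does not describe
how the program itself performs these steps.

```python
import statistics


def trimmed_stats(values, trim_fraction=0.025):
    """Mean and standard deviation with the right-hand tail trimmed."""
    ordered = sorted(values)
    # Assumption: the number of trimmed observations is rounded down.
    n_trim = int(len(ordered) * trim_fraction)
    kept = ordered[:len(ordered) - n_trim]
    return statistics.mean(kept), statistics.pstdev(kept)


# Hypothetical feature: 39 ordinary observations and one outlier.
observations = [float(v) for v in range(1, 40)] + [1000.0]

mean, std = trimmed_stats(observations)

# Every observation is still scaled, including the trimmed outlier;
# only the statistics were computed without it.
z = [(x - mean) / std for x in observations]
print(max(z))
```

Note that the outlier remains in the scaled data; it simply
receives a large positive Z value.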

When the feature observations are scaled using the trimmed
mean and standard deviation values, the 97.5 percent of the
values that were retained are generally scaled between -3.5
and +3.5.  The values that are excluded from the calculation
of the population statistics are generally scaled to values
greater than +3.5.  If any resource vector is poorly
represented by the resulting workload characterization, it
is very likely to be one that contains a feature value that
was excluded from the calculation of the population
statistics.

The potentially poor representation of outliers is one of the
primary concerns in the selection of features for the
clustering process.  Criteria for selecting features for
clustering are discussed in Section 11.2.4.