11.4.1 Clustering Algorithms

11. WORKLOAD CHARACTERIZATION › 11.4 Analytic Technique Tutorial › 11.4.1 Clustering Algorithms
11.4.1 Clustering Algorithms


Although there are a wide variety of statistical algorithms
that you can use to develop clusters, one of the simplest and
most popular algorithms is the K-means algorithm (MAC67).  A
variation of the K-means algorithm is employed in the SAS
FASTCLUS procedure used by the CA MICS Capacity Planner
(SAS82b).  To understand how clustering algorithms work, you
must understand some additional definitions:

Resource      An n-tuple of measurements selected to describe
Vector:       a workload element.  For example, you might
              choose to describe jobs in terms of the three
              distinct measurements: CPU time, print lines,
              and tape mounts.  Each job would then be
              represented by a 3-tuple (CPU time, print
              lines, and tape mounts).

Feature:      A component of a resource vector.  In the above
              example, CPU time, print lines, and tape mounts
              are the three features of resource vectors
              describing the workload element, job.

Space:        The n-dimensional coordinate system in which
              resource vectors are defined.

Feature       The mean of all values of a given feature taken
Mean:         over a set of resources.

Feature       The variance (that is, the standard deviations
Variance:     squared) associated with the distribution of
              values of a given feature, taken over a set of
              resource vectors.  Note that the feature with
              the greatest variance defines the longest
              coordinate axis of the space.

Centroid:     The geometric center of the set of all resource
              vectors contained in a space.  The location of
              the centroid is defined by an n-feature vector.
              Its component values are the feature means of
              the resource vectors in the space.

Cluster:      A set of resource vectors in a given space plus
              their centroid.  Note that the centroid of a
              set of vectors need not be one of the original
              vectors in the space.

Geometric     The distance between any two points (that is,
Distance:     resource vectors) in a given space.  It is
              calculated by squaring the difference between
              the points (that is, the feature values) for
              each feature, summing the squares over all
              features, and taking the square root of the
              resulting sum.  The formula for a two-dimension
              geometric distance is better known as the
              Pythagorean theorem.  A formula for calculating
              geometric distance is presented in Equation 1
              below.

Radius:       The greatest geometric distance between a
              centroid and any of its associated resource
              vectors.

Objective     An externally defined parameter which specifies
Radius:       a maximum desired radius for clusters
              identified by a clustering algorithm.

Splitting:    The partitioning of a space into two smaller
              spaces.  A space is split in half along the
              axis of greatest variance.  A clustering
              algorithm will continue to split resource
              vector spaces as long as the radius of any
              cluster in the partition is greater than the
              specified objective radius.

Consider the following two-dimensional clustering example.
Fifteen resource vectors were selected for the example. They
are shown in the following table:


              VECTOR NUM.       X       Y
              ===========     =====   =====
                   1            1       1
                   2            2       1
                   3            3       1
                   4            3       8
                   5            4       7
                   6            4       8
                   7            4       9
                   8            5       8
                   9            8       4
                  10            9       3
                  11            9       4
                  12            9       5
                  13           10       4
                  14           10       8
                  15           11       8


Note: The range of values selected for the two features in
this example are similar to avoid the problems that are
introduced by scaling considerations.  Scaling is discussed
in Section 11.4.2.  The objective radius for this example is
1.

A plot of the resource vectors is shown in Figure 11-5.  The
resource vectors are represented by asterisks (*) and the
centroid of the space is represented by a pound sign (#).


    10 |
       |
     9 |               *
       |
     8 |           *   *   *                   *   *
       |
     7 |               *
       |
     6 |
       |
 Y   5 |                        #          *
       |
     4 |                               *   *   *
       |
     3 |                                   *
       |
     2 |
       |
     1 |   *   *   *
       |
     0 +------------------------------------------------
           1   2   3   4   5   6   7   8   9  10  11  12

                               X


 Figure 11-5.  Initial Space

The first step in the solution process is to calculate the
feature means, variances, and radius for the space.  (The
centroid is defined by the feature means.) The calculated
values are as follows:


                          CENTROID
      SPACE    FEATURE     (MEAN)    VARIANCE   RADIUS
      =====   =========   ========   ========   ======
       all        X         6.13       11.41      6.67
                  Y         5.27        8.50


As mentioned in the definitions above, the radius of the
space is calculated using the geometric distance equation.
This equation states that the distance, D, between the
centroid of the k-th cluster, C(k), and the i-th resource
vector, R(i), is equal to the square root of the sum of the
squares of their feature differences.  The general form of
the geometric distance equation is shown in Equation 1, where
j is the feature number index.


              _                             _
             |       _                    _  | 1/2
             |    n |                    2 | |
         D = | SUM  | ( C(k,j) - R(i,j) )  | |        (Eqn 1)
             |  j=1 |_                    _| |
             |_                             _|

For example, the farthest resource vector from the centroid
of the space shown in Figure 11-5 is the resource vector at
(1,1).  The calculation of the distance between the resource
vector and the centroid is shown in Equation 2.


              _                         _
             |           2             2 | 1/2
         D = | (6.13-1.0)  + (5.27-1.0)  |            (Eqn 2)
             |_                         _|
              _             _
             |               | 1/2
         D = | 26.32 + 18.23 |      =  6.67
             |_             _|


As shown in the previous table, the variance in the X
direction is 11.4 and the variance in the Y direction is
8.50.  Since the radius of the space is greater than the
objective radius, the space is split at the centroid along
the axis of greatest variance.  This results in two spaces, A
and B, as is shown in Figure 11-6.


    10 |
       |   A                    |   B
     9 |               *        |
       |                        |
     8 |           *   *   *    |              *   *
       |                        |
     7 |               *        |
       |                        |
     6 |                        |
       |            #           |            #
 Y   5 |                        |          *
       |                        |
     4 |                        |      *   *   *
       |                        |
     3 |                        |          *
       |                        |
     2 |                        |
       |                        |
     1 |   *   *   *            |
       |
     0 +------------------------------------------------
           1   2   3   4   5   6   7   8   9  10  11  12

                               X


 Figure 11-6.  First Split

The characteristics of the two spaces resulting from the
split are shown in the table below:


                          CENTROID
      SPACE    FEATURE     (MEAN)    VARIANCE   RADIUS
      =====   =========   ========   ========   ======
        A         X         3.25        2.70      4.92
                  Y         5.38       13.41

        B         X         9.33        1.07      3.31
                  Y         5.14        4.14


Once again, the radii of the spaces are greater than the
objective radius.  Hence, each of the spaces is split along
the axis of greatest variance.  This results in four spaces,
A, B, C, and D, as shown in Figure 11-7.  Note that in spaces
A, C, and D the centroids happen to coincide with a resource
vector.


    10 |
       |   A                    |   B
     9 |              *         |
       |                        |
     8 |          *   #   *     |              * # *
       |                        |
     7 |              *         |
       |                        |
     6 |                        |
       | ---------------------- | ---------------------
 Y   5 |                        |          *
       |   C                    | D
     4 |                        |      *   #   *
       |                        |
     3 |                        |          *
       |                        |
     2 |                        |
       |                        |
     1 |   *   #   *            |
       |
     0 +------------------------------------------------
           1   2   3   4   5   6   7   8   9  10  11  12

                               X


 Figure 11-7.  Second Split

The characteristics of the four spaces resulting from the
split are shown in the table below:


                          CENTROID
      SPACE    FEATURE     (MEAN)    VARIANCE   RADIUS
      =====   =========   ========   ========   ======
        A         X          4.0        0.50      1.0
                  Y          8.0        0.50

        B         X         10.5        0.50      0.5
                  Y          8.0        0.00

        C         X          2.0        1.00      1.0
                  Y          1.0        0.00

        D         X          9.0        0.50      1.0
                  Y          4.0        0.50


Since all of the radii are less than or equal to the
objective radii, a solution for the space introduced in
Figure 11-7 has been developed.  Although this problem is
very simplistic, it illustrates the concepts of the K-mean
algorithm.

Thus, four clusters have been identified that represent the
15 resource vectors in the space.  Although the actual
implementation of the clustering algorithm is somewhat more
complex, the same principles apply when thousands of workload
elements are analyzed with the program.