3.3.4.1 Clustering Algorithms

3. PERFORMANCE ANALYSIS TOOLS › 3.3 Data Clustering Analysis › 3.3.4 Analytic Technique Tutorial › 3.3.4.1 Clustering Algorithms
3.3.4.1 Clustering Algorithms


Although there are a wide variety of statistical algorithms
that you can use to develop clusters, one of the simplest and
most popular algorithms is the K-means algorithm (MAC67).  A
hybrid of both the hierarchical and the K-means algorithm is
employed within Neugents technology for clustering
operations.

To understand how clustering algorithms work, you must
understand the following definitions.

Resource      An n-tuple of measurements selected to describe
Vector:       a workload element.  For example, you might
              choose to describe jobs in terms of the three
              distinct measurements: CPU time, print lines,
              and tape mounts.  Each job would then be
              represented by a 3-tuple (CPU time, print
              lines, and tape mounts).  This is also
              sometimes referred to as a usage vector, or
              simply a vector.  Neugents technology uses the
              term "pattern" when describing such a set of
              related measurements.

Feature:      A component of a resource vector.  In the above
              example, CPU time, print lines, and tape mounts
              are the three features of resource vectors
              describing the workload element, job.

Space:        The n-dimensional coordinate system in which
              resource vectors are defined.

Feature       The mean of all values of a given feature taken
Mean:         over a set of resources.

Feature       The variance (that is, the standard deviations
Variance:     squared) associated with the distribution of
              values of a given feature, taken over a set of
              resource vectors.  Note that the feature with
              the greatest variance defines the longest
              coordinate axis of the space.

Centroid:     The geometric center of the set of all resource
              vectors contained in a space.  The location of
              the centroid is defined by an n-feature vector.
              Its component values are the feature means of
              the resource vectors in the space.

Cluster:      A set of resource vectors in a given space plus
              their centroid.  Note that the centroid of a
              set of vectors need not be one of the original
              vectors in the space.

Geometric     The distance between any two points (that is,
Distance:     resource vectors) in a given space.  It is
              calculated by squaring the difference between
              the points (that is, the feature values) for
              each feature, summing the squares over all
              features, and taking the square root of the
              resulting sum.  The formula for a two-dimension
              geometric distance is better known as the
              Pythagorean theorem.  A formula for calculating
              geometric distance is presented in the
              following first equation.

Radius:       The greatest geometric distance between a
              centroid and any of its associated resource
              vectors.

Relative      The ratio of the geometric distance between a
Position:     given resource vector or pattern and the radius
              of the cluster, expressed as a value from 0.00
              to 1.00.  The lower the value of the relative
              position, the closer the particular resource
              vector or pattern is to the cluster center.

Objective     An externally defined parameter that specifies
Radius:       a maximum desired radius for clusters
              identified by a clustering algorithm.

Splitting:    The partitioning of a space into two smaller
              spaces.  A space is split in half along the
              axis of greatest variance.  A clustering
              algorithm will continue to split resource
              vector spaces as long as the radius of any
              cluster in the partition is greater than the
              specified objective radius.
Consider the following two-dimensional clustering example.
Fifteen resource vectors were selected for the example. They
are shown in the following table:


              VECTOR NUM.       X       Y
              ===========     =====   =====
                   1            1       1
                   2            2       1
                   3            3       1
                   4            3       8
                   5            4       7
                   6            4       8
                   7            4       9
                   8            5       8
                   9            8       4
                  10            9       3
                  11            9       4
                  12            9       5
                  13           10       4
                  14           10       8
                  15           11       8


Note: The range of values selected for the two features in
this example are similar. This avoids the problems that are
introduced by scaling considerations.  Scaling is discussed
in Section 15.4.2.  The objective radius for this example is
1.


A plot of the resource vectors is shown in Figure 3-40.  The
resource vectors are represented by asterisks (*) and the
centroid of the space is represented by a pound sign (#).


    10 |
       |
     9 |               *
       |
     8 |           *   *   *                   *   *
       |
     7 |               *
       |
     6 |
       |
 Y   5 |                        #          *
       |
     4 |                               *   *   *
       |
     3 |                                   *
       |
     2 |
       |
     1 |   *   *   *
       |
     0 +------------------------------------------------
           1   2   3   4   5   6   7   8   9  10  11  12

                               X

 Figure 3-40.  Initial Space
The first step in the solution process is to calculate the
feature means, variances, and radius for the space.  (The
centroid is defined by the feature means.) The calculated
values are as follows:


                          CENTROID
      SPACE    FEATURE     (MEAN)    VARIANCE   RADIUS
      =====   =========   ========   ========   ======
       all        X         6.13       11.41      6.67
                  Y         5.27        8.50


As mentioned previously, the radius of the space is
calculated using the geometric distance equation.  This
equation states that the distance, D, between the centroid of
the k-th cluster, C(k), and the i-th resource vector, R(i),
is equal to the square root of the sum of the squares of
their feature differences.  The general form of the geometric
distance equation is shown in Equation 1, where j is the
feature number index.


              _                             _
             |       _                    _  | 1/2
             |    n |                    2 | |
         D = | SUM  | ( C(k,j) - R(i,j) )  | |        (Eqn 1)
             |  j=1 |_                    _| |
             |_                             _|

For example, the farthest resource vector from the centroid
of the space shown in Figure 3-40 is the resource vector at
(1,1).  The calculation of the distance between the resource
vector and the centroid is shown in Equation 2.


              _                         _
             |           2             2 | 1/2
         D = | (6.13-1.0)  + (5.27-1.0)  |            (Eqn 2)
             |_                         _|
              _             _
             |               | 1/2
         D = | 26.32 + 18.23 |      =  6.67
             |_             _|


As shown in the previous table, the variance in the X
direction is 11.4 and the variance in the Y direction is
8.50.  Since the radius of the space is greater than the
objective radius, the space is split at the centroid along
the axis of greatest variance.  This results in two spaces, A
and B, as is shown in Figure 3-41.


    10 |
       |   A                    |   B
     9 |               *        |
       |                        |
     8 |           *   *   *    |              *   *
       |                        |
     7 |               *        |
       |                        |
     6 |                        |
       |            #           |            #
 Y   5 |                        |          *
       |                        |
     4 |                        |      *   *   *
       |                        |
     3 |                        |          *
       |                        |
     2 |                        |
       |                        |
     1 |   *   *   *            |
       |
     0 +------------------------------------------------
           1   2   3   4   5   6   7   8   9  10  11  12

                               X


 Figure 3-41.  First Split

The characteristics of the two spaces resulting from the
split are shown in the following table.


                          CENTROID
      SPACE    FEATURE     (MEAN)    VARIANCE   RADIUS
      =====   =========   ========   ========   ======
        A         X         3.25        2.70      4.92
                  Y         5.38       13.41

        B         X         9.33        1.07      3.31
                  Y         5.14        4.14


Once again, the radii of the spaces are greater than the
objective radius.  Hence, each of the spaces is split along
the axis of greatest variance.  This results in four spaces,
A, B, C, and D, as shown in Figure 3-42.  Note that in spaces
A, C, and D the centroids happen to coincide with a resource
vector.


    10 |
       |   A                    |   B
     9 |              *         |
       |                        |
     8 |          *   #   *     |              * # *
       |                        |
     7 |              *         |
       |                        |
     6 |                        |
       | ---------------------- | ---------------------
 Y   5 |                        |          *
       |   C                    | D
     4 |                        |      *   #   *
       |                        |
     3 |                        |          *
       |                        |
     2 |                        |
       |                        |
     1 |   *   #   *            |
       |
     0 +------------------------------------------------
           1   2   3   4   5   6   7   8   9  10  11  12

                               X


 Figure 3-42.  Second Split

The characteristics of the four spaces resulting from the
split are shown in the following table.


                          CENTROID
      SPACE    FEATURE     (MEAN)    VARIANCE   RADIUS
      =====   =========   ========   ========   ======
        A         X          4.0        0.50      1.0
                  Y          8.0        0.50

        B         X         10.5        0.50      0.5
                  Y          8.0        0.00

        C         X          2.0        1.00      1.0
                  Y          1.0        0.00

        D         X          9.0        0.50      1.0
                  Y          4.0        0.50


Since all of the radii are less than or equal to the
objective radii, a solution for the space introduced in
Figure 3-42 has been developed.  Although this problem is
very simplistic, it illustrates the concepts of the K-mean
algorithm.

Thus, four clusters have been identified that represent the
15 resource vectors in the space.  Although the actual
implementation of the clustering algorithm is somewhat more
complex, the same principles apply when thousands of workload
elements are analyzed with the program.