

11. WORKLOAD CHARACTERIZATION › 11.4 Analytic Technique Tutorial › 11.4.1 Clustering Algorithms
11.4.1 Clustering Algorithms
Although there are a wide variety of statistical algorithms
that you can use to develop clusters, one of the simplest and
most popular algorithms is the K-means algorithm (MAC67). A
variation of the K-means algorithm is employed in the SAS
FASTCLUS procedure used by the CA MICS Capacity Planner
(SAS82b). To understand how clustering algorithms work, you
must understand some additional definitions:
Resource An n-tuple of measurements selected to describe
Vector: a workload element. For example, you might
choose to describe jobs in terms of the three
distinct measurements: CPU time, print lines,
and tape mounts. Each job would then be
represented by a 3-tuple (CPU time, print
lines, and tape mounts).
Feature: A component of a resource vector. In the above
example, CPU time, print lines, and tape mounts
are the three features of resource vectors
describing the workload element, job.
Space: The n-dimensional coordinate system in which
resource vectors are defined.
Feature The mean of all values of a given feature taken
Mean: over a set of resources.
Feature The variance (that is, the standard deviations
Variance: squared) associated with the distribution of
values of a given feature, taken over a set of
resource vectors. Note that the feature with
the greatest variance defines the longest
coordinate axis of the space.
Centroid: The geometric center of the set of all resource
vectors contained in a space. The location of
the centroid is defined by an n-feature vector.
Its component values are the feature means of
the resource vectors in the space.
Cluster: A set of resource vectors in a given space plus
their centroid. Note that the centroid of a
set of vectors need not be one of the original
vectors in the space.
Geometric The distance between any two points (that is,
Distance: resource vectors) in a given space. It is
calculated by squaring the difference between
the points (that is, the feature values) for
each feature, summing the squares over all
features, and taking the square root of the
resulting sum. The formula for a two-dimension
geometric distance is better known as the
Pythagorean theorem. A formula for calculating
geometric distance is presented in Equation 1
below.
Radius: The greatest geometric distance between a
centroid and any of its associated resource
vectors.
Objective An externally defined parameter which specifies
Radius: a maximum desired radius for clusters
identified by a clustering algorithm.
Splitting: The partitioning of a space into two smaller
spaces. A space is split in half along the
axis of greatest variance. A clustering
algorithm will continue to split resource
vector spaces as long as the radius of any
cluster in the partition is greater than the
specified objective radius.
Consider the following two-dimensional clustering example.
Fifteen resource vectors were selected for the example. They
are shown in the following table:
VECTOR NUM. X Y
=========== ===== =====
1 1 1
2 2 1
3 3 1
4 3 8
5 4 7
6 4 8
7 4 9
8 5 8
9 8 4
10 9 3
11 9 4
12 9 5
13 10 4
14 10 8
15 11 8
Note: The range of values selected for the two features in
this example are similar to avoid the problems that are
introduced by scaling considerations. Scaling is discussed
in Section 11.4.2. The objective radius for this example is
1.
A plot of the resource vectors is shown in Figure 11-5. The
resource vectors are represented by asterisks (*) and the
centroid of the space is represented by a pound sign (#).
10 |
|
9 | *
|
8 | * * * * *
|
7 | *
|
6 |
|
Y 5 | # *
|
4 | * * *
|
3 | *
|
2 |
|
1 | * * *
|
0 +------------------------------------------------
1 2 3 4 5 6 7 8 9 10 11 12
X
Figure 11-5. Initial Space
The first step in the solution process is to calculate the
feature means, variances, and radius for the space. (The
centroid is defined by the feature means.) The calculated
values are as follows:
CENTROID
SPACE FEATURE (MEAN) VARIANCE RADIUS
===== ========= ======== ======== ======
all X 6.13 11.41 6.67
Y 5.27 8.50
As mentioned in the definitions above, the radius of the
space is calculated using the geometric distance equation.
This equation states that the distance, D, between the
centroid of the k-th cluster, C(k), and the i-th resource
vector, R(i), is equal to the square root of the sum of the
squares of their feature differences. The general form of
the geometric distance equation is shown in Equation 1, where
j is the feature number index.
_ _
| _ _ | 1/2
| n | 2 | |
D = | SUM | ( C(k,j) - R(i,j) ) | | (Eqn 1)
| j=1 |_ _| |
|_ _|
For example, the farthest resource vector from the centroid
of the space shown in Figure 11-5 is the resource vector at
(1,1). The calculation of the distance between the resource
vector and the centroid is shown in Equation 2.
_ _
| 2 2 | 1/2
D = | (6.13-1.0) + (5.27-1.0) | (Eqn 2)
|_ _|
_ _
| | 1/2
D = | 26.32 + 18.23 | = 6.67
|_ _|
As shown in the previous table, the variance in the X
direction is 11.4 and the variance in the Y direction is
8.50. Since the radius of the space is greater than the
objective radius, the space is split at the centroid along
the axis of greatest variance. This results in two spaces, A
and B, as is shown in Figure 11-6.
10 |
| A | B
9 | * |
| |
8 | * * * | * *
| |
7 | * |
| |
6 | |
| # | #
Y 5 | | *
| |
4 | | * * *
| |
3 | | *
| |
2 | |
| |
1 | * * * |
|
0 +------------------------------------------------
1 2 3 4 5 6 7 8 9 10 11 12
X
Figure 11-6. First Split
The characteristics of the two spaces resulting from the
split are shown in the table below:
CENTROID
SPACE FEATURE (MEAN) VARIANCE RADIUS
===== ========= ======== ======== ======
A X 3.25 2.70 4.92
Y 5.38 13.41
B X 9.33 1.07 3.31
Y 5.14 4.14
Once again, the radii of the spaces are greater than the
objective radius. Hence, each of the spaces is split along
the axis of greatest variance. This results in four spaces,
A, B, C, and D, as shown in Figure 11-7. Note that in spaces
A, C, and D the centroids happen to coincide with a resource
vector.
10 |
| A | B
9 | * |
| |
8 | * # * | * # *
| |
7 | * |
| |
6 | |
| ---------------------- | ---------------------
Y 5 | | *
| C | D
4 | | * # *
| |
3 | | *
| |
2 | |
| |
1 | * # * |
|
0 +------------------------------------------------
1 2 3 4 5 6 7 8 9 10 11 12
X
Figure 11-7. Second Split
The characteristics of the four spaces resulting from the
split are shown in the table below:
CENTROID
SPACE FEATURE (MEAN) VARIANCE RADIUS
===== ========= ======== ======== ======
A X 4.0 0.50 1.0
Y 8.0 0.50
B X 10.5 0.50 0.5
Y 8.0 0.00
C X 2.0 1.00 1.0
Y 1.0 0.00
D X 9.0 0.50 1.0
Y 4.0 0.50
Since all of the radii are less than or equal to the
objective radii, a solution for the space introduced in
Figure 11-7 has been developed. Although this problem is
very simplistic, it illustrates the concepts of the K-mean
algorithm.
Thus, four clusters have been identified that represent the
15 resource vectors in the space. Although the actual
implementation of the clustering algorithm is somewhat more
complex, the same principles apply when thousands of workload
elements are analyzed with the program.
Copyright © 2014 CA.
All rights reserved.
 
|
|