

3. PERFORMANCE ANALYSIS TOOLS › 3.3 Data Clustering Analysis › 3.3.4 Analytic Technique Tutorial › 3.3.4.1 Clustering Algorithms
3.3.4.1 Clustering Algorithms
Although there are a wide variety of statistical algorithms
that you can use to develop clusters, one of the simplest and
most popular algorithms is the K-means algorithm (MAC67). A
hybrid of both the hierarchical and the K-means algorithm is
employed within Neugents technology for clustering
operations.
To understand how clustering algorithms work, you must
understand the following definitions.
Resource An n-tuple of measurements selected to describe
Vector: a workload element. For example, you might
choose to describe jobs in terms of the three
distinct measurements: CPU time, print lines,
and tape mounts. Each job would then be
represented by a 3-tuple (CPU time, print
lines, and tape mounts). This is also
sometimes referred to as a usage vector, or
simply a vector. Neugents technology uses the
term "pattern" when describing such a set of
related measurements.
Feature: A component of a resource vector. In the above
example, CPU time, print lines, and tape mounts
are the three features of resource vectors
describing the workload element, job.
Space: The n-dimensional coordinate system in which
resource vectors are defined.
Feature The mean of all values of a given feature taken
Mean: over a set of resources.
Feature The variance (that is, the standard deviations
Variance: squared) associated with the distribution of
values of a given feature, taken over a set of
resource vectors. Note that the feature with
the greatest variance defines the longest
coordinate axis of the space.
Centroid: The geometric center of the set of all resource
vectors contained in a space. The location of
the centroid is defined by an n-feature vector.
Its component values are the feature means of
the resource vectors in the space.
Cluster: A set of resource vectors in a given space plus
their centroid. Note that the centroid of a
set of vectors need not be one of the original
vectors in the space.
Geometric The distance between any two points (that is,
Distance: resource vectors) in a given space. It is
calculated by squaring the difference between
the points (that is, the feature values) for
each feature, summing the squares over all
features, and taking the square root of the
resulting sum. The formula for a two-dimension
geometric distance is better known as the
Pythagorean theorem. A formula for calculating
geometric distance is presented in the
following first equation.
Radius: The greatest geometric distance between a
centroid and any of its associated resource
vectors.
Relative The ratio of the geometric distance between a
Position: given resource vector or pattern and the radius
of the cluster, expressed as a value from 0.00
to 1.00. The lower the value of the relative
position, the closer the particular resource
vector or pattern is to the cluster center.
Objective An externally defined parameter that specifies
Radius: a maximum desired radius for clusters
identified by a clustering algorithm.
Splitting: The partitioning of a space into two smaller
spaces. A space is split in half along the
axis of greatest variance. A clustering
algorithm will continue to split resource
vector spaces as long as the radius of any
cluster in the partition is greater than the
specified objective radius.
Consider the following two-dimensional clustering example.
Fifteen resource vectors were selected for the example. They
are shown in the following table:
VECTOR NUM. X Y
=========== ===== =====
1 1 1
2 2 1
3 3 1
4 3 8
5 4 7
6 4 8
7 4 9
8 5 8
9 8 4
10 9 3
11 9 4
12 9 5
13 10 4
14 10 8
15 11 8
Note: The range of values selected for the two features in
this example are similar. This avoids the problems that are
introduced by scaling considerations. Scaling is discussed
in Section 15.4.2. The objective radius for this example is
1.
A plot of the resource vectors is shown in Figure 3-40. The
resource vectors are represented by asterisks (*) and the
centroid of the space is represented by a pound sign (#).
10 |
|
9 | *
|
8 | * * * * *
|
7 | *
|
6 |
|
Y 5 | # *
|
4 | * * *
|
3 | *
|
2 |
|
1 | * * *
|
0 +------------------------------------------------
1 2 3 4 5 6 7 8 9 10 11 12
X
Figure 3-40. Initial Space
The first step in the solution process is to calculate the
feature means, variances, and radius for the space. (The
centroid is defined by the feature means.) The calculated
values are as follows:
CENTROID
SPACE FEATURE (MEAN) VARIANCE RADIUS
===== ========= ======== ======== ======
all X 6.13 11.41 6.67
Y 5.27 8.50
As mentioned previously, the radius of the space is
calculated using the geometric distance equation. This
equation states that the distance, D, between the centroid of
the k-th cluster, C(k), and the i-th resource vector, R(i),
is equal to the square root of the sum of the squares of
their feature differences. The general form of the geometric
distance equation is shown in Equation 1, where j is the
feature number index.
_ _
| _ _ | 1/2
| n | 2 | |
D = | SUM | ( C(k,j) - R(i,j) ) | | (Eqn 1)
| j=1 |_ _| |
|_ _|
For example, the farthest resource vector from the centroid
of the space shown in Figure 3-40 is the resource vector at
(1,1). The calculation of the distance between the resource
vector and the centroid is shown in Equation 2.
_ _
| 2 2 | 1/2
D = | (6.13-1.0) + (5.27-1.0) | (Eqn 2)
|_ _|
_ _
| | 1/2
D = | 26.32 + 18.23 | = 6.67
|_ _|
As shown in the previous table, the variance in the X
direction is 11.4 and the variance in the Y direction is
8.50. Since the radius of the space is greater than the
objective radius, the space is split at the centroid along
the axis of greatest variance. This results in two spaces, A
and B, as is shown in Figure 3-41.
10 |
| A | B
9 | * |
| |
8 | * * * | * *
| |
7 | * |
| |
6 | |
| # | #
Y 5 | | *
| |
4 | | * * *
| |
3 | | *
| |
2 | |
| |
1 | * * * |
|
0 +------------------------------------------------
1 2 3 4 5 6 7 8 9 10 11 12
X
Figure 3-41. First Split
The characteristics of the two spaces resulting from the
split are shown in the following table.
CENTROID
SPACE FEATURE (MEAN) VARIANCE RADIUS
===== ========= ======== ======== ======
A X 3.25 2.70 4.92
Y 5.38 13.41
B X 9.33 1.07 3.31
Y 5.14 4.14
Once again, the radii of the spaces are greater than the
objective radius. Hence, each of the spaces is split along
the axis of greatest variance. This results in four spaces,
A, B, C, and D, as shown in Figure 3-42. Note that in spaces
A, C, and D the centroids happen to coincide with a resource
vector.
10 |
| A | B
9 | * |
| |
8 | * # * | * # *
| |
7 | * |
| |
6 | |
| ---------------------- | ---------------------
Y 5 | | *
| C | D
4 | | * # *
| |
3 | | *
| |
2 | |
| |
1 | * # * |
|
0 +------------------------------------------------
1 2 3 4 5 6 7 8 9 10 11 12
X
Figure 3-42. Second Split
The characteristics of the four spaces resulting from the
split are shown in the following table.
CENTROID
SPACE FEATURE (MEAN) VARIANCE RADIUS
===== ========= ======== ======== ======
A X 4.0 0.50 1.0
Y 8.0 0.50
B X 10.5 0.50 0.5
Y 8.0 0.00
C X 2.0 1.00 1.0
Y 1.0 0.00
D X 9.0 0.50 1.0
Y 4.0 0.50
Since all of the radii are less than or equal to the
objective radii, a solution for the space introduced in
Figure 3-42 has been developed. Although this problem is
very simplistic, it illustrates the concepts of the K-mean
algorithm.
Thus, four clusters have been identified that represent the
15 resource vectors in the space. Although the actual
implementation of the clustering algorithm is somewhat more
complex, the same principles apply when thousands of workload
elements are analyzed with the program.
Copyright © 2014 CA.
All rights reserved.
 
|
|