11.4.3 Algorithm Implementation


The workload characterization program consists of these
major steps:

 1 - Selecting data elements (features) from the CA MICS file
     based on user input.

 2 - Randomly selecting 2,000 resource vectors from the
     population selected from the CA MICS database.

 3 - Calculating the trimmed mean and standard deviation
     values for the 2,000 resource vector sample.

 4 - Scaling the resource vector sample using the trimmed
     statistics calculated in step 3.

 5 - Clustering the sample of scaled resource vectors.

 6 - Scaling the population of resource vectors using the
     same trimmed statistics calculated in step 3.

 7 - Assigning the full population to the clusters developed
     for the sample of resource vectors.  This final step
     evaluates how well the population is represented by the
     clusters developed for the randomly selected sample.  If
     the population fits the proposed characterization, then
     the requirements of the first hypothesis have been
     satisfied.

The first step in the program is the selection of the user-
specified elements from a CA MICS database file. The CA MICS
file and database elements (features) are specified with
parameters on the Workload Characterization screen.  The
control parameters are discussed in Section 11.5.

The second step is the random selection of 2,000 resource
vectors from the population selected from the CA MICS
database.  Experience has shown that a sample size of 2,000
is suitable for analyzing populations as large as 500,000
workload elements.  The selection process involves sorting
the resource vectors into order by arrival time.  The samples
are then selected from the time-ordered population using a
uniform sampling rate.  You may specify alternative sample
sizes if the default sample size proves unsuitable.
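
The uniform-rate selection described above can be sketched as
follows.  This is a minimal illustration, not the product's
code: the function name, the tuple layout, and the handling
of a fractional sampling interval are all assumptions.

```python
def uniform_sample(vectors, arrival_key, sample_size=2000):
    """Pick about sample_size vectors at a uniform rate from time-ordered data."""
    # Sort the resource vectors into arrival-time order first.
    ordered = sorted(vectors, key=arrival_key)
    n = len(ordered)
    if n <= sample_size:
        return ordered  # population is no larger than the requested sample
    step = n / sample_size  # uniform sampling interval (may be fractional)
    return [ordered[int(i * step)] for i in range(sample_size)]


# Hypothetical population: (arrival_time, cpu_seconds, io_count) tuples,
# deliberately supplied out of order.
population = [(t, t % 7, t % 13) for t in range(10000, 0, -1)]
sample = uniform_sample(population, arrival_key=lambda v: v[0])
```

Sampling at a fixed interval from the time-ordered data
spreads the sample across the whole measurement period, so
no time of day is over-represented.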

The third step in the process is the calculation of the
trimmed mean and standard deviation for each of the
features.  This involves sorting the observations for each
feature so that the tail of the distribution can be
excluded.  Although 2.5% is the default value for trimming
the observations, you can override this value.
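
A minimal sketch of the trimmed calculation for one feature
follows.  It assumes that the trim fraction is removed from
the upper tail only (with an optional two-tailed mode) and
that the sample standard deviation is used; the text does not
state either detail, and the names are illustrative.

```python
import statistics


def trimmed_stats(values, trim_fraction=0.025, both_tails=False):
    """Return (mean, standard deviation) after excluding the tail(s)."""
    ordered = sorted(values)
    cut = int(len(ordered) * trim_fraction)  # observations to drop per tail
    low = cut if both_tails else 0           # assumption: upper tail by default
    kept = ordered[low:len(ordered) - cut]
    return statistics.mean(kept), statistics.stdev(kept)


# Hypothetical feature: 39 ordinary observations plus one extreme outlier.
cpu_seconds = list(range(1, 40)) + [10000]
trimmed_mean, trimmed_std = trimmed_stats(cpu_seconds)
untrimmed_mean = statistics.mean(cpu_seconds)
```

In this example the single outlier falls in the trimmed tail,
so the trimmed mean reflects the typical observations rather
than the extreme value.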

In the fourth step, the sample resource vectors are scaled
using the trimmed mean and standard deviation of each
feature.
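
As an illustration only, the sketch below assumes that
scaling means subtracting the trimmed mean and dividing by
the trimmed standard deviation of each feature.  The actual
scaling method is described in Section 11.4.2 and may differ;
the function and variable names are hypothetical.

```python
def scale_vectors(vectors, means, stds):
    """Scale each feature as (value - trimmed mean) / trimmed std dev."""
    scaled = []
    for vec in vectors:
        scaled.append([
            (x - m) / s if s > 0 else 0.0  # guard against a constant feature
            for x, m, s in zip(vec, means, stds)
        ])
    return scaled


# Hypothetical two-feature sample and its trimmed statistics.
sample = [[10.0, 200.0], [20.0, 400.0], [30.0, 600.0]]
means = [20.0, 400.0]
stds = [10.0, 200.0]
scaled_sample = scale_vectors(sample, means, stds)
```

Passing the same means and stds when the population is scaled
keeps the sample and the population in the same coordinate
space, so the clusters found for the sample remain meaningful
for the population.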

In the fifth step, the SAS FASTCLUS procedure is used to
identify clusters in the sample resource vectors.  One
problem introduced by FASTCLUS is the way it treats outliers.
When FASTCLUS encounters an outlier, it places the point in
a cluster by itself to meet the objective radius criterion.
Other clustering algorithms, like ISODATA (HAL73),
selectively expand the radii of some clusters to avoid
forming clusters that represent less than a specified
percentage of the sample.

To implement a similar feature in the workload
characterization program, the results of the FASTCLUS
procedure are post-processed to remove clusters that do not
represent a minimum percentage (0.5%) of the sample.  Ball and
Hall introduced the concept of a minimum cluster size in the
ISODATA points algorithm in 1965 (BAL65).  Within limits,
selected cluster radii are expanded to include the majority
of the points that were represented by the deleted clusters.
The remaining resource vectors that do not fit within the
radii of any of the expanded clusters are "marginally
assigned" to the nearest cluster centroid.
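
The post-processing described above might look like the
following sketch.  It is a simplified illustration under
stated assumptions: every surviving cluster shares one
radius, the expansion limit is a fixed multiple of that
radius, and the names radius and max_expansion are
hypothetical rather than documented parameters.

```python
import math
from collections import Counter


def prune_and_reassign(points, labels, centroids, radius,
                       min_fraction=0.005, max_expansion=2.0):
    """Drop undersized clusters and reassign their points.

    Returns (new_labels, marginal_flags).  Assumes at least one
    cluster meets the minimum size.
    """
    counts = Counter(labels)
    min_size = min_fraction * len(points)
    survivors = {c for c in centroids if counts[c] >= min_size}
    limit = radius * max_expansion  # largest radius a cluster may grow to
    new_labels, marginal = [], []
    for point, label in zip(points, labels):
        if label in survivors:
            new_labels.append(label)
            marginal.append(False)
            continue
        # Point belonged to a deleted cluster: find the nearest survivor.
        nearest = min(survivors, key=lambda c: math.dist(point, centroids[c]))
        new_labels.append(nearest)
        # Beyond the expanded radius, the point is only marginally assigned.
        marginal.append(math.dist(point, centroids[nearest]) > limit)
    return new_labels, marginal


# Hypothetical sample of 400 scaled vectors: two real clusters, two outliers.
points = [(0.0, 0.0)] * 250 + [(10.0, 10.0)] * 148 + [(1.5, 0.0), (100.0, 100.0)]
labels = [0] * 250 + [1] * 148 + [2, 3]
centroids = {0: (0.0, 0.0), 1: (10.0, 10.0), 2: (1.5, 0.0), 3: (100.0, 100.0)}
new_labels, marginal = prune_and_reassign(points, labels, centroids, radius=1.0)
```

Here both single-point clusters fall below 0.5% of the sample
and are deleted.  The nearby outlier is absorbed by an
expanded cluster, while the distant one is marginally
assigned to its nearest centroid.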

Marginal assignment rates of 1% or 2% are common.  The
marginal assignment percentage, the percent of the sample
observations represented by each cluster, and the percent of
the resources represented by each cluster are also calculated
and reported.
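
The three reported percentages can be sketched as shown
below.  The use of a single resource measure (CPU seconds
here) and the layout of the result are assumptions made for
illustration; the report itself may cover several resources.

```python
from collections import defaultdict


def cluster_summary(labels, marginal, resources):
    """Summarize cluster membership as percentages.

    Assumes a nonempty input and a nonzero resource total.
    """
    n = len(labels)
    total = sum(resources)
    obs = defaultdict(int)
    res = defaultdict(float)
    for label, amount in zip(labels, resources):
        obs[label] += 1        # observations in this cluster
        res[label] += amount   # resource consumed by this cluster
    return {
        "marginal_pct": 100.0 * sum(marginal) / n,
        "obs_pct": {c: 100.0 * k / n for c, k in obs.items()},
        "res_pct": {c: 100.0 * r / total for c, r in res.items()},
    }


# Hypothetical four-observation example.
labels = [0, 0, 0, 1]
marginal = [False, False, False, True]
cpu_seconds = [1.0, 1.0, 2.0, 6.0]
summary = cluster_summary(labels, marginal, cpu_seconds)
```

In this example cluster 1 holds a quarter of the observations
but most of the resource use, which is the kind of contrast
the two per-cluster percentages are meant to expose.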

In the sixth step, all of the resource vectors in the
population are scaled using the same trimmed mean and
standard deviation values that were used to scale the sample.

In the final step, the marginal assignment percentage, the
percentage of the observations represented by each cluster,
and the percent of the resources represented by each of the
clusters are calculated and reported for the population.  If
the marginal assignment rate and the percent of the resource
vectors represented by each cluster agree with the values
calculated for the sample, then the first hypothesis has been
successfully tested.  You can request that the data set that
is created by the population assignment DATA step be copied
to a user-defined SAS file for later analysis.
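
One way to check whether the population values agree with the
sample values is sketched below.  The text does not define
how close the values must be, so the tolerance of two
percentage points and the function name are assumptions.

```python
def percentages_agree(sample_pct, population_pct, tolerance=2.0):
    """True if every cluster's percentages differ by at most tolerance points."""
    clusters = set(sample_pct) | set(population_pct)
    return all(
        abs(sample_pct.get(c, 0.0) - population_pct.get(c, 0.0)) <= tolerance
        for c in clusters
    )


# Hypothetical percentage of observations per cluster.
sample_obs = {0: 60.0, 1: 40.0}
close_population = {0: 61.2, 1: 38.8}
distant_population = {0: 70.0, 1: 30.0}

fits = percentages_agree(sample_obs, close_population)
does_not_fit = percentages_agree(sample_obs, distant_population)
```

The same comparison can be applied to the marginal assignment
rate by passing single-entry dictionaries for the sample and
the population.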