

3.3.1 Functional Description


During the early 1970s, Domenico Ferrari introduced the
concept of workload characterization (FER72).  The basic
premise of the concept was that studies of computer systems
could be facilitated by the identification of a number of
"resource patterns" that could be used to statistically
represent a system's workload.  Since capacity planning and
performance management studies often encompass tens of
thousands of jobs or hundreds of thousands of transactions,
the ability to represent the workload by a limited number of
resource patterns can significantly simplify a study.
(Typical studies identify between 10 and 30 patterns.)

Ferrari identified these resource patterns, which he called
clusters, by visual inspection of two-dimensional scatter
plots.  For example, Ferrari reported studies where he
examined scatter plots of print lines and CPU seconds
representing the resource usage patterns of student jobs.
Unfortunately, the dependence on visual inspection limited
the technique to two or perhaps three dimensions, since
graphic representations become extremely complex for higher
dimensions.  Also, the technique depended upon the visual
skill of the interpreter, since some analysts could "see"
patterns better than others.  These problems limited the
applications of what was otherwise a valid concept.

In 1976, however, two papers were published that detailed the
application of statistical pattern recognition (clustering)
techniques to the workload characterization problem.  These
techniques allowed problems of higher dimensions to be solved and
eliminated the dependence on the analyst's interpretive skill
and judgment.  Agrawala and Mohr reported on their initial
studies of Univac workloads at the University of Maryland
(AGR76) and Artis reported on studies of IBM workloads at
Bell Laboratories (ART76).  Since that time, these and other
authors have published numerous papers on the subject in
venues such as the Computer Measurement Group (CMG)
proceedings.

The Data Clustering analysis tool allows you to apply
cluster-based techniques to the CA MICS database.  The
software can be applied to any of the unsummarized files that
are supported in the DETAIL timespan.  The program identifies
clusters (that is, similar patterns of activity) in the
workload using the SAS FASTCLUS procedure.
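
To illustrate what identifying clusters means in practice, the
following is a minimal Python sketch of nearest-centroid
(k-means style) clustering, the general family of methods to
which FASTCLUS belongs.  It is not the tool's implementation.
The job records, the choice of seeds, and the names nearest and
kmeans are hypothetical and exist only for illustration.

```python
import math

def nearest(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, seeds, max_iter=10):
    """Repeat: assign each point to its nearest centroid, then recompute centroids."""
    centroids = [tuple(s) for s in seeds]
    assignment = []
    for _ in range(max_iter):
        assignment = [nearest(p, centroids) for p in points]
        updated = []
        for i, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                # New centroid is the per-variable mean of the cluster members.
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                # Keep the old centroid if no point was assigned to it.
                updated.append(old)
        if updated == centroids:
            break  # Centroids stopped moving, so the assignment is stable.
        centroids = updated
    return centroids, assignment

# Hypothetical job records: (CPU seconds, thousands of print lines).
jobs = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (9.0, 0.5), (9.5, 0.4), (8.5, 0.6)]
centroids, assignment = kmeans(jobs, seeds=[jobs[0], jobs[3]])
print(assignment)
print(centroids)
```

In this made-up data, the first three jobs (light CPU, heavy
printing) and the last three jobs (heavy CPU, light printing)
fall into separate clusters, and each centroid serves as the
"resource pattern" that represents its group.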

The use of clustering techniques introduces other statistical
issues.  Among these are scaling and the treatment of
outliers.  Improper scaling and failure to account for the
influence of abnormally large observations are the most
common reasons for invalid workload characterization studies.
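
Both issues can be seen in a small numerical sketch.  The
Python example below assumes z-score standardization as the
scaling method, which is one common choice and not necessarily
the one this tool applies.  All data values and the names
standardize and distance are hypothetical.

```python
import math
import statistics

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def distance(i, j, xs, ys):
    """Euclidean distance between observations i and j on two variables."""
    return math.dist((xs[i], ys[i]), (xs[j], ys[j]))

# Scaling: hypothetical jobs measured in very different units.
cpu_seconds = [1.0, 2.0, 30.0, 31.0]
print_lines = [5000.0, 1000.0, 5200.0, 1100.0]

# In raw units, print lines dominate: job 0 looks far closer to job 2
# than to job 1, even though jobs 0 and 2 differ by 29 CPU seconds.
raw_02 = distance(0, 2, cpu_seconds, print_lines)
raw_01 = distance(0, 1, cpu_seconds, print_lines)

# After standardizing, both variables contribute on a comparable scale.
z_cpu = standardize(cpu_seconds)
z_lines = standardize(print_lines)
scaled_02 = distance(0, 2, z_cpu, z_lines)
scaled_01 = distance(0, 1, z_cpu, z_lines)

# Outliers: one abnormally large observation inflates the standard
# deviation, which squeezes the typical observations together.
typical = [1.0, 2.0, 3.0, 4.0]
z_clean = standardize(typical)
z_outlier = standardize(typical + [1000.0])
spread_clean = max(z_clean) - min(z_clean)
spread_outlier = max(z_outlier[:4]) - min(z_outlier[:4])

print(raw_02, raw_01)
print(scaled_02, scaled_01)
print(spread_clean, spread_outlier)
```

In the raw data, the distance between jobs is driven almost
entirely by print lines; after scaling, CPU seconds and print
lines carry similar weight.  In the outlier case, the four
typical observations become nearly indistinguishable once the
single large value is included in the scaling.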