3.3.2.4 Data Selection and Outliers

Selecting data elements for the clustering procedure is one
of the most important steps in the workload characterization
process.  In clustering terminology, these data elements are
known as features.  The result of the clustering process is a
ranking of similar types of workloads, based on the mean
values of these features.

The objective is to select the minimum number of features
that can be used to accurately describe the characteristics
of the workload to be analyzed.  A small number of features
is desirable, since it greatly reduces the number of sources
of variability in the characterization, making the results
more robust and easier to understand.  Note that each time a
feature is added, splits may occur in as many as half of the
existing clusters.

Another consideration in the feature selection process is the
representation of outliers.  Outliers are data elements that
cannot be generalized as similar to other data elements in
any of the chosen features.  One example of an outlier is a
batch job that falls into a CPU-bound loop due to a
programming error.  Another example is a periodic process,
such as a monthly close of a financial application, that
runs rarely but requires a large amount of resources.  Outliers
have the highest probability of being poorly represented by
the developed workload characterization.

There are two classes of features that are normally used in
workload characterization studies:

o  Rate-based features, such as EXCPs per CPU second
o  Total features, such as CPU minutes
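
The two classes of features can be sketched in a few lines of
Python.  This is only an illustration: the JobRecord type, its
field names, and the sample values are hypothetical and do not
correspond to any particular accounting record layout.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Hypothetical per-job totals; field names are illustrative only."""
    name: str
    cpu_seconds: float
    excps: int


def total_features(job):
    """Total features: CPU minutes and EXCP count for the whole job."""
    return (job.cpu_seconds / 60.0, float(job.excps))


def rate_features(job):
    """Rate-based feature: EXCPs per CPU second."""
    if job.cpu_seconds <= 0:
        raise ValueError("CPU time must be positive to form a rate")
    return (job.excps / job.cpu_seconds,)


job = JobRecord("PAYROLL1", cpu_seconds=120.0, excps=6000)
print(total_features(job))  # (2.0, 6000.0)
print(rate_features(job))   # (50.0,)
```

The rate is formed from the same totals, so no additional data
needs to be collected to switch between the two classes.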

Rate-based features are desirable because they allow you to
control the content of the outliers.  For example, consider
the use of CPU minutes and EXCPs as total features for
characterizing the resource consumption of a workload.
Because the largest jobs would be outliers, the resulting
characterization would poorly represent them.  Unfortunately,
these jobs might represent a significant percentage of the
resources that are consumed.

To avoid this problem, you could use rate-based features.
When you divide EXCPs by CPU seconds, you change the content
of the outliers.  Rather than representing jobs that consume
significant resources, they represent jobs with exceptional
EXCP/CPU second rates.  The probability that these jobs would
consume significant resources is very small, since the ratio
of elapsed time to CPU resource consumption for I/O-bound
jobs is very high.
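
The shift in outlier content can be seen on a small made-up
sample.  The sketch below assumes a simple outlier rule (a value
more than five median absolute deviations from the median); both
the rule and the job data are illustrative assumptions, not part
of the clustering procedure itself.

```python
from statistics import median


def mad_outliers(values, k=5.0):
    """Return names whose value lies more than k MADs from the median."""
    center = median(values.values())
    mad = median(abs(v - center) for v in values.values())
    return {name for name, v in values.items() if abs(v - center) > k * mad}


# Hypothetical jobs: name -> (CPU seconds, EXCPs).
jobs = {
    "JOBA": (10, 500),
    "JOBB": (12, 540),
    "JOBC": (8, 440),
    "JOBD": (11, 550),
    "JOBE": (9, 432),
    "BIGJOB": (3600, 180000),  # large job with an ordinary EXCP rate
    "IOHOG": (2, 900),         # tiny job with an exceptional EXCP rate
}

cpu_total = {name: cpu for name, (cpu, _) in jobs.items()}
excp_total = {name: excps for name, (_, excps) in jobs.items()}
excp_rate = {name: excps / cpu for name, (cpu, excps) in jobs.items()}

# Total features: the largest job is the outlier.
print(sorted(mad_outliers(cpu_total)))   # ['BIGJOB']
print(sorted(mad_outliers(excp_total)))  # ['BIGJOB']

# Rate-based feature: the outlier is now the job with the unusual rate.
print(sorted(mad_outliers(excp_rate)))   # ['IOHOG']
```

In this sample the rate-based outlier consumes only 2 of roughly
3650 CPU seconds, so representing it poorly costs little, while
the large job now falls among the typical rates.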

Such rate-based features have been called instantaneous
workload descriptions, since they characterize the activities
of a workload element for an instant of CPU time (ART76,
ART78).

Although total features raise concerns about representing
outliers, they have many applications to capacity planning
and performance problems.  For example,
using total features for job and transaction class structures
treats the occasional massive workload element as an
exception.  By handling the occasional element as an
exception, the system can concentrate on exploiting the
characteristics of the majority of the workload elements to
simplify scheduling.  You must, however, always be aware of
the content of the outliers and pay particular attention to
the marginally assigned resource vectors to verify that the
objectives of the study are not compromised.
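
One way to locate marginally assigned resource vectors is
sketched below.  It assumes that "marginally assigned" means a
vector lies almost as close to a second cluster center as to its
own; that interpretation, the 0.8 cutoff, and the sample centers
are assumptions made for illustration.  The features are assumed
to be scaled to comparable ranges already.

```python
import math


def assignment_margin(vector, centroids):
    """Return (nearest centroid index, d_nearest / d_second_nearest).

    A ratio near 1.0 means the vector sits close to the boundary
    between two clusters.  Requires at least two centroids.
    """
    dists = sorted((math.dist(vector, c), i) for i, c in enumerate(centroids))
    (d1, nearest), (d2, _) = dists[0], dists[1]
    ratio = d1 / d2 if d2 > 0 else 1.0
    return nearest, ratio


# Hypothetical cluster centers in a two-feature space.
centroids = [(0.0, 0.0), (10.0, 0.0)]

for vector in [(1.0, 0.0), (4.5, 0.0)]:
    cluster, ratio = assignment_margin(vector, centroids)
    flag = "marginal" if ratio > 0.8 else "clear"
    print(vector, "-> cluster", cluster, flag)
```

Vectors flagged this way are candidates for manual review
against the objectives of the study.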