The technique of exception reporting has always had value for data processing management. Continued growth and increasing complexity, however, make exception reporting a necessity in even the smallest z/OS data centers. The problem is that a given processor or processor complex runs numerous components (e.g., SCP, TSO, JES2, Batch, VTAM), each of which performs a significant level of processing. This combination of functions, which generally processes larger and more diverse workloads, increases complexity and load to the point that management control becomes difficult, if not impossible. For this reason, an exception reporting process should be used as a diagnostic filter to report specific problems or potential problem areas.
Exception reporting operates like an automated medical system for diagnosing the presence or probability of heart failure. By processing the patient's monitored responses and matching them against previous patterns of heart disease and definable thresholds for blood pressure, pulse rate, and so on, such a system can diagnose possible problems. The Exception Reporting system described in this chapter works the same way: it reads the available monitoring sources (e.g., RMF, CA TSO/MON PM, SMF), compares the recorded activity against predefined thresholds, and produces an integrated exception list of potential problem areas.
Exception Reporting is designed to reduce the number and volume of reports that the systems programmer, performance analyst, security officer, and others have to wade through for analysis. Furthermore, reporting the exceptions from the different components in an integrated manner should reduce the time spent tracking problems while increasing the effectiveness of the systems and performance teams through more controlled, systematic reporting of the exceptional conditions that affect the installation's operation.
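The threshold comparison described above can be illustrated with a minimal Python sketch. It is not the CA MICS implementation: the class names, metric names, and limit values are all hypothetical, chosen only to show monitored values from several sources being matched against predefined thresholds and merged into one exception list.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    source: str   # monitoring source, e.g. "RMF" or "SMF"
    metric: str   # hypothetical metric name
    limit: float  # an exception is raised when the observed value exceeds this


@dataclass
class Observation:
    source: str
    metric: str
    value: float


def find_exceptions(observations, thresholds):
    """Return (source, metric, value, limit) for every observation over its limit."""
    limits = {(t.source, t.metric): t.limit for t in thresholds}
    exceptions = []
    for obs in observations:
        limit = limits.get((obs.source, obs.metric))
        # Observations with no defined threshold are filtered out silently.
        if limit is not None and obs.value > limit:
            exceptions.append((obs.source, obs.metric, obs.value, limit))
    return exceptions


# Hypothetical thresholds and monitored values, for illustration only.
thresholds = [
    Threshold("RMF", "demand_paging_rate", 50.0),
    Threshold("SMF", "ipl_count", 5.0),
]
observations = [
    Observation("RMF", "demand_paging_rate", 72.0),
    Observation("RMF", "cpu_busy_pct", 64.0),
    Observation("SMF", "ipl_count", 3.0),
]

exception_list = find_exceptions(observations, thresholds)
for source, metric, value, limit in exception_list:
    print(f"{source} {metric}: {value} exceeds {limit}")
```

Only the paging-rate observation is reported in this example; the other two values either have no threshold or fall within it, which is the filtering effect the chapter describes.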
IDENTIFYING AND QUALIFYING EXCEPTIONS
Identifying and qualifying the exceptions to be reported is
essential to an effective and usable exception reporting
process.
The identification of which exceptions should be reported is
addressed in large part by the exceptions which are
distributed as a standard part of CA MICS. The concept of
exception analysis is to identify and report only those
occurrences which merit visibility and attention. Exception
reporting may be used to report an occurrence that is a
distinct problem (e.g., TCAM/VTAM outage at 2:00 pm), one
that may be a problem (e.g., TSO user overloaded the system
from 1:00 to 1:30 pm) and requires further research, or one
that represents a standards, security, or audit violation (e.g.,
user XYZ is not authorized to use SUPERZAP and was detected
using it seven times last week).
The user may tailor the standard exceptions as explained
in this chapter (in the section on Exception Values).
It is one thing to define exceptions, but quite another to
organize and report them in a usable manner. Most individuals
would expect that simply identifying the exceptions finishes
the job. In practice, the exceptions themselves will probably
be quite voluminous, and they too require categorization,
aggregation, consolidation, and prioritization. This is what
is meant by exception qualification.
The Exception Reporting process enables an exception to
be qualified, and thereby reported, in the following ways:
  o Exception Number for unique definition
    Exception numbers uniquely identify individual
    exceptions. The numbers are sequentially assigned
    within the sharedprefix.MICS.SOURCE(DYcccEXC) members.
  o Severity Level to signify degree of importance
    A severity level code is assigned to each exception to
    differentiate the importance of different exception
    types. Three severity levels are defined: critical,
    impacting, and warning. The assignment of severity
    level is, of course, subjective. The following
    guidelines are suggested:
      o Critical: Assigned to an exception that represents a
        missed service guarantee (e.g., availability,
        response, turnaround), a missed management objective
        (e.g., maximum of 5 IPLs per month), a security
        violation, or a serious violation of an installation
        standard or audit guideline.
      o Impacting: Assigned to an exception that represents
        performance degradation related to reliability,
        service, capacity, turnaround, etc., that has
        created a political situation or has otherwise
        manifested itself as a noticeable problem short of
        the Critical definition.
      o Warning: Assigned to an exception that represents a
        preventive maintenance problem (e.g., buffers are
        running low), a symptomatic performance problem
        (e.g., demand paging rate is above normal), or a
        general installation standard or audit guideline that
        was violated.
    The assignment of the severity level enables the
    exception reports to be prioritized by the seriousness
    of the reported problems and provides a means of
    selecting exceptions for reporting.
  o Management Area to identify area of responsibility
    A management area code is assigned to each exception,
    enabling the exception to be associated with an area of
    responsibility (e.g., Availability).
    The management area code is used primarily for
    reporting purposes and provides a means to organize the
    exceptions.
    The following management areas are defined and in use:
      o Availability: Reliability and availability of
        computing hardware (e.g., CPU) or software
        subsystems (e.g., TSO).
      o Performance: Computing hardware, operations,
        supervisory software, or program product performance.
      o Productivity: Operational and development personnel
        productivity.
      o Security: Physical access, system integrity, and
        data access security.
      o Service: Online response times and batch turnaround.
      o Standards: Enforcement of installation-defined
        standards, guidelines, and policies.
      o Workload: User-submitted load in terms of system
        effectiveness, performance, and operation.
    The management area assignment then enables the
    exceptions to be analyzed by Information Area (e.g.,
    TSO, Hardware Utilization) or by Information Area within
    management area.
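The three qualifiers described above (exception number, severity level, and management area) can be pictured with a short Python sketch. This is an illustration under stated assumptions, not the CA MICS data model: the class and function names, the numeric severity codes, and the sample exception numbers are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical numeric codes; a lower value means a more serious exception.
    CRITICAL = 1
    IMPACTING = 2
    WARNING = 3


@dataclass(frozen=True)
class QualifiedException:
    number: int             # unique exception number
    severity: Severity      # degree of importance
    management_area: str    # area of responsibility, e.g. "Availability"
    information_area: str   # e.g. "TSO" or "Hardware Utilization"
    text: str


def prioritize(exceptions, worst_allowed=Severity.WARNING):
    """Select exceptions at or above a severity level, most serious first."""
    selected = [e for e in exceptions if e.severity <= worst_allowed]
    return sorted(selected, key=lambda e: (e.severity, e.number))


def group_by_area(exceptions):
    """Map (management area, information area) to the exception numbers in it."""
    groups = defaultdict(list)
    for e in exceptions:
        groups[(e.management_area, e.information_area)].append(e.number)
    return dict(groups)


# Hypothetical sample exceptions, for illustration only.
sample = [
    QualifiedException(3, Severity.WARNING, "Performance",
                       "Hardware Utilization", "Demand paging rate above normal"),
    QualifiedException(1, Severity.CRITICAL, "Availability",
                       "TSO", "TSO outage"),
    QualifiedException(2, Severity.IMPACTING, "Workload",
                       "TSO", "TSO user overloaded the system"),
]

ordered = prioritize(sample)                                   # all, by seriousness
urgent = prioritize(sample, worst_allowed=Severity.IMPACTING)  # drops warnings
by_area = group_by_area(sample)
```

Modeling severity as an ordered code lets one field serve both for prioritizing the report and for selecting which exceptions appear in it, while the management area key supports analysis by Information Area within management area.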
Copyright © 2011 CA. All rights reserved.