2.1 The Exception Reporting Process


The technique of exception reporting has always had value for
data processing management.  The increase in growth and
complexity, however, makes exception reporting a necessity in
even the smallest z/OS data centers.  The problem is that a
given processor or processor complex operates using numerous
components (e.g., SCP, TSO, JES2, Batch, VTAM), each of which
performs a significant level of processing.  This combination
of functions, which generally processes larger and more
diverse workloads, results in an increase in complexity and
load that makes the management control problem difficult, if
not impossible.

It is because of this increased complexity and activity
that an exception reporting process should be used as a
diagnostic filter to report specific problems or potential
problem areas.  The concept of exception reporting operates
like an automated medical system for diagnosing the presence
or probability of heart failure.  By processing the monitored
responses of the patient and matching them against previous
patterns of heart disease and certain definable thresholds
for blood pressure, pulse rate, etc., such a system can
provide a diagnosis of the possible problems.

The Exception Reporting system described in this chapter
operates like a medical diagnosis system:  it reads the
available monitoring sources (e.g., RMF, CA TSO/MON PM, SMF),
compares this activity against predefined thresholds, and
provides an integrated exception list of potential problem
areas.
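
The comparison step described above can be pictured with a
minimal Python sketch.  It is an illustration only:  the
metric names, threshold values, and record layout are
hypothetical and are not CA MICS definitions.

```python
from dataclasses import dataclass

# Hypothetical thresholds keyed by (source, metric); values are illustrative only.
THRESHOLDS = {
    ("RMF", "cpu_busy_pct"): 90.0,
    ("RMF", "paging_rate"): 50.0,
    ("SMF", "job_abends"): 5.0,
}


@dataclass(frozen=True)
class Observation:
    source: str   # monitoring source, e.g. "RMF" or "SMF"
    metric: str   # hypothetical metric name
    value: float  # measured value for the interval


def find_exceptions(observations):
    """Return (observation, threshold) pairs where the value exceeds the threshold."""
    exceptions = []
    for obs in observations:
        limit = THRESHOLDS.get((obs.source, obs.metric))
        # Metrics without a defined threshold are never reported.
        if limit is not None and obs.value > limit:
            exceptions.append((obs, limit))
    return exceptions


sample = [
    Observation("RMF", "cpu_busy_pct", 97.5),
    Observation("RMF", "paging_rate", 12.0),
    Observation("SMF", "job_abends", 8.0),
]

# Print one integrated list covering every monitoring source.
for obs, limit in find_exceptions(sample):
    print(f"{obs.source} {obs.metric}: {obs.value} exceeds {limit}")
```

Only the observations that cross a threshold reach the list,
which is what keeps the report volume small.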

Exception Reporting is designed to reduce the number and
volume of reports that the systems programmer, performance
analyst, security officer, and so on have to wade through for
analysis.  Furthermore, reporting the exceptions from the
different components in an integrated manner should reduce
the time spent in problem tracking while increasing the
effectiveness of the systems and performance teams through a
more controlled, systematic reporting of the exceptional
conditions impacting the installation's operation.

IDENTIFYING AND QUALIFYING EXCEPTIONS


The identification and qualification of the exceptions to
be reported is essential to an effective and usable exception
reporting process.

The identification of which exceptions should be reported is
addressed in large part by the exceptions that are
distributed as a standard part of CA MICS.  The concept of
exception analysis is to identify and report only those
occurrences that merit visibility and attention.  Exception
reporting may be used to report an occurrence that is a
distinct problem (e.g., TCAM/VTAM outage at 2:00 pm), one
that may be a problem and requires further research (e.g.,
TSO user overloaded the system from 1:00 to 1:30 pm), or one
that represents a standard, security, or audit violation
(e.g., user XYZ is not authorized to use SUPERZAP and was
detected using it seven times last week).

The user may tailor the standard exceptions as explained
in this chapter (in the section on Exception Values).

It is one thing to define exceptions, but quite another
problem to organize and report them in a usable manner.  Most
individuals would expect that simply identifying the
exceptions finishes the job.  In practice, the exceptions
themselves will probably be quite voluminous, and they too
require categorization, aggregation, consolidation, and
prioritization.  This is what is meant by exception
qualification.

The Exception Reporting process enables an exception to
be qualified, and thereby reported, in the following ways:

      o Exception Number for unique definition

      Exception numbers uniquely identify individual
      exceptions.  The numbers are sequentially assigned
      within the sharedprefix.MICS.SOURCE(DYcccEXC) members.
   
      o Severity Level to signify degree of importance

      A severity level code is assigned to each exception in
      order to differentiate the importance of different
      exception types.  The definition of severity level
      allows for three categories:  critical, impacting, and
      warning.  The assignment of severity level is, of
      course, subjective.  The following guidelines are
      suggested for this purpose:
     
      o Critical:  Assigned to an exception that represents a
        missed service guarantee (e.g., availability,
        response, turnaround), a missed management objective
        (e.g., maximum of 5 IPLs per month), a security
        violation, or a serious violation of an installation
        standard or audit guideline.

      o Impacting:  Assigned to an exception that represents
        performance degradation related to reliability,
        service, capacity, turnaround, etc., which has
        created a political situation, or has in any way
        manifested itself in a noticeable problem short of
        the critical definition.

      o Warning:  Assigned to an exception that represents a
        preventative maintenance problem (e.g., buffers are
        running low), a symptomatic performance problem
        (e.g., demand paging rate is above normal), or a
        general installation standard or audit guideline that
        was violated.
      
      The assignment of the severity level enables the
      exception reports to be prioritized by the seriousness
      of the reported problems, and it also provides a method
      for exception report selection.

      o Management Area to identify area of responsibility

      A management area code is assigned to each exception,
      enabling the exception to be associated with the area
      of responsibility (e.g., Availability).  The management
      area code is used primarily for reporting purposes and
      provides a means to organize the exceptions.
    
      The following list depicts the management areas defined
      and in use:

      o Availability:  Computing hardware (e.g., CPU) or
        software subsystems (e.g., TSO) reliability and
        availability.

      o Performance:  Computing hardware, operations,
        supervisory software, or program product performance.

      o Productivity:  Operational and development personnel
        productivity.

      o Security:  Physical access, system integrity, and
        data access security.

      o Service:  Online response times and batch turnarounds.

      o Standards:  Enforcement of installation-defined
        standards, guidelines, and policies.

      o Workload:  User submitted load in terms of system
        effectiveness, performance, and operation.

      The management area assignment then enables the
      exceptions to be analyzed by Information Area (e.g.,
      TSO, Hardware Utilization) or by Information Area
      within management area.
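
As a rough illustration of how these qualifiers combine, the
Python sketch below selects exceptions by severity, groups
them by management area, and orders each group by severity
and exception number.  The record layout, exception numbers,
and function name are assumptions made for the example and
do not reflect how CA MICS stores or reports exceptions.

```python
from collections import defaultdict
from dataclasses import dataclass

# Lower rank sorts first, so critical exceptions lead the report.
SEVERITY_RANK = {"critical": 0, "impacting": 1, "warning": 2}


@dataclass(frozen=True)
class ExceptionRecord:
    number: int           # unique exception number (values here are made up)
    severity: str         # "critical", "impacting", or "warning"
    management_area: str  # e.g. "Availability", "Security"
    text: str             # short description of the condition


def qualify(records, min_severity="warning"):
    """Keep records at or above min_severity, group them by management
    area, and order each group by severity and then exception number."""
    cutoff = SEVERITY_RANK[min_severity]
    grouped = defaultdict(list)
    for rec in records:
        if SEVERITY_RANK[rec.severity] <= cutoff:
            grouped[rec.management_area].append(rec)
    return {
        area: sorted(recs, key=lambda r: (SEVERITY_RANK[r.severity], r.number))
        for area, recs in sorted(grouped.items())
    }


records = [
    ExceptionRecord(3, "warning", "Performance", "Demand paging rate above normal"),
    ExceptionRecord(1, "critical", "Availability", "VTAM outage"),
    ExceptionRecord(2, "impacting", "Performance", "TSO response degraded"),
    ExceptionRecord(4, "critical", "Security", "Unauthorized SUPERZAP use"),
]

# Report only impacting and critical exceptions, one management area at a time.
for area, recs in qualify(records, min_severity="impacting").items():
    for rec in recs:
        print(f"{area:<13} {rec.severity:<10} {rec.number:05d} {rec.text}")
```

Ranking severity numerically lets the same value drive both
report selection (the cutoff) and prioritization (the sort
key), mirroring the two uses of severity level described
above.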