

3.1.1 System Reliability Summary Report


The System Reliability Summary Report provides a method of
tracking and analyzing the overall reliability of an entire
system, based on a single system identification (SYSID).
Data is gathered from a number of the System Reliability
(SRL) files and is summarized under the following categories:

    o Processor Reliability Indicators
    o Software Reliability Indicators
    o Device Reliability Indicators
    o Media Reliability Indicators
    o Special Reliability Indicators

The objective of the report is to present data that can be
used to identify areas where problems have occurred during
the reporting period and to show the trend of failures in
specific areas over a number of days.  For each indicator,
the counts of key system events or summaries of significant
error conditions are provided for one or more days.  The key
system events include failures or conditions such as machine
checks, channel checks, or processor wait states.  The
significant error conditions include counts such as the
number of temporary and permanent failures by device class
and the number of system software errors.
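
As one way to read a trend from the daily counts, the following
minimal Python sketch compares the average of the earlier part of
the reporting period with the average of the later part.  The
function name trend and the half-period comparison are
illustrative assumptions, not part of the report or the product.
The sample counts are the CHANNEL CHECKS row from Figure 3-2;
treating a missing value (printed as "." on the report) as zero
is also an assumption.

```python
from typing import Optional, Sequence


def trend(daily_counts: Sequence[Optional[int]]) -> str:
    """Classify a row of daily error counts as improving, worsening, or flat.

    Hypothetical helper: compares the mean of the earlier half of the
    period with the mean of the later half.  A missing value (None,
    printed as '.' on the report) is counted as zero errors.
    """
    counts = [0 if c is None else c for c in daily_counts]
    if len(counts) < 2:
        return "flat"
    half = len(counts) // 2
    earlier = sum(counts[:half]) / half
    later = sum(counts[-half:]) / half
    if later < earlier:
        return "improving"
    if later > earlier:
        return "worsening"
    return "flat"


# CHANNEL CHECKS row from the sample report, 01MAY08 through 07MAY08.
channel_checks = [12, 10, 4, 2, 3, 4, 4]
print(trend(channel_checks))  # prints "improving"
```

A simple comparison like this only flags direction.  Whether a
given count is acceptable still depends on the standards of the
installation.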

The indicators should be reviewed with respect to the
standards and procedures being followed in your installation.
The policies and procedures used in each installation clearly
affect the use and interpretation of the reliability
indicators.

PROCESSOR RELIABILITY INDICATORS

Processor reliability indicators generally reflect the status
of the processor and its associated storage and channels.
Failures in this area can result in degradation or, in the
case of a serious failure, interruption of the services
provided by the system.

The processor reliability indicators examine errors and
conditions which either degrade services or interrupt the
processing of the entire system.  The trend over the time
period selected is a first indication of whether the system
is doing better or worse than before.

The following indicators are provided:

   01.IPLS - the number of times the processor was IPLed.
             Each IPL potentially represents an interruption
             to service.

             This count may be higher than the count from
             SMF.  The SMF IPL record is only written if SMF
             is successfully started during the IPL process.
             If an IPL occurs and the system fails or is
             IPLed again before SMF is started, the
             reliability value and SMF value will disagree.

             IPLs scheduled for maintenance, testing, etc.
             should be taken into consideration.

   02.TERMINATION EVENTS - the number of times the system
             went through the 'End of Day' processing.

             If your installation requires that the system be
             shut down in an orderly fashion, that is, that
             the Z EOD command be used at all normal
             shutdowns, then this value may be used with the
             number of IPLs as a key reliability indicator.
             If unscheduled IPLs occurred and this value is
             0, then you can assume that the system has
             crashed or has come down as the result of an
             error.

   03.PROCESSOR CHECKS - the number of times a machine
             check was encountered on the processor.

   04.STORAGE CHECKS - the number of times a machine check
             involving processor storage was encountered.

   05.CHANNEL CHECKS - the number of times a channel check
             occurred.

             Channel checks are all indicators of hardware
             problems that either degraded or interrupted the
             system.  If the number of IPLs is zero or all
             represent scheduled IPLs, then these errors
             degraded the system.

   06.I/O SUPVR WAIT STATES - the number of times that the
             processor was stopped or halted by an error
             during an Input/Output operation.

             I/O supervisor wait states represent a system
             degradation if the error was recognized as a
             wait state.  If the wait was not recognized,
             then the system may have been IPLed,
             interrupting all services.
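
The interpretation rules described for the processor indicators
can be sketched in Python as follows.  This is a minimal
illustration under stated assumptions: the class
ProcessorCounts, the function assess, and the field names are
hypothetical, and the number of scheduled IPLs is assumed to be
supplied by the installation, since the report itself does not
distinguish scheduled from unscheduled IPLs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProcessorCounts:
    """Processor indicator counts for one period (names are illustrative)."""
    ipls: int
    scheduled_ipls: int          # assumed input from the installation
    termination_events: int      # orderly 'End of Day' shutdowns
    processor_checks: int = 0
    storage_checks: int = 0
    channel_checks: int = 0


def assess(c: ProcessorCounts) -> List[str]:
    """Apply the two rules of thumb described in the text above."""
    findings: List[str] = []
    unscheduled = c.ipls - c.scheduled_ipls

    # Unscheduled IPLs with no orderly shutdown suggest a crash.
    if unscheduled > 0 and c.termination_events == 0:
        findings.append("unscheduled IPLs with no orderly shutdown")

    # Hardware checks with no unscheduled IPL imply degradation only.
    checks = c.processor_checks + c.storage_checks + c.channel_checks
    if checks > 0:
        if unscheduled <= 0:
            findings.append("hardware checks degraded the system")
        else:
            findings.append("hardware checks degraded or interrupted the system")
    return findings


for line in assess(ProcessorCounts(ipls=3, scheduled_ipls=1,
                                   termination_events=0, channel_checks=4)):
    print(line)
```

The wording of each finding is illustrative.  An installation
that does not use Z EOD at every normal shutdown cannot rely on
the first rule.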


SOFTWARE RELIABILITY INDICATORS

Software reliability indicators reflect the status of the
operating system software and the status of user software
which logs information to the system error recording data
set.  Software errors can result in degradation or
interruptions to the services provided by the system.

They represent the quantity or volume of errors that have
occurred in the software.  The overall size of a number is an
indication of whether further analysis of the software
failures is required.

The following indicators are provided:

   01.MACHINE CHECK RELATED - the number of failures
             encountered by software modules or routines
             that were related to machine checks.

   02.OPERATOR DETECTED - the number of failures detected
             and logged as the result of a system operator
             action.

   03.ABENDS,PGM INTERRUPTS - the number of failures
             encountered by software modules or routines
             that were the result of an abend or program
             interrupt.

   04.LOST RECORDS - the number of records that were lost
             or not recorded on the system error recording
             data set.

             LOST RECORDS is an indication that some
             number of records could not be written to the
             error recording data set.  If any value appears
             here, efforts should be made to determine what
             type of condition caused the lost records.  A
             large number of channel failures, for example,
             could have caused a lost record condition,
             because the path used to transmit the records
             would itself have been failing.


DEVICE RELIABILITY INDICATORS

Device reliability indicators reflect the status of the
devices, by device class, attached to the processor.  Device
errors can result in many different failures, depending on
the use of the device and the severity of the error.

They represent the overall reliability of the devices
attached to the system.  The size of a number or error count
is an indication of whether further analysis of the device
detail information is required.

The following indicators are provided:

   01.MISSING INTERRUPT EVENTS - the number of times that
             an I/O interrupt has been missed or dropped by
             a device.

   02.RECONFIGURATION EVENTS - the number of times that a
             permanent error on direct access or magnetic
             tape has resulted in a dynamic device
             reconfiguration or swap to an alternate device.

             RECONFIGURATION EVENTS represent permanent
             errors that caused a dynamic device
             reconfiguration or swap to an alternate device.
             One or more permanent errors should appear in
             the direct access or magnetic tape values.

   nn.PERMANENT ERRORS - the number of permanent errors
             encountered by devices within each of the
             following device classes:
                  03.  DASD (direct access)
                  04.  TAPE (magnetic tape)
                  05.  TP (teleprocessing)
                  06.  U/R (unit record)

             06.PERMANENT ERRORS (U/R) is an indication of
             unrecoverable errors that occurred.  This is an
             indication that further analysis is required.

   nn.TEMPORARY ERRORS - the number of temporary errors
             encountered by devices within each of the
             following device classes:
                  07.  DASD (direct access)
                  08.  TAPE (magnetic tape)
                  09.  TP (teleprocessing)
                  10.  U/R (unit record)

             10.TEMPORARY ERRORS (U/R) is an indication of
             recoverable errors that occurred.  If the
             numbers are large, further analysis is required.
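
The review rules for the device indicators can be sketched in
Python as shown below.  This is an illustration only: the
function review_device_counts and its parameters are
hypothetical, and the threshold that defines a "large" number of
temporary errors is an assumption that each installation would
set for itself.  The sample values are the 02MAY08 column of
Figure 3-2.

```python
from typing import Dict, List


def review_device_counts(reconfig_events: int,
                         permanent: Dict[str, int],
                         temporary: Dict[str, int],
                         temporary_threshold: int) -> List[str]:
    """Return review notes for one day of device indicators.

    permanent and temporary map a device class (DASD, TAPE, TP, U/R)
    to its error count.  temporary_threshold is installation-defined.
    """
    notes: List[str] = []

    # A reconfiguration event should be backed by at least one
    # permanent error on direct access or magnetic tape.
    backing = permanent.get("DASD", 0) + permanent.get("TAPE", 0)
    if reconfig_events > 0 and backing == 0:
        notes.append("reconfiguration without DASD/TAPE permanent error: inconsistent")

    # Any permanent error calls for further analysis.
    for device_class, count in permanent.items():
        if count > 0:
            notes.append(f"{device_class}: permanent errors, analyze")

    # Temporary errors matter only when the count is large.
    for device_class, count in temporary.items():
        if count > temporary_threshold:
            notes.append(f"{device_class}: many temporary errors, analyze")
    return notes


notes = review_device_counts(
    reconfig_events=2,
    permanent={"DASD": 10, "TAPE": 8, "TP": 63, "U/R": 9},
    temporary={"DASD": 16, "TP": 5, "U/R": 11},
    temporary_threshold=10,  # assumed value, not a product default
)
for note in notes:
    print(note)
```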


SPECIAL RELIABILITY INDICATORS

Special reliability indicators reflect the status of
special reliability events or errors that have occurred.
These counts generally provide a more detailed review of
indicators for specific devices attached to the system.

They represent errors and conditions that are being tracked
specifically by the installation.

The following indicator is provided:

   01.LASER PRINTER ERRORS - the number of permanent or
             significant errors which have occurred on laser
             printer devices.

             01.LASER PRINTER ERRORS represent the number of
             temporary errors and permanent errors related
             specifically to the laser printers attached to
             the system.  Large values are an indication that
             further analysis is required.

INQUIRY ID:

     SRLLD1

DATA SOURCE (file/timespan):

     SRLDRL, SRLMRL, SRLTRL, SRLXRL and SRLRNC at the
     DETAIL timespan.


DATA ELEMENTS USED:

The data elements used for this inquiry are the following:
___________________________________________________________
|          |                                               |
|  FILE    |               DATA ELEMENTS                   |
|__________|_______________________________________________|
|          |                                               |
|  SRLDRL  | DRLPRMCT DRLTMPCT                             |
|  SRLMRL  | MRLLOGTY MRLPRMCT MRLTMPCT MRLMTS             |
|  SRLTRL  | TRLPRMCT                                      |
|  SRLXRL  | XRLPRMCT                                      |
|  SRLRNC  | RNCTYPE                                       |
|          |                                               |
|__________|_______________________________________________|

CA                                                                                             09:01 THURSDAY, MAY 8, 2008
                                              CA MICS I/S MANAGEMENT SUPPORT
                                                SYSTEM RELIABILITY SUMMARY
                                                  System Identifier S008
--------------------------------------------------------------------------------------------------------------------------
|                                                        |                    FAILURE SUMMARY                    |       |
|                                                        |-------------------------------------------------------|       |
|                                                        |01MAY08|02MAY08|03MAY08|04MAY08|05MAY08|06MAY08|07MAY08| TOTAL |
|                                                        |-------+-------+-------+-------+-------+-------+-------+-------|
|                                                        |NO. OF |NO. OF |NO. OF |NO. OF |NO. OF |NO. OF |NO. OF |NO. OF |
|                                                        |ERRORS |ERRORS |ERRORS |ERRORS |ERRORS |ERRORS |ERRORS |ERRORS |
|--------------------------------------------------------+-------+-------+-------+-------+-------+-------+-------+-------|
|RELIABILITY INDICATORS     |FAILURE CATEGORIES          |       |       |       |       |       |       |       |       |
|---------------------------+----------------------------|       |       |       |       |       |       |       |       |
|A. PROCESSOR               |01.IPLS                     |      1|      2|      1|      1|      3|      3|      3|     14|
|                           |----------------------------+-------+-------+-------+-------+-------+-------+-------+-------|
|                           |05.CHANNEL CHECKS           |     12|     10|      4|      2|      3|      4|      4|     39|
|---------------------------+----------------------------+-------+-------+-------+-------+-------+-------+-------+-------|
|B. SOFTWARE                |03.ABENDS,PGM INTERRUPTS    |      3|     15|      7|     10|     12|      .|      .|     47|
|                           |----------------------------+-------+-------+-------+-------+-------+-------+-------+-------|
|                           |04.LOST RECORDS             |      .|      2|      1|      4|      3|      1|      1|     12|
|---------------------------+----------------------------+-------+-------+-------+-------+-------+-------+-------+-------|
|C. DEVICE                  |01.MISSING INTERRUPT EVENTS |      8|      1|      .|      .|     62|      .|      .|     71|
|                           |----------------------------+-------+-------+-------+-------+-------+-------+-------+-------|
|                           |02.RECONFIGURATION EVENTS   |      .|      2|      1|      .|      .|      .|      .|      3|
|                           |----------------------------+-------+-------+-------+-------+-------+-------+-------+-------|
|                           |03.PERMANENT ERRORS-DASD    |      5|     10|      4|      4|     15|      5|      5|     48|
|                           |----------------------------+-------+-------+-------+-------+-------+-------+-------+-------|
|                           |04.PERMANENT ERRORS-TAPE    |      5|      8|      4|      7|     12|      .|      .|     36|
|                           |----------------------------+-------+-------+-------+-------+-------+-------+-------+-------|
|                           |05.PERMANENT ERRORS-TP      |      3|     63|      1|      1|      7|      .|      .|     75|
|                           |----------------------------+-------+-------+-------+-------+-------+-------+-------+-------|
|                           |06.PERMANENT ERRORS-U/R     |      8|      9|     11|      6|     13|      3|      .|     50|
|                           |----------------------------+-------+-------+-------+-------+-------+-------+-------+-------|
|                           |07.TEMPORARY ERRORS-DASD    |     11|     16|      7|      6|      2|      3|      .|     45|
|                           |----------------------------+-------+-------+-------+-------+-------+-------+-------+-------|
|                           |09.TEMPORARY ERRORS-TP      |      5|      5|      9|      6|      6|      .|      .|     31|
|                           |----------------------------+-------+-------+-------+-------+-------+-------+-------+-------|
|                           |10.TEMPORARY ERRORS-U/R     |     19|     11|     10|     12|     15|      7|      .|     74|
|---------------------------+----------------------------+-------+-------+-------+-------+-------+-------+-------+-------|
|D. SPECIAL                 |01.LASER PRINTER ERRORS     |     19|     11|     10|     12|     15|      7|      .|     74|
|--------------------------------------------------------+-------+-------+-------+-------+-------+-------+-------+-------|
|TOTAL ERRORS                                            |     81|    145|     60|     60|    219|     25|      3|    593|
--------------------------------------------------------------------------------------------------------------------------


 Figure 3-2.  System Reliability Summary Report
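
When a copy of the report is processed outside the product, each
row can be checked by confirming that its daily counts add up to
the value in its TOTAL column.  The Python sketch below shows
one way to do this.  The helper names parse_counts and
row_total_matches are hypothetical, and the assumption that a
"." cell means "no value recorded" and contributes nothing to the
total is inferred from the sample rows in Figure 3-2.

```python
from typing import List, Optional


def parse_counts(cells: List[str]) -> List[Optional[int]]:
    """Convert report cells to integers; a '.' cell becomes None."""
    return [None if cell.strip() == "." else int(cell) for cell in cells]


def row_total_matches(cells: List[str]) -> bool:
    """Check that the daily cells sum to the final TOTAL cell."""
    *daily, total = parse_counts(cells)
    return sum(v for v in daily if v is not None) == total


# MISSING INTERRUPT EVENTS row from Figure 3-2, split on '|'.
row = "|      8|      1|      .|      .|     62|      .|      .|     71|"
cells = row.strip("|").split("|")
print(row_total_matches(cells))  # prints True
```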