2.8.4.2.3 Sysplex Failure Management
The sysplex failure management (SFM) policy lets you specify
the actions MVS should take when certain failures occur in
the sysplex. In practice, this means removing one or more
systems from the sysplex so that the remaining systems can
continue to do work.
Two main types of failures come into play in a sysplex
environment:
o System failures, indicated by a status update missing
condition
o Signaling connectivity failures in the sysplex
The first of these is fairly straightforward. If a system
does not update its status information within a predetermined
period of time, a "status update missing" condition is
raised. SFM allows you to specify how to respond to this
condition. The most common options for addressing this are
to prompt the operator for help in resolving the problem, or
to "isolate" the failing system without operator
intervention. System isolation terminates I/O and coupling
facility accesses, resets channel paths, and loads a
nonrestartable wait state on the failing system, thus
ensuring that the system is unable to corrupt shared
resources.
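The status update missing check described above can be
sketched as follows. This is a minimal illustration, not
the MVS implementation: the SystemStatus class, the
check_status function, the timestamps, and the 25-second
interval are all hypothetical names and values chosen for
the example. The "PROMPT" and "ISOLATE" strings stand in
for the two policy responses described in the text.

```python
from dataclasses import dataclass


@dataclass
class SystemStatus:
    name: str
    last_update: float  # time of the last status update, in seconds
    action: str         # hypothetical policy choice: "PROMPT" or "ISOLATE"


def check_status(systems, now, interval):
    """Return (name, action) for each system whose status update is missing.

    A system is flagged when more than `interval` seconds have passed
    since its last status update.
    """
    responses = []
    for s in systems:
        if now - s.last_update > interval:
            responses.append((s.name, s.action))
    return responses


# Illustrative data only; none of these values come from a real sysplex.
systems = [
    SystemStatus("SYSA", last_update=100.0, action="ISOLATE"),
    SystemStatus("SYSB", last_update=60.0, action="PROMPT"),
    SystemStatus("SYSC", last_update=98.0, action="ISOLATE"),
]

# At time 105, only SYSB has gone more than 25 seconds without an update.
print(check_status(systems, now=105.0, interval=25.0))
```

In this sketch the policy response is simply returned to the
caller. In a real sysplex the isolation itself is carried
out by the system, as described above.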
With a loss of signaling connectivity, the performance
implications and policy decisions are more significant.
Every system in the sysplex must have signaling paths to and
from every other system in the sysplex. If, for some
reason, signaling connectivity between any sysplex systems
is lost, then one or more systems can be removed (under SFM)
so that the systems that remain in the sysplex still have
full connectivity with each other.
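The full connectivity requirement can be illustrated with a
short check. This is a sketch under stated assumptions: the
missing_paths function and the representation of signaling
paths as (sender, receiver) pairs are hypothetical and are
used here only to make the requirement concrete.

```python
from itertools import permutations


def missing_paths(systems, paths):
    """Return the ordered (sender, receiver) pairs that lack a signaling path.

    Full connectivity means every system has a path to AND from every
    other system, so both directions of each pair are checked.
    """
    return [(a, b) for a, b in permutations(systems, 2) if (a, b) not in paths]


# Illustrative configuration: SYSA and SYSB can each reach SYSC,
# but have no path to each other in either direction.
systems = ["SYSA", "SYSB", "SYSC"]
paths = {
    ("SYSA", "SYSC"), ("SYSC", "SYSA"),
    ("SYSB", "SYSC"), ("SYSC", "SYSB"),
}

gaps = missing_paths(systems, paths)
print(gaps)  # an empty list would indicate full connectivity
```

Any pair reported by this check marks a connectivity failure
of the kind that SFM would have to resolve by removing one
or more systems.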
In handling a system connectivity failure, SFM attempts to
maximize the aggregate value of the surviving sysplex to the
installation. SFM allows you to assign a relative importance
to each system in the sysplex. Then, if connectivity between
any two systems is lost, SFM decides which of the two systems
to retain in the sysplex based upon their relative importance.
How you establish your SFM policy can have a very substantial
impact on the performance of the sysplex in the event that
some sort of failure does occur. For example, suppose that
you have a three-system sysplex consisting of SYSA, SYSB,
and SYSC. Further suppose that SYSA is a larger mainframe,
and that SYSB is a rather small system that supports
mission-critical applications. If SYSA and SYSB lose connectivity
with each other, SFM decides which system to keep in the
sysplex, and which one to let go. If SYSA is released,
considerable computing power is no longer available to the
sysplex, and performance may suffer substantially. If SYSB
is released, its critical applications are no longer
available, but the sysplex may have enough computing power
to continue to provide adequate service in other areas.
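The effect of the importance assignments on this example can
be sketched as follows. This is a simplified model, not the
actual SFM algorithm: it assumes that the surviving sysplex
is the fully connected group of systems with the highest
total importance. The best_surviving_set function and both
sets of weights are hypothetical values chosen for
illustration.

```python
from itertools import combinations


def best_surviving_set(weights, lost_links):
    """Return (systems, total_weight) for the best fully connected subset.

    `weights` maps system name to its relative importance.
    `lost_links` is a set of frozensets, each naming two systems that
    can no longer signal each other.
    """
    names = sorted(weights)
    best, best_weight = (), -1
    # Brute force over all subsets; adequate for a handful of systems.
    for size in range(len(names), 0, -1):
        for subset in combinations(names, size):
            connected = all(
                frozenset(pair) not in lost_links
                for pair in combinations(subset, 2)
            )
            total = sum(weights[n] for n in subset)
            if connected and total > best_weight:
                best, best_weight = subset, total
    return list(best), best_weight


# SYSA and SYSB have lost connectivity with each other.
lost = {frozenset(("SYSA", "SYSB"))}

# Hypothetical policy 1: importance follows computing capacity.
capacity_first = {"SYSA": 100, "SYSB": 20, "SYSC": 50}
# Hypothetical policy 2: importance follows application criticality.
critical_first = {"SYSA": 40, "SYSB": 100, "SYSC": 50}

print(best_surviving_set(capacity_first, lost))  # keeps SYSA and SYSC
print(best_surviving_set(critical_first, lost))  # keeps SYSB and SYSC
```

Under the first set of weights the model releases SYSB, and
under the second it releases SYSA, which mirrors the two
outcomes discussed above.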
Tracking and monitoring sysplex handling of system failures
should be an ongoing process, providing feedback to the
policymakers who need to decide upon the optimal sysplex
failure management configuration.