Recovering Jobs from Nodes That Fail

If a node fails while a job is running, you can restart the job on another node, or you can put the job on hold and notify the job's owner to perform manual cleanup.

To control whether a job restarts, specify the /RESTART or /NORESTART qualifier with the DCL command SCHEDULE CREATE, SCHEDULE MODIFY, or SCHEDULE COPY, or choose the Restart options in the DECwindows interface.
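
As a minimal sketch, the restart setting of an existing job might be changed from DCL as follows. The job name NIGHTLY_BACKUP is hypothetical, and the placement of the qualifier after the job name is an assumption; check the command reference for the exact syntax.

```
$! NIGHTLY_BACKUP is a hypothetical job name used for illustration.
$! Allow the job to be restarted after a node failure:
$ SCHEDULE MODIFY NIGHTLY_BACKUP /RESTART
$!
$! Or have the job put on hold for manual cleanup instead:
$ SCHEDULE MODIFY NIGHTLY_BACKUP /NORESTART
```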

You can also restart a job from a particular checkpoint in the job. To set up checkpoints, use the SCHEDULE SET RESTART_VALUE command in your DCL command procedure.

If a node on which the manager is running fails, and the manager is running on at least one other node in the OpenVMS Cluster, the manager:

The manager evaluates the error messages it receives and determines whether the failure occurred because it could not create a detached process or because the job itself failed. If a job has /RESTART set and fails because of a system error that prevents creation of an OpenVMS process, the manager reschedules the job according to its interval; the /RETRY qualifier has no effect on when the job is rescheduled. If the job was created with the /NORESTART qualifier, the manager puts the job on hold and does not restart it automatically.