Recovering Jobs from Nodes That Fail

If a node fails while a job is running, you may want to restart the job on another node. Alternatively, you may decide to put the job on hold and notify the job's owner to perform manual cleanup.

To control whether a job restarts, specify the /RESTART or /NORESTART qualifier with the DCL command SCHEDULE CREATE, SCHEDULE MODIFY, or SCHEDULE COPY, or choose the Restart options in the DECwindows interface.

You can also restart a job from a particular checkpoint in the job. To set up checkpoints, use the SCHEDULE SET RESTART_VALUE command in your DCL command procedure.

If a node on which the manager is running fails, and the manager is running on at least one other node in the OpenVMS Cluster, the manager:

Updates the status of any interrupted jobs to aborted.
Notifies users of the failure.
Restarts each interrupted job if another CPU is available and the job has been set with the /RESTART qualifier.

The manager evaluates the error messages it receives to determine whether a failure occurred because it could not create a detached process or because the job itself failed. If a job has /RESTART set and fails because of a system error that prevents creation of an OpenVMS process, the manager reschedules the job according to its interval; /RETRY does not affect when the job is rescheduled. If the job was created with the /NORESTART qualifier, the manager puts the job on hold and does not restart it automatically.
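The recovery rules described above can be summarized as a small decision function. The following Python sketch is only an illustrative model of those rules, not code from the product: the names Cause, Action, and recovery_action are hypothetical, and it assumes that the /NORESTART hold applies to both kinds of failure. This section does not say what happens to a /RESTART job when no other CPU is available, so the sketch reports that case as unspecified rather than guessing.

```python
from enum import Enum


class Cause(Enum):
    """Why the job stopped (hypothetical names for this sketch)."""
    NODE_FAILURE = "node failure"
    PROCESS_CREATION_ERROR = "system error prevented process creation"


class Action(Enum):
    """What the manager does next, as described in this section."""
    RESTART_ON_OTHER_NODE = "restart on another node"
    RESCHEDULE_AT_INTERVAL = "reschedule according to the job interval"
    HOLD = "put on hold"
    UNSPECIFIED = "not described in this section"


def recovery_action(restart: bool, cause: Cause,
                    other_cpu_available: bool = True) -> Action:
    """Model the recovery decision for one interrupted job.

    restart=True models /RESTART; restart=False models /NORESTART.
    """
    # /NORESTART: the job is put on hold and is not restarted automatically.
    # Assumption: this applies to both failure causes.
    if not restart:
        return Action.HOLD

    # /RESTART and the process could not be created: the job is
    # rescheduled according to its interval. /RETRY plays no part here.
    if cause is Cause.PROCESS_CREATION_ERROR:
        return Action.RESCHEDULE_AT_INTERVAL

    # /RESTART and the node failed: restart if another CPU is available.
    if other_cpu_available:
        return Action.RESTART_ON_OTHER_NODE

    # /RESTART, node failed, no other CPU: the text does not cover this.
    return Action.UNSPECIFIED


if __name__ == "__main__":
    for cause in Cause:
        for restart in (True, False):
            action = recovery_action(restart, cause)
            print(f"restart={restart}, cause={cause.value}: {action.value}")
```

In this model the /RESTART setting is checked first because /NORESTART always results in a hold; only then does the cause of the failure decide between an immediate restart and rescheduling at the job's interval.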
