Error Recovery

When designing an application, you should consider what happens when an error occurs, whether in normal operation (for example, a data validation error) or abnormally (for example, a system crash).

The following are some principles that can be applied when designing for error recovery. Refer to the section on ‘System Recovery’ for a general discussion of recovery considerations.

In the event of a crash, programs should always collapse to a safe point, that is, a point at which no special corrective intervention is required to synchronize the database. Commit control can be used to ensure that this happens, even for transactions that involve many updates to the database.
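The idea of collapsing to a safe point can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in for commit control on the target system; the account table and the transfer function are hypothetical names chosen for the illustration. Either both updates are committed or neither is.

```python
import sqlite3

def transfer(conn, from_id, to_id, amount):
    """Move `amount` between two accounts as one all-or-nothing unit."""
    try:
        # `with conn` commits if the block completes and rolls back
        # if an exception escapes it.
        with conn:
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, from_id),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM account WHERE id = ?", (from_id,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, to_id),
            )
        return True
    except ValueError:
        # The first update has already been rolled back at this point.
        return False

def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
    conn.commit()
    ok = transfer(conn, 1, 2, 30)       # succeeds and is committed
    failed = transfer(conn, 1, 2, 500)  # fails part-way and is rolled back
    balances = dict(conn.execute("SELECT id, balance FROM account"))
    return ok, failed, balances
```

After the failed transfer the balances are exactly as the successful transfer left them, so no corrective intervention is needed.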

Decide what the recovery unit will be if a crash occurs. A critical consideration is usually whether the whole file can be treated as a single recovery unit; this normally amounts to asking whether many users will be using the file at the same time.

If the file can be regarded as a single recovery unit (for example, while it is being updated by a batch process), the whole file can be restored from a backup copy taken at the start of the process.
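A minimal sketch of whole-file recovery follows, assuming the batch job has exclusive use of the file. The function name run_batch and the ".bak" suffix are hypothetical. The backup is kept until the update completes, so a leftover backup at start-up signals that the previous run did not finish.

```python
import os
import shutil
import tempfile

def run_batch(data_path, update):
    """Run `update(data_path)` with the whole file as the recovery unit."""
    backup_path = data_path + ".bak"
    if os.path.exists(backup_path):
        # A leftover backup means the previous run did not complete:
        # restore the file to its state at the start of that run.
        shutil.copy2(backup_path, data_path)
    else:
        # Copy to a temporary name first, then rename, so that a
        # half-written backup is never mistaken for a complete one.
        tmp_path = backup_path + ".tmp"
        shutil.copy2(data_path, tmp_path)
        os.replace(tmp_path, backup_path)
    update(data_path)
    os.remove(backup_path)  # reached only when the update completed

def demo():
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "orders.dat")
        with open(path, "w") as f:
            f.write("A\n")

        def failing_update(p):
            with open(p, "a") as f:
                f.write("partial\n")
            raise RuntimeError("simulated crash")

        def good_update(p):
            with open(p, "a") as f:
                f.write("B\n")

        try:
            run_batch(path, failing_update)
        except RuntimeError:
            pass
        run_batch(path, good_update)  # restores first, then updates
        with open(path) as f:
            return f.read()
```

The partial output of the failed run is discarded when the job is rerun, because the file is restored before the update starts again.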

If the whole file cannot be restored, for instance because other users are likely to hold locks on it (as when one of several interactive programs using a file fails), the recovery unit cannot be the whole file. Journaling can be used to select a recovery unit within a file; recovery units can range from the whole job down to an individual access to the database. Commit control can be used to group individual database accesses into functionally useful recovery units (for example, a whole batch of transactions).
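The following sketch illustrates, in a simplified in-memory form, how a journal of before-images allows the changes of one failed job to be backed out while the changes of other jobs are kept. It is not the system's journaling interface; the names apply_update and back_out_job are hypothetical, and the sketch assumes that record locks prevent two jobs from updating the same record at the same time.

```python
_MISSING = object()  # marks "record did not exist before the update"

def apply_update(records, journal, job, key, new_value):
    """Write the before-image to the journal, then change the record."""
    journal.append((job, key, records.get(key, _MISSING)))
    records[key] = new_value

def back_out_job(records, journal, job):
    """Undo only the changes made by `job`, newest first."""
    for entry_job, key, before in reversed(journal):
        if entry_job != job:
            continue
        if before is _MISSING:
            records.pop(key, None)  # the job added this record
        else:
            records[key] = before   # restore the before-image

def demo():
    records = {"A": 1, "B": 2}
    journal = []
    apply_update(records, journal, "JOB1", "A", 10)
    apply_update(records, journal, "JOB2", "B", 20)
    apply_update(records, journal, "JOB1", "C", 30)
    back_out_job(records, journal, "JOB1")  # JOB1 failed; JOB2 is untouched
    return records
```

Here the recovery unit is the job: the update made by JOB2 survives, while both changes made by JOB1 are removed.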

Make programs restartable. Programs should be written so that, when they are rerun after a crash, they pick up where they left off and resume processing.
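One way to make a program restartable is to record a checkpoint in the same recovery unit as the work itself. The sketch below assumes this approach and again uses Python's sqlite3 module as a stand-in for the real database; the txn, total and checkpoint tables and the fail_at parameter (which simulates a crash) are hypothetical.

```python
import sqlite3

def process_pending(conn, fail_at=None):
    """Post each unprocessed transaction; safe to rerun after a failure."""
    (last_done,) = conn.execute("SELECT last_id FROM checkpoint").fetchone()
    rows = conn.execute(
        "SELECT id, amount FROM txn WHERE id > ? ORDER BY id", (last_done,)
    ).fetchall()
    for txn_id, amount in rows:
        if txn_id == fail_at:
            raise RuntimeError("simulated crash")
        # The posting and the checkpoint are committed together, so the
        # checkpoint never points past work that was not completed.
        with conn:
            conn.execute("UPDATE total SET value = value + ?", (amount,))
            conn.execute("UPDATE checkpoint SET last_id = ?", (txn_id,))

def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE txn (id INTEGER PRIMARY KEY, amount INTEGER)")
    conn.execute("CREATE TABLE total (value INTEGER)")
    conn.execute("CREATE TABLE checkpoint (last_id INTEGER)")
    with conn:
        conn.executemany(
            "INSERT INTO txn VALUES (?, ?)", [(1, 10), (2, 20), (3, 30), (4, 40)]
        )
        conn.execute("INSERT INTO total VALUES (0)")
        conn.execute("INSERT INTO checkpoint VALUES (0)")
    try:
        process_pending(conn, fail_at=3)  # first run stops before txn 3
    except RuntimeError:
        pass
    after_crash = conn.execute("SELECT value FROM total").fetchone()[0]
    process_pending(conn)                 # rerun resumes at txn 3
    final = conn.execute("SELECT value FROM total").fetchone()[0]
    last_id = conn.execute("SELECT last_id FROM checkpoint").fetchone()[0]
    return after_crash, final, last_id
```

The rerun neither repeats transactions 1 and 2 nor skips transactions 3 and 4.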

You should be able to verify that a system is synchronized after a crash; provide inquiry programs and integrity checkers for this purpose.
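As an illustration of an integrity checker, the sketch below compares the total stored on each order header with the sum of its detail lines, and also reports detail lines that have no header. The header/detail layout and the function name check_order_totals are assumptions made for the example.

```python
def check_order_totals(headers, details):
    """Return (mismatched, orphans) for a header/detail pair of files.

    headers: dict mapping order id to the total stored on the header
    details: iterable of (order id, amount) detail lines
    """
    sums = {}
    for order_id, amount in details:
        sums[order_id] = sums.get(order_id, 0) + amount
    # Headers whose stored total disagrees with their detail lines.
    mismatched = sorted(
        oid for oid, total in headers.items() if sums.get(oid, 0) != total
    )
    # Detail lines that refer to an order with no header.
    orphans = sorted(oid for oid in sums if oid not in headers)
    return mismatched, orphans
```

For example, check_order_totals({1: 30, 2: 50}, [(1, 10), (1, 20), (2, 40), (3, 5)]) reports order 2 as out of balance and order 3 as having detail lines without a header.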