When an I/O error occurs while accessing a journal file, the system's response depends on whether it is a read error or a write error:
The DC/UCF system will continue to operate without the use of the damaged journal file, although processing may be slower due to the availability of fewer journal files.
Automatic Recovery Failure
If a transaction abends (or issues a rollback) and, in order to recover, CA IDMS/DB must access a disabled journal file, it places the failing transaction in a suspended state and issues the following message to the log:
DC205009 Transaction suspended. Transaction id xxxxxx
Recovery Procedure Steps
To recover from an I/O error on a journal file, follow these steps:
| Action | Statement |
|---|---|
| 1. Vary the affected journal file offline. | DCMT VARY JOURNAL FILE OFFLINE |
| 2. Monitor the status of the journal file within the system. | DCMT DISPLAY JOURNAL FILE |
| 3. If the journal file's status changes to OFFLINE, continue with Step 4. Otherwise, perform the steps outlined in "If the Journal Does Not Reach Offline Status." | |
| 4. Identify the problem and correct it. If the problem is not associated with the journal file itself (for example, it is due to a bad channel), correct the problem and continue with Step 5. If the problem is due to a damaged file, proceed with the steps outlined in "Repairing a Damaged Journal File." | |
| 5. Vary the affected journal file online. | DCMT VARY JOURNAL FILE ONLINE |
If the Journal Does Not Reach Offline Status
There are two conditions that may prevent a journal file from reaching an offline status:
If the journal file does not reach offline status, take the following actions:
| Action | Statement |
|---|---|
| 1. Determine if active transactions still depend on the journal file. | DCMT DISPLAY JOURNAL FILE PENDING TRANSACTIONS |
| 2. If no pending transactions exist: | |
| 2.1 Offload the journal file. | ARCHIVE JOURNAL |
| 2.2 If the archive is successful, the journal file will reach offline status, so proceed with Step 4 in the preceding table. | |
| 2.3 If the archive is not successful, perform a quiesced backup of all areas that were in update mode at the time of the I/O error and then proceed with the steps outlined in "Repairing a Damaged Journal File." | Various. See "Quiesced Backup Procedure" in 21.2, "Backup Procedures." |
| 3. If pending suspended transactions exist, quiesce all update activity within the system. | DCMT VARY AREA or SEGMENT |
| 4. If pending non-suspended transactions exist: | |
| 4.1 Wait for them to complete. Do not cancel them. | |
| 4.2 If there are pending InDoubt distributed transactions, try to complete them by initiating resynchronization with their coordinator. | Various. See 21.3.3, "Resynchronization." |
| 5. If only pending suspended transactions exist, cancel the system and proceed with the steps outlined in "Manual Recovery Following a Journal File I/O Error." | Operating system facilities |
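The branching in the table above can be easier to follow when written as a small decision function. The following Python sketch is only an illustration of that logic and is not part of CA IDMS. The function name, its parameters, and the returned strings are hypothetical. The transaction counts are assumed to come from an operator reading the output of the display command in Step 1, and the strings name the actions from the table rather than giving complete command syntax.

```python
from typing import List, Optional


def offline_blocked_actions(
    suspended: int,
    non_suspended: int,
    any_in_doubt: bool = False,
    archive_ok: Optional[bool] = None,
) -> List[str]:
    """Return the operator actions suggested by the table, in order.

    suspended      -- number of pending suspended transactions (hypothetical input)
    non_suspended  -- number of pending non-suspended transactions
    any_in_doubt   -- True if any non-suspended transaction is InDoubt
    archive_ok     -- outcome of the archive attempt, or None if not yet run
    """
    actions: List[str] = []

    # Step 2: no pending transactions, so try to offload the journal file.
    if suspended == 0 and non_suspended == 0:
        actions.append("Offload the journal file (ARCHIVE JOURNAL)")
        if archive_ok is True:
            actions.append("Continue with Step 4 of the recovery procedure")
        elif archive_ok is False:
            actions.append("Take a quiesced backup of all areas that were in update mode")
            actions.append("Repair the damaged journal file")
        return actions

    # Step 3: suspended transactions exist, so stop further update activity.
    if suspended > 0:
        actions.append("Quiesce all update activity (DCMT VARY AREA or SEGMENT)")

    if non_suspended > 0:
        # Step 4: let running work finish; never cancel it.
        actions.append("Wait for non-suspended transactions to complete; do not cancel them")
        if any_in_doubt:
            actions.append("Initiate resynchronization with the coordinator")
    else:
        # Step 5: only suspended transactions remain.
        actions.append("Cancel the system and perform manual recovery")

    return actions
```

For example, `offline_blocked_actions(2, 0)` returns the quiesce action followed by the instruction to cancel the system and perform manual recovery, which corresponds to Steps 3 and 5.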
Repairing a Damaged Journal File
Take the following actions to repair a damaged journal file while the DC/UCF system remains active:
| Action | Statement |
|---|---|
| 1. De-allocate the journal file. | DCMT VARY JOURNAL FILE DEALLOCATE. Use the FORCE option if the file cannot be closed (for example, because of a channel problem). |
| 2. Allocate a new journal file; if the FORCE option was used in Step 1, create the file with a new name. | Operating system facilities |
| 3. Format the new journal file. | FORMAT JOURNAL |
| 4. If the file was allocated in a new location or with a new name, update the standard labels in z/VSE. | Operating system facilities |
| 5. If the file was allocated with a new name, make the name known to the DC/UCF system. | DCMT VARY JOURNAL FILE DSNAME |
| 6. Make the new file available to the DC/UCF system. | DCMT VARY JOURNAL FILE ONLINE |
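Which of the repair steps apply depends on whether FORCE was used, whether the file was renamed or moved, and the operating system. As a rough aid, the Python sketch below assembles the applicable steps in order. It is an illustration under stated assumptions, not a CA-supplied tool: the function name and its parameters are hypothetical, and the strings name the statements from the table without giving their full syntax.

```python
from typing import List


def repair_steps(
    force_used: bool,
    new_name: bool = False,
    new_location: bool = False,
    zvse: bool = False,
) -> List[str]:
    """List the repair actions that apply to one damaged journal file."""
    # Using FORCE in Step 1 means the replacement file needs a new name.
    new_name = new_name or force_used

    steps: List[str] = []
    steps.append(
        "DCMT VARY JOURNAL FILE DEALLOCATE" + (" (with FORCE)" if force_used else "")
    )
    steps.append(
        "Allocate a new journal file" + (" under a new name" if new_name else "")
    )
    steps.append("FORMAT JOURNAL")

    # Step 4 is described for z/VSE only.
    if zvse and (new_name or new_location):
        steps.append("Update the standard labels")

    # Step 5 applies only when the data set name changed.
    if new_name:
        steps.append("DCMT VARY JOURNAL FILE DSNAME")

    steps.append("DCMT VARY JOURNAL FILE ONLINE")
    return steps
```

Calling `repair_steps(force_used=True)` yields five steps, including the DSNAME step, because a forced de-allocation implies a renamed file.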
Considerations for Renaming the File
If you allocated the new journal file using a new name, you must make sure that the correct file is used the next time the system is started. If change tracking is in effect for the DC/UCF system, CA IDMS automatically ensures that the correct file is used when the system is restarted following an abnormal termination. However, if change tracking is not in use or if you shut down the system, you must do one of the following:
If you fail to do one of the above, CA IDMS attempts to access the wrong journal file the next time the system is started. This may have serious consequences if the original file still exists.
Manual Recovery Following a Journal File I/O Error
If one or more transactions cannot be rolled back due to their dependence on the damaged journal file, take the following actions to complete the recovery process:
| Action | Statement |
|---|---|
| 1. Restore all areas that were open at the time of the I/O error (including load and queue areas). | RESTORE |
| 2. Check for incomplete InDoubt transactions using the archive journal files created since each backup was taken. | PRINT JOURNAL or FIX ARCHIVE |
| 3. If incomplete InDoubt transactions exist, complete them manually by creating a new archive file. | FIX ARCHIVE using manual recovery control file input to complete the InDoubt transactions |
| 4. Roll forward all restored areas using the archive journal files created since each backup was taken or the corrected file created in the preceding step. | ROLLFORWARD with the COMPLETED and AREA options |
| 5. Initialize all journal files. | FORMAT JOURNAL with the ALL option |
| 6. Back up all recovered database areas. | BACKUP with the FILE option |
| 7. Restart the system. | |
| 8. Rerun all transactions that were not recovered. | |
Considerations
Quiescing System Activity
In a data sharing environment, it is important to quiesce a shared area in all members of the data sharing group. The broadcast capability of DCMT commands can be used to do this easily.
Conservative Approach
The steps outlined above take a conservative approach to the recovery process in two ways:
Distributed Transaction Considerations
If journal information was lost due to the I/O error and areas had to be restored, any transactions whose journal images were lost were effectively backed out. This can lead to a mixed result for distributed transactions, since changes on other systems may have been committed. Unfortunately, there may be no way to determine what other systems are impacted due to the loss in journal information.
If you do know of another system that might be involved in one of these transactions, use its journal or log information to identify distributed transactions impacted by the failure. Look for distributed transactions in which the failed system was involved (as either a participant or a coordinator) and that were committed after the point to which an impacted area was restored.
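The selection rule just described amounts to a two-part filter: the failed system took part in the transaction, and the commit happened after the restore point. The Python sketch below shows that filter on hypothetical data. The `DistributedTxn` record, its fields, and the function name are assumptions made for illustration; they do not reflect any actual CA IDMS journal or log format, and the records would have to be extracted from the other system's journal or log reports by some separate means.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class DistributedTxn:
    """Hypothetical summary of one distributed transaction from another system's log."""
    txn_id: str
    systems: FrozenSet[str]   # coordinator and all participants
    committed_at: datetime


def impacted_transactions(
    txns: Iterable[DistributedTxn],
    failed_system: str,
    restored_to: datetime,
) -> List[DistributedTxn]:
    """Keep transactions that involved the failed system and committed after the restore point."""
    return [
        t for t in txns
        if failed_system in t.systems and t.committed_at > restored_to
    ]
```

Transactions selected this way are candidates for a mixed outcome, since their changes on the failed system were effectively backed out by the restore while their changes elsewhere were committed.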
Copyright © 2014 CA. All rights reserved.