Parallel Data Reorganization

The parallel data load and parallel index update are requested by not specifying OPTION2=BACKUPONLY. They cannot be separated, and, if either fails, they must be restarted specifying OPTION2=LOADONLY.

The REORG function expects to process against a loaded data area with a loaded index. However, because the function must be able to rerun after a previous error, it does not check that these requirements are met.

The data load portion of REORG determines the number of input files to process from its input parameters. REORG starts a subtask to process each input in parallel. Each subtask opens a specific input file containing a range of data rows (in native sequence) and adds them to the data area. The content of the data area at the start of the REORG (beyond the required data area control block) is not important. The content is completely replaced by the input data and empty space (low values). All of the subtasks execute simultaneously and/or concurrently, based upon processor availability.

Each subtask expects to load its share of the data, but because the amount of data within key ranges may vary, any one subtask may load anywhere from zero rows to all of the rows in the area. If no subtask finds an input row, an error is recognized and the parallel reorganization is aborted. That is, the REORG utility cannot be used to null load an area.

The parallel data reorganization is based upon the principle that the data needs to be in better native sequence key order, not in perfect native sequence key order. For perfect order, use the normal native sequence backup with a standard DBUTLTY LOAD function. The non-perfect ordering can be thought of as groups of rows ordered within the group by Native Key sequence.

Each subtask knows only the range of key values it is to read. It does not know how they compare to the ranges of other subtasks. Also unknown is the number of rows each subtask processes. Therefore, the basic flow of processing is that each subtask with one or more rows to add acquires a logical group of tracks of the data area for its data. It adds its data to this group of tracks until it runs out of data or its group of tracks becomes full. If its group of tracks becomes full, the subtask gets another group of tracks. Because all the subtasks are working independently, rows within a group of tracks are in perfect order based upon the native sequence key values, but the native sequence key order varies between groups.
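The following Python sketch illustrates the group-of-tracks behavior described above. It is a simplified model, not the utility's implementation: the function name, the rows_per_track value, and the round-robin scheduling that stands in for true parallel execution are all assumptions made for illustration.

```python
def simulate_group_load(key_ranges, rows_per_track=4, group_tracks=15):
    """Model subtasks filling private groups of tracks.

    key_ranges: one sorted list of keys per subtask (hypothetical input).
    Returns the groups in the order they were acquired, each as
    (subtask_id, [keys stored in that group]).
    """
    capacity = rows_per_track * group_tracks  # rows that fit in one group
    groups = []        # data area layout, in acquisition order
    open_group = {}    # subtask id -> index of its current group
    inputs = [iter(keys) for keys in key_ranges]
    active = set(range(len(inputs)))
    done = object()    # sentinel marking an exhausted input file

    while active:
        # Round-robin stands in for parallel execution (an assumption).
        for sid in sorted(active):
            key = next(inputs[sid], done)
            if key is done:
                active.discard(sid)
                continue
            idx = open_group.get(sid)
            # Acquire a new group on the first row or when the group is full.
            if idx is None or len(groups[idx][1]) >= capacity:
                groups.append((sid, []))
                idx = open_group[sid] = len(groups) - 1
            groups[idx][1].append(key)
    return groups


layout = simulate_group_load(
    [[1, 2, 3, 4, 5], [10, 11], [20, 21, 22]],
    rows_per_track=1,
    group_tracks=2,
)
for subtask_id, keys in layout:
    print(subtask_id, keys)
```

In this model every group is internally in key order, but reading the groups in physical order yields keys that are not globally sorted, which is the "better, not perfect" ordering the utility aims for.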

We do not recommend using extracted files for input to the REORG utility for the following reasons:

The table must have only unique native sequence keys. Either the native sequence key is defined as unique, or the native sequence key ID is the same as the Master Key ID and the Master Key is defined as allowing no duplicates.
The REORG executes much slower, with additional I/O and CPU usage, when using extracted files than when using backup files. This is because, during the data reorganization, the key value must be extracted from each row and an index look-up done to discover the Unique Row Identifier (URI).
If a duplicate is found or no index entry is found, the REORG utility stops immediately, leaving the data area unusable, partly loaded with reorganized data and partly with old data. At this point, the input data would have to be concatenated as input to the normal LOAD utility.

The default group is 15 tracks, but you can set it to any desired number of tracks by using the optional NUMBER= keyword. A group of tracks acquired by a subtask for its use is not available to any other subtask. This requires that the data area have sufficient free space, at completion of the data load, to allow a partial group for every subtask. Take this into account when selecting the number of subtasks. It is also required that at least one track be available for processing (not in a subtask group) once all the subtasks have completed.
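As a rough planning aid, the free-space requirement can be estimated at track granularity. The sketch below is not part of the product: the function names are hypothetical, and it assumes that a subtask with no rows acquires no group and that the data volume per subtask is known or can be bounded.

```python
def tracks_required(tracks_per_subtask, group_tracks=15):
    """Tracks consumed when each subtask rounds its data up to whole groups.

    tracks_per_subtask: tracks of data each subtask loads (hypothetical input).
    Adds the one track that must remain outside any subtask group.
    """
    total = 0
    for tracks in tracks_per_subtask:
        if tracks > 0:
            groups = -(-tracks // group_tracks)  # ceiling division
            total += groups * group_tracks
    return total + 1


def worst_case_tracks(total_data_tracks, subtasks, group_tracks=15):
    """Upper bound when the split of data across subtasks is unknown.

    Each subtask can strand at most group_tracks - 1 tracks in its
    final, partly filled group.
    """
    return total_data_tracks + subtasks * (group_tracks - 1) + 1


print(tracks_required([40, 0, 16]))   # known split across three subtasks
print(worst_case_tracks(56, 3))       # same data volume, unknown split
```

The bound grows with both the number of subtasks and the group size, so a large NUMBER= value combined with many subtasks calls for a correspondingly larger data area.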

The input data must be for an area loaded with URI=YES. The data area cannot use the DATASPACE option (DSOP=3) for clustering. The data area cannot contain a direct key table. If the input is in extract format, the data area being loaded must not contain a table defined as containing variable rows, or a table that allows duplicate native sequence key values. Extract format data also requires that the REORG build the key value from each row and perform an index look-up to find the URI of the rows.

As each row is placed in a data block, a memory table, indexed by URI, is updated with the block to which the row was moved. The block in which the row originally occurred is not known and is not important. The memory table is stored in a series of data spaces. The number used is the minimum needed to contain the required information, based upon the highest URI in the area and the number of data blocks in the data area. Every URI number possible, from one through the highest currently in the area (as reported in the Directory (CXX) report), requires three or four bytes of space. Data areas of less than 16 million blocks require three bytes per URI. Larger data areas require four bytes per URI. The maximum number of data spaces that can be used is eight, but typically only one or two are used.
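The memory table size can be approximated with simple arithmetic. The sketch below is illustrative only. It assumes that "16 million" means 2**24 blocks (the largest count addressable in three bytes) and that each data space holds up to 2 GB; neither value is stated in this section, so both are exposed as parameters.

```python
BLOCK_THRESHOLD = 2 ** 24       # assumed meaning of "16 million blocks"
DATA_SPACE_BYTES = 2 ** 31      # assumed capacity of one data space (2 GB)
MAX_DATA_SPACES = 8             # limit stated in the text


def uri_table_bytes(highest_uri, data_blocks, threshold=BLOCK_THRESHOLD):
    """Bytes needed to map every URI from 1 through highest_uri to a block."""
    bytes_per_uri = 3 if data_blocks < threshold else 4
    return highest_uri * bytes_per_uri


def data_spaces_needed(table_bytes, space_bytes=DATA_SPACE_BYTES):
    """Minimum number of data spaces that can hold the table."""
    needed = max(1, -(-table_bytes // space_bytes))  # ceiling division
    if needed > MAX_DATA_SPACES:
        raise ValueError("table does not fit in eight data spaces")
    return needed


size = uri_table_bytes(highest_uri=1_000_000, data_blocks=500_000)
print(size, data_spaces_needed(size))
```

Because the table covers every possible URI up to the highest one in the area, its size depends on the highest URI reported in the Directory (CXX) report rather than on the number of rows currently present.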

The REORG utility expects to be executed with more than one subtask and to be provided multiple input files from multiple parallel backups or extracts, but it can also execute as a single task, with backup or extract input from a normal non-parallel backup or extract.

The parallel data load does not rebuild the free space index. At completion of the data load, the database can be opened for update. Any added rows go to the free space beyond the last assigned group of tracks. As soon as possible (next step), a DBUTLTY RETIX function should be executed with KEYNAME=*DATA, to rebuild the space index. The DBUTLTY RETIX function is a Multi-User Facility function. The database can be open and in the act of being updated during this process.

During the load phase, while any of the tasks still have data to load, the status of the execution can be requested by issuing an Operating System modify command to the utility with a STATUS command. This presumes the DBSYSID macro, producing a DBSIDPR program, had the CONSOLE=YES option selected. In the load phase, this message is generated:

DB01320I - REORG BASE n AREA x ON BLOCK y OF z RECORD r

In the message, n is the DBID, x is the area name, y is the highest block taken by any of the 1-25 load subtasks, z is the number of blocks initialized, and r is the number of records loaded by any of the 1-25 load subtasks.
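If the status message is captured from a job log or console log, its fields can be extracted for monitoring. The following sketch assumes the numeric fields appear as plain digits with single spaces between tokens, which may not match the actual console formatting; the function name and the sample line are made up for illustration.

```python
import re

# Pattern follows the message layout shown above; digit formatting is assumed.
STATUS_RE = re.compile(
    r"DB01320I - REORG BASE (?P<dbid>\d+) AREA (?P<area>\S+) "
    r"ON BLOCK (?P<block>\d+) OF (?P<blocks>\d+) RECORD (?P<records>\d+)"
)


def parse_reorg_status(line):
    """Return the message fields as a dict, or None if the line does not match."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    info = {
        "dbid": int(match["dbid"]),
        "area": match["area"],
        "block": int(match["block"]),
        "blocks": int(match["blocks"]),
        "records": int(match["records"]),
    }
    # Highest block taken so far, as a share of the blocks initialized.
    info["percent_blocks"] = (
        100 * info["block"] / info["blocks"] if info["blocks"] else 0.0
    )
    return info


# Sample values are invented for the example.
sample = "DB01320I - REORG BASE 997 AREA A01 ON BLOCK 1200 OF 48000 RECORD 350000"
print(parse_reorg_status(sample))
```

Because y is the highest block taken by any subtask and groups can be partly filled, the computed percentage is only an approximate indication of progress.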