Clustering and Partitioning


The clustering index is not always the primary key. It is generally a sequential range retrieval key, and you should choose it based on the most frequent range access to the table data. Range and sequential retrieval are the primary requirements, but partitioning is also important and can become the most critical requirement as tables grow extremely large. If you do not specify an explicit clustering index, DB2 clusters by the oldest index by definition (often referred to as the first index created). If the oldest index is dropped and recreated, it becomes the newest index, and DB2 then clusters by the next oldest index.
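
As a minimal sketch of declaring the clustering index explicitly instead of relying on the oldest index, the following DDL assumes a hypothetical ORDERS table that is read most often by date range. The table, column, and index names are illustrative only and do not come from this guide.

```sql
-- Hypothetical names: ORDERS, ORDER_DATE, ORDER_ID, ORDERS_DATE_IX.
-- CLUSTER marks this index as the explicit clustering index, so the
-- clustering order does not depend on which index was created first.
CREATE INDEX ORDERS_DATE_IX
    ON ORDERS (ORDER_DATE, ORDER_ID)
    CLUSTER;
```

With the clustering index named explicitly, dropping and recreating another index on the table does not change the clustering sequence.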

The basic rule of clustering is that if your application has a predictable sequential access pattern or a regular batch process, you should cluster the data according to that input sequence.

Clustering and partitioning can be independent, and a lot of options are available for organizing your data:

A single dimension (clustering and partitioning are based on the same key)
Dual dimensions (clustering inside each partition by a different key)
Multiple dimensions (combining tables that use different partitioning by unioning them inside a view)
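
The multiple-dimension option can be sketched as a view over separately partitioned tables. The table names SALES_CURRENT and SALES_HISTORY, their columns, and the view name ALL_SALES are assumptions made for illustration; each underlying table would carry its own partitioning and clustering scheme.

```sql
-- Hypothetical tables: SALES_CURRENT (for example, partitioned by region)
-- and SALES_HISTORY (for example, partitioned by year).
-- The view presents both tables to applications as one logical table.
CREATE VIEW ALL_SALES AS
    SELECT SALE_DATE, REGION_CODE, CUSTOMER_ID, AMOUNT
      FROM SALES_CURRENT
    UNION ALL
    SELECT SALE_DATE, REGION_CODE, CUSTOMER_ID, AMOUNT
      FROM SALES_HISTORY;
```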

You should choose a partitioning strategy that is based on application-controlled parallelism, separating old and new data, grouping data by time, or grouping data by some meaningful business entity (for example, sales region or office location). Within those partitions, you can cluster the data by your most common sequential access pattern.
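
A minimal sketch of this approach, partitioning by time and clustering by a different key inside each partition, might look like the following. It assumes table-controlled partitioning syntax and a hypothetical SALES table; all names and limit-key values are illustrative. The table space clause is omitted here; in practice you would typically name a partitioned table space with a matching number of partitions.

```sql
-- Hypothetical table partitioned by time (one partition per year).
CREATE TABLE SALES
      (SALE_DATE    DATE           NOT NULL,
       REGION_CODE  CHAR(4)        NOT NULL,
       CUSTOMER_ID  INTEGER        NOT NULL,
       AMOUNT       DECIMAL(11,2)  NOT NULL)
    PARTITION BY RANGE (SALE_DATE)
      (PARTITION 1 ENDING AT ('2022-12-31'),
       PARTITION 2 ENDING AT ('2023-12-31'),
       PARTITION 3 ENDING AT ('9999-12-31'));

-- Clustering uses a different key than partitioning: within each
-- partition, rows are kept in CUSTOMER_ID sequence.
CREATE INDEX SALES_CUST_IX
    ON SALES (CUSTOMER_ID, SALE_DATE)
    CLUSTER;
```

In this sketch, old and new data are separated by partition, while a batch process that reads in customer sequence still finds the rows of each partition in that order.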

Note: For more information about bypassing clustering for inserts, see Append Processing for High Volume Inserts.

For very large tables, partitioning is the only way to store the data, but partitioning also has advantages for smaller tables. Consider the following:

DB2 lets you define up to 4096 partitions of up to 64 GB each; however, the total table size is limited, depending on the DSSIZE specified. Non-partitioned table spaces are limited to 64 GB of data.
You can take advantage of the ability to execute utilities on separate partitions in parallel. This practice also lets you access data in certain partitions while utilities are executing on others.
In a data-sharing environment, you can spread partitions among several members to split workloads.
You can also spread your data over multiple volumes and need not use the same storage group for each data set belonging to the table space. This practice also lets you place frequently accessed partitions on faster devices.
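
As a hedged illustration of the last point, the following table space definition places one partition in a different storage group from the rest. The database, table space, and storage group names (SALESDB, SALESTS, SGSTD, SGFAST) are hypothetical, and the DSSIZE value is only an example.

```sql
-- Hypothetical names: SALESDB, SALESTS, SGSTD, SGFAST.
CREATE TABLESPACE SALESTS IN SALESDB
    USING STOGROUP SGSTD                     -- default for partitions not listed below
    NUMPARTS 3
      (PARTITION 3 USING STOGROUP SGFAST)    -- frequently accessed partition on faster devices
    DSSIZE 4 G;                              -- maximum size of each partition
```

Because each partition is a separate data set, the storage group can differ per partition, and utilities can be run against individual partitions.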