Cardinality Statistics

SQL and Access Paths › Importance of Catalog Statistics › Cardinality Statistics

Cardinality Statistics

Cardinality statistics reflect the number of rows in a table or the number of distinct values for a column in a table. These statistics provide the main source for the filter factor. Filter factor is a percentage of the number of rows that are expected to be returned for a column or a table. DB2 uses cardinality statistics to determine which of the following methods to use to access data:

Whether access is using an index scan or a table space scan
Which table to access first when joining tables

DB2 needs an accurate estimate of the number of rows that qualify after applying various predicates in your queries to determine the optimal access path. When multiple tables are accessed, the number of qualifying rows that are estimated for the tables can also affect the table access sequence. Column and table cardinalities provide the basic, but critical information for the DB2 optimizer to make these estimates.

Example: EMP Table Access

When DB2 has to choose the most efficient way to access the EMP table, this access depends on the following factors:

Size of the table
Cardinality of the SEX column
Any index that might be defined on the table

DB2 uses the catalog statistics to determine the filter factor for the SEX column. In this case, 1/COLCARDF, COLCARDF is a column in the SYSIBM.SYSCOLUMNS catalog table. The resulting fractional number represents the number of rows that are expected to be returned based on the predicate. DB2 uses this filter factor number to decide whether to use an index (if available) or table space scan.

Example: EMP Table Access Path

If the cardinality of the SEX column is three (male, female, and unknown), the table has three unique occurrences of a value of SEX. In this case, DB2 determines a filter factor of 1/3, or 33 percent of the values. Consider that the cardinality of the table, as reflected in the CARDF column of the SYSIBM.SYSTABLES catalog table, is 10,000 employees. Then, the estimated number of rows that are returned from the query is 3,333. DB2 uses this information to decide which access path to choose.

Consider that you are using the predicate in the following query in a join to another table. The filter factor that is calculated will be used not only to determine the access path to the EMP table, but it could also influence which table will be accessed first in the join.

SELECT EMPNO
FROM   EMP
WHERE  SEX = :SEX

Example: Column Groups

Statistics can also be gathered in groups of columns. This method is called column correlation statistics. This method is important because the DB2 optimizer can use only the information it is given. The following command shows an example that uses this method:

SELECT E.EMPNO, D.DEPTNO, D.DEPTNAME
FROM   EMP E
INNER JOIN
       DEPT D
ON     E.WORKDEPT = D.DEPTNO
WHERE  LASTNAME   = :LAST-NAME

Example: Filter Factor

Imagine that the amount of money that someone is paid is correlated to the amount of education they have received. For example, an intern with one year of college is not paid as much as an executive with an MBA. However, if the DB2 optimizer is not given this information, it has to multiply the filter factor for the SALARY predicate by the filter factor for the EDLEVEL predicate. This calculation may result in an exaggerated filter factor for the table, and it could negatively influence the choice of an index or the table access sequence of a join. For this reason, it is important to gather column correlation statistics on any columns that might be correlated. More conservatively, you may want to gather column correlation statistics on any columns that are referenced together in the WHERE clause. The following command shows an example that uses this method:

SELECT E.EMPNO
FROM   EMP E
WHERE  SALARY  > :SALARY
AND    EDLEVEL = :EDLEVEL