Previous Topic: Cardinality StatisticsNext Topic: Histogram Statistics


Frequency Distribution Statistics

Frequency distribution statistics reflect the percentage of frequently occurring values. You can collect the information about the percentage of values that occur most or least frequently in a column in a table. These frequency distribution statistics can dramatically impact the performance of the following types of queries:

Note: DB2 can collect distribution statistics using the RUNSTATS utility.

In some cases, cardinality statistics are not enough. With only cardinality statistics, the DB2 optimizer has to assume that the data is evenly distributed. For example, consider the SEX column from the previous queries. The column cardinality is 3. However, it is common that the values in the SEX column will be M for male or F for female, with each representing approximately 50 percent of the values. The value U occurs only a small fraction of the time. This example is called skewed distribution.

Skewed distributions can have a negative influence on query performance if DB2 does not know about the distribution of the data in a table, the input values in a query predicate, or both. In the following query:

SELECT EMPNO
FROM   EMP
WHERE  SEX = :SEX

Likewise, if the following statement is issued:

SELECT EMPNO
FROM   EMP
WHERE  SEX = 'U'

Most of the values are M or F, and DB2 has only cardinality statistics available, then the same formula and filter factor applies. In such situations, the distribution statistics can pay off.

If frequency distribution statistics for the SEX column are gathered in the previous examples, the following frequency values in the SYSIBM.SYSCOLDIST table may occur (simplified in this example):

VALUE

FREQUENCY

M

.49

F

.49

U

.01

If DB2 has this information, and if the following query is issued:

SELECT EMPNO
FROM   EMP
WHERE  SEX = 'U'

DB2 can determine that the value U represents about one percent of the values of the SEX column. This additional information can have dramatic effects on the access path and table access sequence decisions.