If you frequently perform queries with search conditions specified on multiple columns, you should collect statistics on those columns jointly. Statistics collected in this manner permit the Optimizer to more accurately estimate the number of qualifying rows for queries that specify both of those columns. You cannot collect multicolumn statistics on global temporary tables.
Note that the statistics derived for column_1 and column_2 when collected jointly are different from those derived by collecting statistics separately on those same columns. For example, suppose you have two columns named FirstName and LastName. Specifically, the statistics collected on FirstName by itself provide an estimate of the number of individuals having the first name John, for example, and the statistics collected on LastName provide an estimate of the number of individuals having the last name Smith.
These statistics might not provide good estimates of the number of individuals named John Smith. For that, you need to collect statistics on the two columns jointly. This example is provided only to make a clear distinction between collecting statistics on individual columns and collecting statistics on those same columns jointly.
The following table provides a general guideline for determining how to collect statistics on multiple columns.
|IF a set of unindexed columns is …||THEN collect statistics on the …|
|frequently specified together as an equality condition or WHERE clause join condition||column set jointly.|
|infrequently specified together as an equality condition or WHERE clause join condition||individual columns separately.|
Vantage treats multicolumn statistics the same as index statistics if an index has been defined on the same column set.
For example, if you have defined an index on (x1, y1) in table t1, then the following two requests collect the identical statistics and store those statistics in DBC.StatsTbl.
COLLECT STATISTICS ON t1 COLUMN (x1,y1); COLLECT STATISTICS ON t1 INDEX (x1,y1);
As a general rule, if a column has many nulls and it is not specified frequently in predicates, you probably should not include it when you are collecting multicolumn statistics.
For example, suppose you are considering collecting multicolumn statistics on columns 1, 2, and 3, and their values are like the following, where a QUESTION MARK character represents a null.
|column 1||column 2||column 3|
You would probably not want to collect multicolumn statistics that include column 3 unless it is specified in a large number of predicates.