This topic describes how to determine data demographics for multicolumn objects such as composite indexes. Composite indexes are also referred to as multicolumn indexes.
No index of any kind can be based on columns typed as XML, BLOB, or CLOB.
Sources for the Required Information
The following table points to the sources for the multicolumn data demographics requested by the Table form. With the exception of the information documented here, sources are identical to those documented by Sources for the Required Information.
Example: Form for Two Multicolumn Index Candidates
This example form, which continues the form illustrated in Table Form Example indicates two multicolumn index candidates.
Information acquisition for the second of the two multicolumn specifications is made more simple because the specified columns constitute the primary key for the table.
Because the index candidates are both composite, you cannot use an identity column to define them.
- All demographics must be unique.
- No demographics can be null.
Therefore, the value for Rows Null is always 0.
- The value for Distinct Values always matches the Cardinality for the table and the value for Maximum Rows/Value is always 1.
Note that the value for Change Rating is also 0 by default because primary key values should never be changed.
(Multicolumn index considerations)
System: Patient Tracking
|Table Type: History||Cardinality: 730,000||Data Protection: RAID 5|
|Value Access Frequency||120,000|
|Join Access Frequency||2,200||300,000|
|Join Access Rows||200,000||300,000|
|Maximum Rows Null||0||0|
Demographic Trends as a Function of the Number of Columns
Certain characteristics of the demographics of a candidate index change in predictable ways as more columns are added to the index definition.
- The number of distinct values increases because each additional index column further enhances the singularity of the row.
- The number of rows per value decreases because the number of values increases proportionately.
- The selectivity of the candidate index increases because any set of index column values points to fewer rows.
The following table summarizes these trends:
|Index Columns||Number of Distinct Values||Number of Rows Per Value||Selectivity|
|State + Zip Code||?||?||?|
|State + Zip Code + Last Name||Most||Least||Highest|
Calculating Multicolumn Demographics
You must have an understanding of the underlying data before you can undertake a legitimate calculation of the demographics for multicolumn situations.
The following table provides some calculation guidelines for determining the number of distinct values in a binary (two column) situation. The guidelines scale as more columns are added to the candidate index.
|IF distinct column_1 values …||THEN the total number of distinct values in the two-column index equals …||ELSE the number of distinct values is …|
|can be paired with distinct values from column_2||the product of the number of distinct values in column_1 and the number of distinct values in column_2||approximately equal to the product of the number of distinct values in column_1 and the average number of values in column_2 that can be paired with a single column_1 value.|
Example Cases for These Guidelines
The following examples illustrate the guidelines stated in Calculating Multicolumn Demographics.
- Consider the following table fragment. The candidate index for which multicolumn demographics are to be estimated is (ProductCode, ColorCode).
The product (ProductCode) * (ColorCode) is true iff every product is available in every color.
- Consider the following table fragment The candidate index for which multicolumn demographics are to be estimated is (StateCode, ZipCode).
- The product (StateCode) * (ZipCode) is not an accurate estimate of the number of distinct values in the multicolumn index because the resulting product has fewer distinct values than the product of distinct StateCode and distinct ZipCode values.
- If you cannot pair every value in the StateCode column with every value in the ZipCode column, then the number of distinct values in the multicolumn index is approximately equal to (Number of distinct StateCode values) * (Average number of ZipCode values that can be paired with a single StateCode value).
For example, the number of distinct values is the product of the number of states in the US and the average number of zip codes per state, which we estimate to be 20.
Therefore, the number of distinct values for the StateCode + ZipCode index would be roughly 50 * 20, or 1,000.
- The same considerations hold most of the other rows per value calculations.
For example, if every value of column_1 can be paired validly with every value for column_2, then the result of the division is a fairly accurate estimate.
For each value estimated, the minimum is never less than the number of column_1 rows per value divided by the number of distinct column_2 values.
The exception is the count of Rows Null (labeled Rows/Null). See the next section for information on handling multicolumn indexes that are wholly or partially null.
Treating the Rows Null Counts for Composite Indexes
An index can be partly or wholly null. This is almost never a good idea, however. See Designing for Missing Information for information about SQL nulls and the many problems they cause with database integrity.
Consider the following example:
- The first row has both employee and department numbers.
- The second row has a null employee number and a department number.
- The third row has an employee number and a null department number.
- The fourth row has nulls for both its employee number and its department number.
All four rows are valid primary index rows. Should they all contribute to the count for the Rows Null value or should they be treated differently?
The following table provides recommended treatment guidelines for partially and wholly null composite indexes. Composite indexes are also referred to as multicolumn indexes.
|IF a composite index value is …||THEN you should …|
|wholly null||add it to the Rows Null count.|
|partially null||not add it to the Rows Null count.|
The Optimizer can detect the number of distinct values from the rows that are partially null and it tracks them separately from the rows that are wholly null. If a column has many nulls, but is not usually specified in predicates, you probably should not include that column in the multicolumn statistics you collect.
For example, suppose you are considering collecting multicolumn statistics on columns 1, 2, and 3 of a table, and their values are like the following, where a QUESTION MARK character represents a null.
|column 1||column 2||column 3|
You would probably not want to collect multicolumn statistics that include column 3 unless it is specified in a large number of predicates.