This topic describes how to determine data demographics for multicolumn objects such as composite indexes. Composite indexes are also referred to as multicolumn indexes.
No index of any kind can be based on columns typed as XML, BLOB, or CLOB.
Sources for the Required Information
The following table points to the sources for the multicolumn data demographics requested by the Table form. With the exception of the information documented here, sources are identical to those documented by Sources for the Required Information.
Information  Source 

Distinct values  See the following topics. 
Maximum rows/value  
Rows/null  
Typical rows/value 
Example: Form for Two Multicolumn Index Candidates
This example form, which continues the form illustrated in Table Form Example indicates two multicolumn index candidates.
Information acquisition for the second of the two multicolumn specifications is made more simple because the specified columns constitute the primary key for the table.
Because the index candidates are both composite, you cannot use an identity column to define them.
 All demographics must be unique.
 No demographics can be null.
Therefore, the value for Rows Null is always 0.
 The value for Distinct Values always matches the Cardinality for the table and the value for Maximum Rows/Value is always 1.
Note that the value for Change Rating is also 0 by default because primary key values should never be changed.
(Multicolumn index considerations)
Page: of 
ELDM Page:
System: Patient Tracking 

Table Name:
Patient_Service_History 
Table Type: History  Cardinality: 730,000  Data Protection: RAID 5 
Column Name  Patient_ID Service_ID 
Patient_ID Service_ID Timestamp 

PK/FK/ID  FK  PK  
Constraint Number  
Value Access Frequency  120,000  
Join Access Frequency  2,200  300,000  
Join Access Rows  200,000  300,000  
Distinct Values  8,000  730,000  
Maximum Rows/Value  365  1  
Maximum Rows Null  0  0  
Typical Rows/Value  90  1  
Change Rating  0  0  
PI/SI 
Demographic Trends as a Function of the Number of Columns
Certain characteristics of the demographics of a candidate index change in predictable ways as more columns are added to the index definition.
 The number of distinct values increases because each additional index column further enhances the singularity of the row.
 The number of rows per value decreases because the number of values increases proportionately.
 The selectivity of the candidate index increases because any set of index column values points to fewer rows.
The following table summarizes these trends:
Index Columns  Number of Distinct Values  Number of Rows Per Value  Selectivity 

State  Least  Most  Lowest 
State + Zip Code  ?  ?  ? 
State + Zip Code + Last Name  Most  Least  Highest 
Calculating Multicolumn Demographics
You must have an understanding of the underlying data before you can undertake a legitimate calculation of the demographics for multicolumn situations.
The following table provides some calculation guidelines for determining the number of distinct values in a binary (two column) situation. The guidelines scale as more columns are added to the candidate index.
IF distinct column_1 values …  THEN the total number of distinct values in the twocolumn index equals …  ELSE the number of distinct values is … 

can be paired with distinct values from column_2  the product of the number of distinct values in column_1 and the number of distinct values in column_2  approximately equal to the product of the number of distinct values in column_1 and the average number of values in column_2 that can be paired with a single column_1 value. 
Example Cases for These Guidelines
The following examples illustrate the guidelines stated in Calculating Multicolumn Demographics.
 Consider the following table fragment. The candidate index for which multicolumn demographics are to be estimated is (ProductCode, ColorCode).
…  ProductCode  ColorCode  …  
…  …  …  …  
…  …  …  … 
The product (ProductCode) * (ColorCode) is true iff every product is available in every color.
 Consider the following table fragment The candidate index for which multicolumn demographics are to be estimated is (StateCode, ZipCode).
…  StateCode  ZipCode  …  
…  …  …  …  
…  …  …  … 
 The product (StateCode) * (ZipCode) is not an accurate estimate of the number of distinct values in the multicolumn index because the resulting product has fewer distinct values than the product of distinct StateCode and distinct ZipCode values.
 If you cannot pair every value in the StateCode column with every value in the ZipCode column, then the number of distinct values in the multicolumn index is approximately equal to (Number of distinct StateCode values) * (Average number of ZipCode values that can be paired with a single StateCode value).
For example, the number of distinct values is the product of the number of states in the US and the average number of zip codes per state, which we estimate to be 20.
Therefore, the number of distinct values for the StateCode + ZipCode index would be roughly 50 * 20, or 1,000.
 The same considerations hold most of the other rows per value calculations.
For example, if every value of column_1 can be paired validly with every value for column_2, then the result of the division is a fairly accurate estimate.
For each value estimated, the minimum is never less than the number of column_1 rows per value divided by the number of distinct column_2 values.
The exception is the count of Rows Null (labeled Rows/Null). See the next section for information on handling multicolumn indexes that are wholly or partially null.
Treating the Rows Null Counts for Composite Indexes
An index can be partly or wholly null. This is almost never a good idea, however. See Designing for Missing Information for information about SQL nulls and the many problems they cause with database integrity.
Consider the following example:
employee_number  department_number 
PK  
FK  FK 
103794  7012 
?  7012 
105378  ? 
?  ? 
 The first row has both employee and department numbers.
 The second row has a null employee number and a department number.
 The third row has an employee number and a null department number.
 The fourth row has nulls for both its employee number and its department number.
All four rows are valid primary index rows. Should they all contribute to the count for the Rows Null value or should they be treated differently?
The following table provides recommended treatment guidelines for partially and wholly null composite indexes. Composite indexes are also referred to as multicolumn indexes.
IF a composite index value is …  THEN you should … 

wholly null  add it to the Rows Null count. 
partially null  not add it to the Rows Null count. 
The Optimizer can detect the number of distinct values from the rows that are partially null and it tracks them separately from the rows that are wholly null. If a column has many nulls, but is not usually specified in predicates, you probably should not include that column in the multicolumn statistics you collect.
For example, suppose you are considering collecting multicolumn statistics on columns 1, 2, and 3 of a table, and their values are like the following, where a QUESTION MARK character represents a null.
column 1  column 2  column 3  
1  2  3  
?  2  3  
1  ?  3  
1  2  ?  
?  ?  ?  
1  2  ?  
1  2  ?  
1  2  ?  
1  2  ?  
1  2  ? 
You would probably not want to collect multicolumn statistics that include column 3 unless it is specified in a large number of predicates.