16.10 - Table Form: Data Demographics for Multicolumn Database Objects - Teradata Database

Teradata Database Design

Product
Teradata Database
Release Number
16.10
Release Date
June 2017
Content Type
User Guide
Publication ID
B035-1094-161K
Language
English (United States)

This topic describes how to determine data demographics for multicolumn objects such as composite indexes. Composite indexes are also referred to as multicolumn indexes.

No index of any kind can be based on columns typed as XML, BLOB, or CLOB.

Sources for the Required Information

The following table points to the sources for the multicolumn data demographics requested by the Table form. With the exception of the information documented here, sources are identical to those documented by Sources for the Required Information.

Information Source
Distinct values See the following topics.
Maximum rows/value
Rows/null
Typical rows/value

Example: Form for Two Multicolumn Index Candidates

This example form, which continues the form illustrated in Table Form Example indicates two multicolumn index candidates.

Information acquisition for the second of the two multicolumn specifications is made more simple because the specified columns constitute the primary key for the table.

Because the index candidates are both composite, you cannot use an identity column to define them.

This implies certain demographics by default, as described by the following list. The listed implications hold true irrespective of whether the primary key is uni- or multicolumnar.
  • All demographics must be unique.
  • No demographics can be null.

    Therefore, the value for Rows Null is always 0.

  • The value for Distinct Values always matches the Cardinality for the table and the value for Maximum Rows/Value is always 1.

Note that the value for Change Rating is also 0 by default because primary key values should never be changed.

(Multicolumn index considerations)

Page:   of

        ELDM Page: 

        System: Patient Tracking

Table Name:

Patient_Service_History

Table Type: History Cardinality: 730,000 Data Protection: RAID 5
Column Name Patient_ID

Service_ID

Patient_ID

Service_ID

Timestamp

 
PK/FK/ID FK PK  
       
Constraint Number      
Value Access Frequency   120,000  
Join Access Frequency 2,200 300,000  
Join Access Rows 200,000 300,000  
Distinct Values 8,000 730,000  
Maximum Rows/Value 365 1  
Maximum Rows Null 0 0  
Typical Rows/Value 90 1  
Change Rating 0 0  
PI/SI      

Demographic Trends as a Function of the Number of Columns

Certain characteristics of the demographics of a candidate index change in predictable ways as more columns are added to the index definition.

Selectivity is a measure of the ability of an index to return a highly discriminating subset of rows from a table. The higher the selectivity, the fewer rows retrieved.
  • The number of distinct values increases because each additional index column further enhances the singularity of the row.
  • The number of rows per value decreases because the number of values increases proportionately.
  • The selectivity of the candidate index increases because any set of index column values points to fewer rows.

The following table summarizes these trends:

Index Columns Number of Distinct Values Number of Rows Per Value Selectivity
State Least Most Lowest
State + Zip Code ? ? ?
State + Zip Code + Last Name Most Least Highest

Calculating Multicolumn Demographics

You must have an understanding of the underlying data before you can undertake a legitimate calculation of the demographics for multicolumn situations.

The following table provides some calculation guidelines for determining the number of distinct values in a binary (two column) situation. The guidelines scale as more columns are added to the candidate index.

IF distinct column_1 values … THEN the total number of distinct values in the two-column index equals … ELSE the number of distinct values is …
can be paired with distinct values from column_2 the product of the number of distinct values in column_1 and the number of distinct values in column_2 approximately equal to the product of the number of distinct values in column_1 and the average number of values in column_2 that can be paired with a single column_1 value.

Example Cases for These Guidelines

The following examples illustrate the guidelines stated in Calculating Multicolumn Demographics.

  • Consider the following table fragment. The candidate index for which multicolumn demographics are to be estimated is (ProductCode, ColorCode).
  ProductCode ColorCode
 
 

The product (ProductCode) * (ColorCode) is true iff every product is available in every color.

  • Consider the following table fragment The candidate index for which multicolumn demographics are to be estimated is (StateCode, ZipCode).
  StateCode ZipCode
 
 
  • The product (StateCode) * (ZipCode) is not an accurate estimate of the number of distinct values in the multicolumn index because the resulting product has fewer distinct values than the product of distinct StateCode and distinct ZipCode values.
  • If you cannot pair every value in the StateCode column with every value in the ZipCode column, then the number of distinct values in the multicolumn index is approximately equal to (Number of distinct StateCode values) * (Average number of ZipCode values that can be paired with a single StateCode value).

    For example, the number of distinct values is the product of the number of states in the US and the average number of zip codes per state, which we estimate to be 20.

    Therefore, the number of distinct values for the StateCode + ZipCode index would be roughly 50 * 20, or 1,000.

  • The same considerations hold most of the other rows per value calculations.

    For example, if every value of column_1 can be paired validly with every value for column_2, then the result of the division is a fairly accurate estimate.

    For each value estimated, the minimum is never less than the number of column_1 rows per value divided by the number of distinct column_2 values.

    The exception is the count of Rows Null (labeled Rows/Null). See the next section for information on handling multicolumn indexes that are wholly or partially null.

Treating the Rows Null Counts for Composite Indexes

An index can be partly or wholly null. This is almost never a good idea, however. See Designing for Missing Information for information about SQL nulls and the many problems they cause with database integrity.

Consider the following example:

employee_number department_number
PK
FK FK
103794 7012
? 7012
105378 ?
? ?
  • The first row has both employee and department numbers.
  • The second row has a null employee number and a department number.
  • The third row has an employee number and a null department number.
  • The fourth row has nulls for both its employee number and its department number.

All four rows are valid primary index rows. Should they all contribute to the count for the Rows Null value or should they be treated differently?

Only the first of these 4 rows is a valid primary key because by the entity integrity rule (see Rules for Primary Keys), a primary key cannot contain nulls by definition. This is only one of the many reasons that indexes should not be an issue during logical database design.

The following table provides recommended treatment guidelines for partially and wholly null composite indexes. Composite indexes are also referred to as multicolumn indexes.

IF a composite index value is … THEN you should …
wholly null add it to the Rows Null count.
partially null not add it to the Rows Null count.

The Optimizer can detect the number of distinct values from the rows that are partially null and it tracks them separately from the rows that are wholly null. If a column has many nulls, but is not usually specified in predicates, you probably should not include that column in the multicolumn statistics you collect.

For example, suppose you are considering collecting multicolumn statistics on columns 1, 2, and 3 of a table, and their values are like the following, where a QUESTION MARK character represents a null.

  column 1 column 2 column 3
  1 2 3
  ? 2 3
  1 ? 3
  1 2 ?
  ? ? ?
  1 2 ?
  1 2 ?
  1 2 ?
  1 2 ?
  1 2 ?

You would probably not want to collect multicolumn statistics that include column 3 unless it is specified in a large number of predicates.