Table Form: Data Demographics for Multicolumn Database Objects - Advanced SQL Engine

Table Form: Data Demographics for Multicolumn Database Objects - Advanced SQL Engine - Teradata Database

Database Design

Product

Advanced SQL Engine

Teradata Database

Release Number

17.10

Published

July 2021

Language

English (United States)

Last Update

2021-07-27

dita:mapPath

kko1591750222108.ditamap

dita:ditavalPath

kko1591750222108.ditaval

dita:id

B035-1094

lifecycle

Product Category

Teradata Vantage™

This topic describes how to determine data demographics for multicolumn objects such as composite indexes. Composite indexes are also referred to as multicolumn indexes.

No index of any kind can be based on columns typed as XML, BLOB, or CLOB.

Sources for the Required Information

The following table points to the sources for the multicolumn data demographics requested by the Table form. With the exception of the information documented here, sources are identical to those documented by Sources for the Required Information.

Information	Source
Distinct values	See the following topics. Calculating Multicolumn Demographics Example Cases for These Guidelines
Maximum rows/value
Rows/null
Typical rows/value

Example: Form for Two Multicolumn Index Candidates

This example form, which continues the form illustrated in Table Form Example indicates two multicolumn index candidates.

Information acquisition for the second of the two multicolumn specifications is made more simple because the specified columns constitute the primary key for the table.

Because the index candidates are both composite, you cannot use an identity column to define them.

This implies certain demographics by default, as described by the following list. The listed implications hold true irrespective of whether the primary key is uni- or multicolumnar.

All demographics must be unique.
No demographics can be null.
Therefore, the value for Rows Null is always 0.
The value for Distinct Values always matches the Cardinality for the table and the value for Maximum Rows/Value is always 1.

Note that the value for Change Rating is also 0 by default because primary key values should never be changed.

(Multicolumn index considerations) Page: of		ELDM Page: System: Patient Tracking
Table Name: Patient_Service_History	Table Type: History	Cardinality: 730,000	Data Protection: RAID 5
Column Name	Patient_ID Service_ID	Patient_ID Service_ID Timestamp
PK/FK/ID	FK	PK

Constraint Number
Value Access Frequency		120,000
Join Access Frequency	2,200	300,000
Join Access Rows	200,000	300,000
Distinct Values	8,000	730,000
Maximum Rows/Value	365	1
Maximum Rows Null	0	0
Typical Rows/Value	90	1
Change Rating	0	0
PI/SI

Demographic Trends as a Function of the Number of Columns

Certain characteristics of the demographics of a candidate index change in predictable ways as more columns are added to the index definition.

Selectivity is a measure of the ability of an index to return a highly discriminating subset of rows from a table. The higher the selectivity, the fewer rows retrieved.

The number of distinct values increases because each additional index column further enhances the singularity of the row.
The number of rows per value decreases because the number of values increases proportionately.
The selectivity of the candidate index increases because any set of index column values points to fewer rows.

The following table summarizes these trends:

Index Columns	Number of Distinct Values	Number of Rows Per Value	Selectivity
State	Least	Most	Lowest
State + Zip Code	?	?	?
State + Zip Code + Last Name	Most	Least	Highest

Calculating Multicolumn Demographics

You must have an understanding of the underlying data before you can undertake a legitimate calculation of the demographics for multicolumn situations.

The following table provides some calculation guidelines for determining the number of distinct values in a binary (two column) situation. The guidelines scale as more columns are added to the candidate index.

IF distinct column_1 values …	THEN the total number of distinct values in the two-column index equals …	ELSE the number of distinct values is …
can be paired with distinct values from column_2	the product of the number of distinct values in column_1 and the number of distinct values in column_2	approximately equal to the product of the number of distinct values in column_1 and the average number of values in column_2 that can be paired with a single column_1 value.

Example Cases for These Guidelines

The following examples illustrate the guidelines stated in Calculating Multicolumn Demographics.

Consider the following table fragment. The candidate index for which multicolumn demographics are to be estimated is (ProductCode, ColorCode).

…	ProductCode	ColorCode	…
…	…	…	…
…	…	…	…

The product (ProductCode) * (ColorCode) is true if every product is available in every color.

Consider the following table fragment The candidate index for which multicolumn demographics are to be estimated is (StateCode, ZipCode).

…	StateCode	ZipCode	…
…	…	…	…
…	…	…	…

The product (StateCode) * (ZipCode) is not an accurate estimate of the number of distinct values in the multicolumn index because the resulting product has fewer distinct values than the product of distinct StateCode and distinct ZipCode values.
If you cannot pair every value in the StateCode column with every value in the ZipCode column, then the number of distinct values in the multicolumn index is approximately equal to (Number of distinct StateCode values) * (Average number of ZipCode values that can be paired with a single StateCode value).
For example, the number of distinct values is the product of the number of states in the US and the average number of zip codes per state, which we estimate to be 20.

Therefore, the number of distinct values for the StateCode + ZipCode index would be roughly 50 * 20, or 1,000.

The same considerations hold most of the other rows per value calculations.
For example, if every value of column_1 can be paired validly with every value for column_2, then the result of the division is a fairly accurate estimate.

For each value estimated, the minimum is never less than the number of column_1 rows per value divided by the number of distinct column_2 values.

The exception is the count of Rows Null (labeled Rows/Null). See the next section for information on handling multicolumn indexes that are wholly or partially null.

Treating the Rows Null Counts for Composite Indexes

An index can be partly or wholly null. This is almost never a good idea, however. See Designing for Missing Information for information about SQL nulls and the many problems they cause with database integrity.

Consider the following example:

employee_number	department_number
PK
FK	FK
103794	7012
?	7012
105378	?
?	?

The first row has both employee and department numbers.
The second row has a null employee number and a department number.
The third row has an employee number and a null department number.
The fourth row has nulls for both its employee number and its department number.

All four rows are valid primary index rows. Should they all contribute to the count for the Rows Null value or should they be treated differently?

Only the first of these 4 rows is a valid primary key because by the entity integrity rule (see Rules for Primary Keys), a primary key cannot contain nulls by definition. This is only one of the many reasons that indexes should not be an issue during logical database design.

The following table provides recommended treatment guidelines for partially and wholly null composite indexes. Composite indexes are also referred to as multicolumn indexes.

IF a composite index value is …	THEN you should …
wholly null	add it to the Rows Null count.
partially null	not add it to the Rows Null count.

The Optimizer can detect the number of distinct values from the rows that are partially null and it tracks them separately from the rows that are wholly null. If a column has many nulls, but is not usually specified in predicates, you probably should not include that column in the multicolumn statistics you collect.

For example, suppose you are considering collecting multicolumn statistics on columns 1, 2, and 3 of a table, and their values are like the following, where a QUESTION MARK character represents a null.

column 1	column 2	column 3
1	2	3
?	2	3
1	?	3
1	2	?
?	?	?
1	2	?
1	2	?
1	2	?
1	2	?
1	2	?

You would probably not want to collect multicolumn statistics that include column 3 unless it is specified in a large number of predicates.