A histogram is a count of the number of occurrences, or cardinality, of a particular category of data that fall into defined disjunct value range categories (referred to as intervals).
The Interval Histograms Used by Teradata Database
Teradata Database uses equalheight, highbiased, and history interval histograms to represent the cardinalities and other statistical values and demographics of columns and indexes for allAMPs sampled statistics and for fulltable statistic. The greater the number of intervals in a histogram, the more accurately it can describe the distribution of data by characterizing a smaller percentage of its composition per each interval.
Teradata Database determines the number of highbiased, equalheight intervals; history records that can be accommodated using a BLOB with maximum size 64 KB.
You can determine the maximum number of intervals (between 10 and 500) used for a histogram by specifying the MAXINTERVALS USING option when you collect statistics. The default maximum is 250.
The number of intervals used to store statistics is a function of the number of distinct values in the column set represented. For example, if there are only 10 unique values in a column or index set, Teradata Database does not store the statistics for that column across 500 intervals, but across 11 (the master record plus the 10 additional intervals containing the frequencies for the distinct values in the set).
Teradata Database employs the maximum number of intervals only when the number of distinct values in the column or index set for which statistics are being captured equals or exceeds the maximum number of intervals.
The statistical and demographic information maintained in the histograms is used to estimate various attributes of a query, most importantly the cardinalities of various aspects of the relations that are specified in the request.
Note that Teradata Database also uses derived statistics to estimate cardinalities and selectivities. For more information, Derived Statistics. These statistics are based on the initial values stored in the interval histograms, but are then adjusted for accuracy at each stage of query optimization by incorporating additional information such as CHECK and referential integrity constraints, query predicates, and hash and join indexes.
See Derived Statistics for details.
Freshness of Interval Histogram Statistics
The statistics contained in interval histograms are, by their very nature, out of date as soon as the data changes.
The Teradata query optimizer uses derived statistics to reduce the worstcase error of estimating query cardinalities at the various stages of optimization (see Derived Statistics) and provides several threshold and sampling options for collecting statistics that enable you to avoid recollecting statistics that do not need to be refreshed.
See “COLLECT STATISTICS (Optimizer Form)” in SQL Data Definition Language for more information.
Interval Histogram Terminology
The terms used to describe the intervals and histograms used by Teradata Database are defined in the following table:
Term  Definition 

Cardinality  The number of rows per AMP that satisfy a predicate condition. Note that in this context, cardinality generally is not the number of rows in the entire table, but the number of rows that qualify for a predicate condition. See Using Interval Histograms to Make Initial Cardinality Estimates for a fuller explanation of what cardinality means in this context. 
Compressed interval histogram  A histogram that combines highbiased intervals and equalheight intervals or highbiased intervals only, or both, with a master record. 
Equalheight interval  An interval containing column
statistics normalized across the distribution in such a way that the
graph of the distribution of the number of rows as a function of
interval number is flat. This is achieved by varying the width of each interval so it contains approximately the same number of rows (but with different attribute value ranges) as its neighbors. An equalheight interval does not include loner values or NULLs and their frequencies. Equalheight intervals are also known as equaldepth intervals. The definitions for some of the fields in an equalheight are indicated by the following list.

Equalheight interval histogram  A histogram characterized by an
array of ordered, equalheight intervals. Also known as equaldepth interval histograms. 
Highbiased interval  An interval used to characterize a nonNULL skewed value (that is, a loner) for a column. Any value that is significantly skewed is summarized by a highbiased interval. See Loner. The definitions for some of the fields in a highbiased interval are indicated by the following list:

Highbiased interval histogram  A histogram characterized by an array of ordered, highbiased intervals of loners. 
Histogram  A means of representing
distributions as a function of the number of elements per a
determined interval width. Histograms are often called bar charts. Each bar in a histogram represents the number of rows for the defined interval. In relational query optimization theory, the term is used to describe the rows in a Data Dictionary table or system catalog that store the particular intervals used to characterize the frequency distribution of the attribute values for a column set. All histograms described in this topic are frequency histograms. Each interval in a frequency histogram contains fields representing the number of rows that have the attribute values belonging to its range. 
History record  An interval containing historical
information about a histogram including MinValue, MaxValue,
HighModeValue, and HighModeFrequency statistics as well as other
relevant historical data. When you recollect statistics, Teradata Database saves the summary information from the previous histogram as a history record in the new histogram. Teradata Database determines the number of history records a histogram maintains based on a trend analysis. When history records are no longer needed, the system purges them. 
Interval  A bounded, nonoverlapping set of
attribute value frequencies. Sometimes called a bin or bucket. 
Loner, or highbiased value  A loner is a nonNULL value with a high frequency; a highly frequent value can indicate significant skew. To be considered a loner, a value must satisfy the following condition:
where:
Based on the maximum size of 64 KB, Teradata Database determines the number of highbiased intervals and equalheight intervals that can be accommodated in a histogram, while also leaving space for the history records. 
Master Record  A row in DBC.StatsTbl with a StatsId column value of 0 that acts as a statistics
identifier within each source column or index statistics set. The Optimizer uses the master record for each column or index set to maintain the cardinality, average row size, dynamic AMP sample estimates, and other objectlevel attributes. Whenever statistics are collected or updated for any column or index set, Teradata Database updates the master record of the source to reflect the latest objectlevel statistical information. When statistics are refreshed on a column set, the Optimizer resets the UDI counts. Because different statistics can be collected at different times, the current logs of delete and insert counts are retained with the master record before Teradata Database resets them. The UPD counts are retained at the level of individual column statistics. The individual column statistics refer to master record delete and insert counts, column statisticslevel UPD counts, or both, and the latest system UDI counts from DBC.ObjectUsage to determine the overall effective UDI counts for a given statistics. 
Skew  A measure of the asymmetry of the distribution of a set of attribute values or their frequencies. With respect to skew in parallel databases, there are several possible types.
The difference is apparent from the context. As used in this book, the term usually refers to partition skew. Skew is somewhat more likely to be seen with NoPI because of the way the system distributes their rows or with intermediate results. For information about how the Optimizer deals with skew when performing an equality join on skewed tables, see Strategies for Joining Skewed Tables on an Equality Join Condition. 
Types of Interval Histograms Used By Teradata Database
Depending on the distribution of values, which is also referred to as the degree of skew, in a column or index set, any one of four varieties of histogram types can be used to represent its statistics.
 Equalheight interval histogram
Equalheight interval histograms have much lower worstcase and average errors for a wide variety of queries than do equalwidth interval histograms.
The statistics for a column set are expressed as an equalheight interval histogram if none of the frequencies of its values are skewed.
Teradata Database determines the number of highbiased intervals and equalheight intervals that can be accommodated while also leaving space for history records.
In an equalheight interval histogram each interval represents approximately the same number of rows having that range of values, making their ordinate, cardinality, an approximately constant value or height. If a row has a value for a column that is already represented in an interval, then it is counted in that interval, so some intervals may represent more rows than others.
For example, suppose there are approximately 2.5 million rows in a table. Then there would be approximately 10,000 rows per interval, assuming a 250 interval histogram.
Suppose that for a given interval, Teradata Database processes the values for 9,900 rows and the next value is present in 300 rows. Those rows would be counted in this interval, and the cardinality for the interval would 10,200. Alternatively, the rows could be counted in the next interval, so this interval would only represent 9,900 rows.
A COLLECT STATISTICS request divides the rows in ranges of values such that each range has approximately the same number of rows, but it never splits rows with the same value across intervals. To achieve constant interval cardinalities, the interval widths, or value ranges, must vary.
If a histogram contains 250 equalheight intervals, each interval effectively represents 0.40% for the population of attribute values it represents.
The following graphic illustrates the concept of an equalheight interval histogram. Note that the number of rows per equalheight interval is only approximately equal in practice, so the graphic is slightly misleading with respect to the precise equality of heights for all equalheight intervals in the histogram.
 Highbiased interval histogram
Highbiased intervals are used to represent a column or index set only when there is significant skew in the frequency distribution of its values.
In a highbiased interval histogram, each highbiased interval contains only one value and its frequency.
 Compressed histogram
Compressed histograms contain a mix of equal and highbiased intervals plus a master record.
Compressed histograms are an improvement over pure equalheight histograms because they provide exact statistics for the largest values in the data. This is useful for join estimation, for example, which compares the largest values in relations to be joined.
Any highbiased intervals always precede equalheight intervals in a compressed histogram.
The following graphic illustrates the concept of a compressed histogram with 201 intervals. Note that the number of rows per equalheight interval is only approximately equal in practice, so the graphic is slightly misleading with respect to the precise equality of heights for all equalheight intervals in the compressed histogram.
In this example of a compressed interval histogram, the first three intervals define three loner values and the last several define history intervals.
 History record
History records archive a chronicle of the histogram. When you recollect statistics, Teradata Database saves the summary information from the previous histogram as a history interval in the new histogram. The system determines the number of history intervals based on a trend analysis. Teradata Database purges the history intervals that are no longer needed.
A history interval contains MinValue, MaxValue, HighModeValue, HighMode frequency and some additional data.
The number of history intervals maintained for each histogram is based on the following three criteria:
 The significance of the history interval.
The determination of number of history records to retain or purge is determined by their significance.
The significance of a history interval depends on the significance of the interval’s participation in the most recent trend analysis. If the significance of a history interval is less than a user or systemdefined threshold, Teradata Database purges it from the histogram.
The significance of a history record is calculated based on the significances of statistical values such as NumRow, UV, AllNull, NumNull, MinVal, and MaxVal from the history interval. The significance of a single statistical value is determined by the following rules.
 If Teradata Database does not use the statistical value in recent trending analyses, its significance is 0.
 If the statistical value does participate in recent trending analyses, its significance is proportional to an absolute difference of the reliabilities with and without the statistical value.
 The minimum number of history records, which Teradata Database sets to 10 by default, indicates the minimum number of history intervals to retain that is determined to be significant. This default is chosen because a minimum of 10 history intervals is required to achieve the necessary reliability factor of a value greater than 0.9.
 The maximum number of history records, which Teradata Database sets to 20 by default, indicates the maximum number of history intervals that are determined to be significant. This default is chosen to be twice the minimum value, which should be sufficient for any trend analysis.
 The significance of the history interval.
Content and Storage of Histograms
Index and column histograms are stored in the Histogram column of DBC.StatsTbl. This column has a BLOB data type.
Rows for each histogram are separated into the following categories:
 Nulls
 Loners
 Nonloners
Recall that a histogram can be composed of either of the following types of intervals in addition to a master record:
 499 highbiased (loner) intervals and one equalheight interval
 All equalheight (nonloner) intervals, referred to as an equalheight interval histogram.
 A mix of highbiased, history, and equalheight intervals, referred to as a compressed histogram.
The system stores histogram values in a format that is similar to a row structure. All the variable data is stored toward the end of the histogram, and their corresponding offsets are saved in the interval buckets. Data value sizes are constant in the histogram buckets because of the offset usage for variablelength columns. This enables the histogram buckets to have a fixed size, which makes it possible to randomly position to any bucket.
A value descriptor which is saved in the histogram header is used to encode and decode the data values of the histogram.
 Summary statistics are recorded in the histogram header.
Use the HELP STATISTICS statement to report the summary statistics for a histogram.
 Highbiased intervals.
 Equalheight intervals.
 History intervals.
Use the SHOW STATISTICS statement to report the detailed statistics for a histogram.
Teradata Database uses a value descriptor from the histogram header to encode and decode the data values of the histogram. If the histogram is built on multiple columns, Teradata Database maintains it as an array with each component element describing a single column. The array uses a single element to store singlecolumn statistics and multiple elements to store multicolumn statistics. If an array element is a variablelength data type, Teradata Database stores a 4byte offset in the interval for quick access. For fixed length types, the actual data is saved in the intervals. All the variable offsets are saved first followed by the fixed type data.
The histogram header contains all of the summary information and the information to encode and decode the interval buckets. The highlevel structure of a histogram bucket is as follows, with data being stored in the order indicated.
 Biased values [(value_1), (value_2) …]
 Biased value frequencies []
 Equal height values [(Max, Mode) …]
 Equal height bucket counts [(modal frequency, other Vals, other rows) …]
 History record values [(Min, Max, HighMode, HighModeFreq) …]
 History record bucket counts [(OptHistoryIntervalDataType) …]
 Data values for variablelength columns.
Teradata Database stores the corresponding offsets to variablelength columns in the histogram buckets.
There are 3 types of histograms in a Teradata Database interval histogram: highbiased, equalheight, and history. There is also the compressed histogram type, which mixes equalheight and highbiased intervals.
The offset to each bucket in the histogram is computed from its base offset, which is saved in the histogram header.
 Highbiased intervals
Highbiased intervals are used to represent a column or index set only when there is significant skew in the frequency distribution of its values.
Teradata Database saves the loner values and their corresponding frequencies for highbiased intervals in the following general structure.
Statistic  Description 

Loner value  The loner value stored in the interval. 
Loner value frequency  The number of rows that have this loner value. 
 Equalheight intervals
Each equalheight interval contains a mode value for the interval and its frequency, the number of values other than the modal value in the interval, the number of other rows in the interval other than those having the mode frequency, and the lowest frequency among the Other values. Teradata Database uses the lowest frequency to determine the standard deviation in various estimates.
The mode value is saved based on the value descriptor. The frequency and other values/rows are saved in the following general structure.
Statistic  Description 

Mode frequency  The frequency of occurrence of the modal numeric value in the interval. 
Other values  The cardinality of nonmodal values in the interval. 
Other rows  The cardinality of rows other than those having the modal frequency. 
Lowest frequency  The lowest frequency of other (nonmodal) values in the interval. 
 Compressed histogram
Compressed histograms contain a mix of equalheight and highbiased intervals. Compressed histograms are an improvement over pure equalheight histograms because they provide exact statistics for the values with the largest frequencies in the data. This is useful for join estimation, for example, which compares the largest values in relations to be joined.
 History records
This interval type records the history of the histogram. When you recollect statistics, Teradata Database saves the summary information from the previous histogram as a history interval in the histogram.
Teradata Database determines how many history intervals to keep based on a linear trend analysis, and purges history records when they are no longer needed.
The general structure of a history record is as follows:
Statistic  Description 

ANSI timestamp  Statistics collection timestamp. 
Number of highbiased (loners) values  Number of highbiased values (loners) in the histogram. 
Number of equalheight intervals  Number of equalheight intervals in the histogram. 
Usage type  Description of whether the collected statistics are summary or detailed. 
Sampling percentage  If the collected statistics were sampled, states the sampling percentage used. 
Sampling decision by  If sampling was used to collect the statistics, states whether the sampling was systemdetermined or userdetermined. 
Equalheight interval deviations  Deviation of value frequencies within the equalheight intervals in the histogram. 
Number of nulls  Number of partly null rows in the histogram. 
Number of all nulls  Number of all null rows in the histogram. 
High mode frequency  Frequency of occurrence of the highest mode numeric value in the histogram. 
Number of values  Number of distinct values in the histogram. 
Number of rows  Number of rows in the histogram. 
CPU cost  States how much CPU time was consumed collecting the statistics for the histogram. 
I/O cost  States how much input/output was consumed collecting the statistics for the histogram. 
None  Reserved for future use. 
None  Reserved for future use. 
Data Types of Stored Statistics
Teradata Database stores all statistics using the native data type for the column. Because of this, there are no problems with loss of precision, truncation, or with oversized histograms caused by the data types of the values being stored.