16.10 - Interval Histograms - Teradata Database

Teradata Database SQL Request and Transaction Processing

prodname
Teradata Database
vrm_release
16.10
created_date
June 2017
category
Programming Reference
User Guide
featnum
B035-1142-161K

A histogram is a count of the number of occurrences, or cardinality, of a particular category of data that fall into defined disjunct value range categories (referred to as intervals).

The Interval Histograms Used by Teradata Database

Teradata Database uses equal-height, high-biased, and history interval histograms to represent the cardinalities and other statistical values and demographics of columns and indexes for all-AMPs sampled statistics and for full-table statistic. The greater the number of intervals in a histogram, the more accurately it can describe the distribution of data by characterizing a smaller percentage of its composition per each interval.

Teradata Database determines the number of high-biased, equal-height intervals; history records that can be accommodated using a BLOB with maximum size 64 KB.

You can determine the maximum number of intervals (between 10 and 500) used for a histogram by specifying the MAXINTERVALS USING option when you collect statistics. The default maximum is 250.

The number of intervals used to store statistics is a function of the number of distinct values in the column set represented. For example, if there are only 10 unique values in a column or index set, Teradata Database does not store the statistics for that column across 500 intervals, but across 11 (the master record plus the 10 additional intervals containing the frequencies for the distinct values in the set).

Teradata Database employs the maximum number of intervals only when the number of distinct values in the column or index set for which statistics are being captured equals or exceeds the maximum number of intervals.

The statistical and demographic information maintained in the histograms is used to estimate various attributes of a query, most importantly the cardinalities of various aspects of the relations that are specified in the request.

Note that Teradata Database also uses derived statistics to estimate cardinalities and selectivities. For more information, Derived Statistics. These statistics are based on the initial values stored in the interval histograms, but are then adjusted for accuracy at each stage of query optimization by incorporating additional information such as CHECK and referential integrity constraints, query predicates, and hash and join indexes.

See Derived Statistics for details.

Freshness of Interval Histogram Statistics

The statistics contained in interval histograms are, by their very nature, out of date as soon as the data changes.

The Teradata query optimizer uses derived statistics to reduce the worst-case error of estimating query cardinalities at the various stages of optimization (see Derived Statistics) and provides several threshold and sampling options for collecting statistics that enable you to avoid recollecting statistics that do not need to be refreshed.

See “COLLECT STATISTICS (Optimizer Form)” in SQL Data Definition Language for more information.

Interval Histogram Terminology

The terms used to describe the intervals and histograms used by Teradata Database are defined in the following table:

Term Definition
Cardinality The number of rows per AMP that satisfy a predicate condition.

Note that in this context, cardinality generally is not the number of rows in the entire table, but the number of rows that qualify for a predicate condition. See Using Interval Histograms to Make Initial Cardinality Estimates for a fuller explanation of what cardinality means in this context.

Compressed interval histogram A histogram that combines high-biased intervals and equal-height intervals or high-biased intervals only, or both, with a master record.
Equal-height interval An interval containing column statistics normalized across the distribution in such a way that the graph of the distribution of the number of rows as a function of interval number is flat.

This is achieved by varying the width of each interval so it contains approximately the same number of rows (but with different attribute value ranges) as its neighbors. An equal-height interval does not include loner values or NULLs and their frequencies.

Equal-height intervals are also known as equal-depth intervals.

The definitions for some of the fields in an equal-height are indicated by the following list.

  • Values is the total number of non-mode values in the interval.

    When an equal-height interval represents a single value, the value is saved as a Mode Value. Teradata Database sets the number for Values, which is the total number of non-mode values, to 0 in this case.

  • Mode is the most frequent value in the interval.
  • Mode frequency is the number of rows having the mode value.
  • Maximum value is the maximum value covered by the interval.
  • Rows is the total number of rows for the non-mode values in the interval.
Equal-height interval histogram A histogram characterized by an array of ordered, equal-height intervals.

Also known as equal-depth interval histograms.

High-biased interval An interval used to characterize a non-NULL skewed value (that is, a loner) for a column.

Any value that is significantly skewed is summarized by a high-biased interval. See Loner.

The definitions for some of the fields in a high-biased interval are indicated by the following list:

  • The loner value in the interval.
  • The number of rows having the loner value.
High-biased interval histogram A histogram characterized by an array of ordered, high-biased intervals of loners.
Histogram A means of representing distributions as a function of the number of elements per a determined interval width.

Histograms are often called bar charts. Each bar in a histogram represents the number of rows for the defined interval.

In relational query optimization theory, the term is used to describe the rows in a Data Dictionary table or system catalog that store the particular intervals used to characterize the frequency distribution of the attribute values for a column set.

All histograms described in this topic are frequency histograms. Each interval in a frequency histogram contains fields representing the number of rows that have the attribute values belonging to its range.

History record An interval containing historical information about a histogram including MinValue, MaxValue, HighModeValue, and HighModeFrequency statistics as well as other relevant historical data.

When you recollect statistics, Teradata Database saves the summary information from the previous histogram as a history record in the new histogram. Teradata Database determines the number of history records a histogram maintains based on a trend analysis. When history records are no longer needed, the system purges them.

Interval A bounded, non-overlapping set of attribute value frequencies.

Sometimes called a bin or bucket.

Loner, or high-biased value A loner is a non-NULL value with a high frequency; a highly frequent value can indicate significant skew.

To be considered a loner, a value must satisfy the following condition:

where:

  • f defines the frequency of the value.
  • T defines the cardinality of the relation.

Based on the maximum size of 64 KB, Teradata Database determines the number of high-biased intervals and equal-height intervals that can be accommodated in a histogram, while also leaving space for the history records.

Master Record A row in DBC.StatsTbl with a StatsId column value of 0 that acts as a statistics identifier within each source column or index statistics set.

The Optimizer uses the master record for each column or index set to maintain the cardinality, average row size, dynamic AMP sample estimates, and other object-level attributes.

Whenever statistics are collected or updated for any column or index set, Teradata Database updates the master record of the source to reflect the latest object-level statistical information.

When statistics are refreshed on a column set, the Optimizer resets the UDI counts. Because different statistics can be collected at different times, the current logs of delete and insert counts are retained with the master record before Teradata Database resets them. The UPD counts are retained at the level of individual column statistics. The individual column statistics refer to master record delete and insert counts, column statistics-level UPD counts, or both, and the latest system UDI counts from DBC.ObjectUsage to determine the overall effective UDI counts for a given statistics.

Skew A measure of the asymmetry of the distribution of a set of attribute values or their frequencies.

With respect to skew in parallel databases, there are several possible types.

  • Attribute value skew refers to skew that is inherent in the data. An example might be a column that can have only 2 values such as TRUE or FALSE.
  • Partition skew refers to skew that results from an uneven distribution of data across the AMPs.

The difference is apparent from the context. As used in this book, the term usually refers to partition skew.

Skew is somewhat more likely to be seen with NoPI because of the way the system distributes their rows or with intermediate results.

For information about how the Optimizer deals with skew when performing an equality join on skewed tables, see Strategies for Joining Skewed Tables on an Equality Join Condition.

Types of Interval Histograms Used By Teradata Database

Depending on the distribution of values, which is also referred to as the degree of skew, in a column or index set, any one of four varieties of histogram types can be used to represent its statistics.

  • Equal-height interval histogram

    Equal-height interval histograms have much lower worst-case and average errors for a wide variety of queries than do equal-width interval histograms.

    The statistics for a column set are expressed as an equal-height interval histogram if none of the frequencies of its values are skewed.

    Teradata Database determines the number of high-biased intervals and equal-height intervals that can be accommodated while also leaving space for history records.

    In an equal-height interval histogram each interval represents approximately the same number of rows having that range of values, making their ordinate, cardinality, an approximately constant value or height. If a row has a value for a column that is already represented in an interval, then it is counted in that interval, so some intervals may represent more rows than others.

    For example, suppose there are approximately 2.5 million rows in a table. Then there would be approximately 10,000 rows per interval, assuming a 250 interval histogram.

    Suppose that for a given interval, Teradata Database processes the values for 9,900 rows and the next value is present in 300 rows. Those rows would be counted in this interval, and the cardinality for the interval would 10,200. Alternatively, the rows could be counted in the next interval, so this interval would only represent 9,900 rows.

    A COLLECT STATISTICS request divides the rows in ranges of values such that each range has approximately the same number of rows, but it never splits rows with the same value across intervals. To achieve constant interval cardinalities, the interval widths, or value ranges, must vary.

    If a histogram contains 250 equal-height intervals, each interval effectively represents 0.40% for the population of attribute values it represents.

    The following graphic illustrates the concept of an equal-height interval histogram. Note that the number of rows per equal-height interval is only approximately equal in practice, so the graphic is slightly misleading with respect to the precise equality of heights for all equal-height intervals in the histogram.



  • High-biased interval histogram

    High-biased intervals are used to represent a column or index set only when there is significant skew in the frequency distribution of its values.

    In a high-biased interval histogram, each high-biased interval contains only one value and its frequency.



  • Compressed histogram

    Compressed histograms contain a mix of equal- and high-biased intervals plus a master record.

    Compressed histograms are an improvement over pure equal-height histograms because they provide exact statistics for the largest values in the data. This is useful for join estimation, for example, which compares the largest values in relations to be joined.

    Any high-biased intervals always precede equal-height intervals in a compressed histogram.

    The following graphic illustrates the concept of a compressed histogram with 201 intervals. Note that the number of rows per equal-height interval is only approximately equal in practice, so the graphic is slightly misleading with respect to the precise equality of heights for all equal-height intervals in the compressed histogram.

    In this example of a compressed interval histogram, the first three intervals define three loner values and the last several define history intervals.



  • History record

    History records archive a chronicle of the histogram. When you recollect statistics, Teradata Database saves the summary information from the previous histogram as a history interval in the new histogram. The system determines the number of history intervals based on a trend analysis. Teradata Database purges the history intervals that are no longer needed.

    A history interval contains MinValue, MaxValue, HighModeValue, HighMode frequency and some additional data.

    The number of history intervals maintained for each histogram is based on the following three criteria:

    • The significance of the history interval.

      The determination of number of history records to retain or purge is determined by their significance.

      The significance of a history interval depends on the significance of the interval’s participation in the most recent trend analysis. If the significance of a history interval is less than a user- or system-defined threshold, Teradata Database purges it from the histogram.

      The significance of a history record is calculated based on the significances of statistical values such as NumRow, UV, AllNull, NumNull, MinVal, and MaxVal from the history interval. The significance of a single statistical value is determined by the following rules.

      • If Teradata Database does not use the statistical value in recent trending analyses, its significance is 0.
      • If the statistical value does participate in recent trending analyses, its significance is proportional to an absolute difference of the reliabilities with and without the statistical value.
    • The minimum number of history records, which Teradata Database sets to 10 by default, indicates the minimum number of history intervals to retain that is determined to be significant. This default is chosen because a minimum of 10 history intervals is required to achieve the necessary reliability factor of a value greater than 0.9.
    • The maximum number of history records, which Teradata Database sets to 20 by default, indicates the maximum number of history intervals that are determined to be significant. This default is chosen to be twice the minimum value, which should be sufficient for any trend analysis.

Content and Storage of Histograms

Index and column histograms are stored in the Histogram column of DBC.StatsTbl. This column has a BLOB data type.

Rows for each histogram are separated into the following categories:

  • Nulls
  • Loners
  • Non-loners

Recall that a histogram can be composed of either of the following types of intervals in addition to a master record:

  • 499 high-biased (loner) intervals and one equal-height interval
  • All equal-height (non-loner) intervals, referred to as an equal-height interval histogram.
  • A mix of high-biased, history, and equal-height intervals, referred to as a compressed histogram.

The system stores histogram values in a format that is similar to a row structure. All the variable data is stored toward the end of the histogram, and their corresponding offsets are saved in the interval buckets. Data value sizes are constant in the histogram buckets because of the offset usage for variable-length columns. This enables the histogram buckets to have a fixed size, which makes it possible to randomly position to any bucket.

A value descriptor which is saved in the histogram header is used to encode and decode the data values of the histogram.

  • Summary statistics are recorded in the histogram header.

    Use the HELP STATISTICS statement to report the summary statistics for a histogram.

  • High-biased intervals.
  • Equal-height intervals.
  • History intervals.

Use the SHOW STATISTICS statement to report the detailed statistics for a histogram.

Teradata Database uses a value descriptor from the histogram header to encode and decode the data values of the histogram. If the histogram is built on multiple columns, Teradata Database maintains it as an array with each component element describing a single column. The array uses a single element to store single-column statistics and multiple elements to store multicolumn statistics. If an array element is a variable-length data type, Teradata Database stores a 4-byte offset in the interval for quick access. For fixed length types, the actual data is saved in the intervals. All the variable offsets are saved first followed by the fixed type data.

The histogram header contains all of the summary information and the information to encode and decode the interval buckets. The high-level structure of a histogram bucket is as follows, with data being stored in the order indicated.

  1. Biased values [(value_1), (value_2) …]
  2. Biased value frequencies []
  3. Equal height values [(Max, Mode) …]
  4. Equal height bucket counts [(modal frequency, other Vals, other rows) …]
  5. History record values [(Min, Max, HighMode, HighModeFreq) …]
  6. History record bucket counts [(OptHistoryIntervalDataType) …]
  7. Data values for variable-length columns.

    Teradata Database stores the corresponding offsets to variable-length columns in the histogram buckets.

There are 3 types of histograms in a Teradata Database interval histogram: high-biased, equal-height, and history. There is also the compressed histogram type, which mixes equal-height and high-biased intervals.

The offset to each bucket in the histogram is computed from its base offset, which is saved in the histogram header.

  • High-biased intervals

    High-biased intervals are used to represent a column or index set only when there is significant skew in the frequency distribution of its values.

    Teradata Database saves the loner values and their corresponding frequencies for high-biased intervals in the following general structure.

Statistic Description
Loner value The loner value stored in the interval.
Loner value frequency The number of rows that have this loner value.
  • Equal-height intervals

    Each equal-height interval contains a mode value for the interval and its frequency, the number of values other than the modal value in the interval, the number of other rows in the interval other than those having the mode frequency, and the lowest frequency among the Other values. Teradata Database uses the lowest frequency to determine the standard deviation in various estimates.

    The mode value is saved based on the value descriptor. The frequency and other values/rows are saved in the following general structure.

Statistic Description
Mode frequency The frequency of occurrence of the modal numeric value in the interval.
Other values The cardinality of non-modal values in the interval.
Other rows The cardinality of rows other than those having the modal frequency.
Lowest frequency The lowest frequency of other (non-modal) values in the interval.
  • Compressed histogram

    Compressed histograms contain a mix of equal-height and high-biased intervals. Compressed histograms are an improvement over pure equal-height histograms because they provide exact statistics for the values with the largest frequencies in the data. This is useful for join estimation, for example, which compares the largest values in relations to be joined.

  • History records

    This interval type records the history of the histogram. When you recollect statistics, Teradata Database saves the summary information from the previous histogram as a history interval in the histogram.

    Teradata Database determines how many history intervals to keep based on a linear trend analysis, and purges history records when they are no longer needed.

    The general structure of a history record is as follows:

Statistic Description
ANSI timestamp Statistics collection timestamp.
Number of high-biased (loners) values Number of high-biased values (loners) in the histogram.
Number of equal-height intervals Number of equal-height intervals in the histogram.
Usage type Description of whether the collected statistics are summary or detailed.
Sampling percentage If the collected statistics were sampled, states the sampling percentage used.
Sampling decision by If sampling was used to collect the statistics, states whether the sampling was system-determined or user-determined.
Equal-height interval deviations Deviation of value frequencies within the equal-height intervals in the histogram.
Number of nulls Number of partly null rows in the histogram.
Number of all nulls Number of all null rows in the histogram.
High mode frequency Frequency of occurrence of the highest mode numeric value in the histogram.
Number of values Number of distinct values in the histogram.
Number of rows Number of rows in the histogram.
CPU cost States how much CPU time was consumed collecting the statistics for the histogram.
I/O cost States how much input/output was consumed collecting the statistics for the histogram.
None Reserved for future use.
None Reserved for future use.

Data Types of Stored Statistics

Teradata Database stores all statistics using the native data type for the column. Because of this, there are no problems with loss of precision, truncation, or with oversized histograms caused by the data types of the values being stored.