15.10 - Stale Statistics - Teradata Database

Teradata Database SQL Request and Transaction Processing

Teradata Database
Programming Reference
User Guide

Stale statistics are interval histogram statistics that no longer represent an accurate description of the column sets on which they were originally collected.

Interval histogram statistics are the initial source of column demographic information for the Optimizer to use in making its estimates of join cardinality, rows per value, skew adjustments, and several other pieces of information necessary for accurate query optimization.

The Optimizer also employs more dynamic methods of using statistics such as deriving column correlations using multicolumn statistics, adjusting and propagating the number of unique values after each binary join is made, and so on. The methods are developed and generalized in a consistent manner throughout the Optimizer.

Cost and cardinality estimation formulas and the strategies for deriving column correlations depend on the accuracy of the available statistics to produce optimal plans. The software might not produce reasonable plans if its statistical inputs are not accurate. The most common reason statistics are not accurate is because they are stale.

The process of detecting stale statistics and extrapolating data values (see “Using Extrapolation to Replace Stale Statistics” on page 288) when the Optimizer determines that the statistics it needs are stale is designed to reduce the frequency of collecting or recollecting statistics.

The estimated cardinality of a table is critical information that Teradata Database needs to detect stale histograms and to activate extrapolation.

Every histogram contains a cardinality value that is based on the number of rows in the table at the time the statistics were collected. Teradata Database compares the histogram row count with the newly calculated estimate of the table cardinality (see “About Extrapolating Statistics” on page 291) to determine whether a histogram is stale. If the Optimizer determines that a histogram is stale, it applies extrapolation techniques when doing selectivity estimations.

The definition of stale statistics and the problems faced by the Optimizer when dealing with them were introduced earlier in this chapter in the following topics.

  • “Using Interval Histogram Statistics or a Dynamic AMP Sample for NUSI Subtable Statistics” on page 141
  • “An Example of How Stale Statistics Can Produce a Poor Query Plan” on page 144
  • You can use one or both of the following methods to enable Teradata Database to recollect statistics only when a limit that you specify with a COLLECT STATISTICS request is exceeded. See “Optimal Times to Collect or Recollect Statistics” on page 192 and “Optimizing the Recollection of Statistics” on page 283 for details about how to do this.

    For additional information, see “COLLECT STATISTICS (Optimizer Form)” in SQL Data Definition Language.

  • Set a recollection threshold
  • The THRESHOLD option enables you to instruct Teradata Database to skip recollecting statistics if either the change in the data from the last statistics collection or the age of the statistics is below the thresholds you specified when you initially collected the statistics.

    See “Optimizing Statistics Recollection Using Threshold Options” on page 283 for further information.

  • Collect sampled, rather than full, statistics
  • Two methods exist for determining sampling defaults.

  • You can specify the sampling percentage for Teradata Database to use.
  • You can enable Teradata Database to determine the sampling percentage to use based on a set of criteria that it chooses.
  • When Teradata Database determines the sampling percentage, it uses a method of downgrading statistics collection from a 100% sampling to a sampling rate it selects to optimize recollections for a particular column or index set.

    See “Optimizing Statistics Recollection Using Sampling” on page 285 for further information.

    You can specify the sampling percentage for Teradata Database to use to determine sampling defaults.

    The following text describes the four principal ways the cardinality estimation subsystem handles the problem of stale statistics.

  • Adjusting the total cardinality
  • The Optimizer uses table‑level SUMMARY statistics to determine the growth of a table. To do this, it compares the cardinality estimate from the current dynamic AMP sample against the cardinality from the residual table‑level statistics. The Optimizer then adds the estimated growth to the cardinality estimate from the SUMMARY statistics to estimate the current cardinality of the table.

  • If the primary index is not skewed and if the deviation between the sample and the interval histogram row count is more than the default percentage, the sample row count is taken as the cardinality of the table. The skew can be identified from the high modal frequency value stored in the interval histogram.
  • If no SUMMARY statistics are available, the sampled row count is taken to be the table cardinality as is done by the current Optimizer software.
  • Adjusting the number of unique values.
  • For indexes, the cardinality is estimated as the total number of unique values of the indexed columns.
  • In this case, the number of unique values is adjusted in the same way as the total rows described in the first bullet under Adjusting the total cardinality, above.

  • If the adjusted table row count deviates more than a defined percentage from the histogram row count on unindexed columns, then the number of unique values, which is the principal value input to the join cardinality and costing related formulas, is, assuming a uniform distribution, either scaled up or down for the following scenarios.
  • The histogram is on a DATE column.
  • The histogram is on multiple columns that have a DATE column component.
  • The histogram is on a unique index.
  • The histogram is on a softly unique column.
  • Soft uniqueness is defined as the case where the interval histogram statistics indicate that the number of unique values is close to the total cardinality for the table.

    The scaling operation assumes a constant number of rows per value. For example, assume that the table cardinality is estimated to be 100 rows from a dynamic AMP sample. If, for a column x1, the histogram indicates that the row count is 50 and the number of unique values is 25, meaning the number of Rows Per Value is 2, then the number of unique values is scaled up to .

    Note that the value scaling, whether up or down, cannot be done unconditionally for all columns.

    For example, columns like product_id, business_unit_id, and so on cannot have new adjusted values added to them because it makes no semantic sense. Their values are not only arbitrary, but also are not cardinal numbers, so it is meaningless to perform mathematical manipulations on them, particularly in this context.

  • Considering statistics collection timestamps
  • While inheriting the statistics from the single‑table non‑sparse join or hash index by the parent base table and vice versa, when statistics are available for a column in both the source and destination interval histograms, the system uses the most recent statistics of the 2 as determined by a timestamp comparison.

  • Adjusting the confidence level
  • If the available histogram was created from sampled statistics, then the system always lowers the their confidence level.

    You should always keep the statistics for skewed columns and indexes current.

    The handling of stale statistics for single‑table cardinality estimates is done using various extrapolation methods.