Stale statistics are interval histogram statistics that no longer accurately describe the column sets on which they were originally collected.
Interval histogram statistics are the initial source of column demographic information that the Optimizer uses to estimate join cardinalities, rows per value, skew adjustments, and several other quantities necessary for accurate query optimization.
The Optimizer also employs more dynamic methods of using statistics, such as deriving column correlations from multicolumn statistics and adjusting and propagating the number of unique values after each binary join is made. These methods are developed and generalized in a consistent manner throughout the Optimizer.
Cost and cardinality estimation formulas, and the strategies for deriving column correlations, depend on the accuracy of the available statistics to produce optimal plans. The software might not produce reasonable plans if its statistical inputs are not accurate, and the most common reason statistics are not accurate is that they are stale.
The process of detecting stale statistics, and of extrapolating data values when the Optimizer determines that the statistics it needs are stale (see “Using Extrapolation to Replace Stale Statistics” on page 288), is designed to reduce how frequently you must collect or recollect statistics.
The estimated cardinality of a table is critical information that Teradata Database needs to detect stale histograms and to activate extrapolation.
Every histogram contains a cardinality value that is based on the number of rows in the table at the time the statistics were collected. Teradata Database compares the histogram row count with the newly calculated estimate of the table cardinality (see “About Extrapolating Statistics” on page 291) to determine whether a histogram is stale. If the Optimizer determines that a histogram is stale, it applies extrapolation techniques when making its selectivity estimates.
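For example, you can see the row count that a histogram records by displaying the detailed statistics for a column. The following sketch assumes a hypothetical orders table that already has statistics collected on the o_orderdate column; the exact syntax and output vary by release (see “SHOW STATISTICS” in SQL Data Definition Language).

     /* Hypothetical example: display the detailed histogram, including the   */
     /* table row count recorded when the statistics were collected.          */
     SHOW STATISTICS VALUES COLUMN (o_orderdate) ON orders;

The Optimizer compares the row count stored in the histogram with its current estimate of the table cardinality to decide whether the histogram is stale.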
The definition of stale statistics and the problems they pose for the Optimizer are introduced in earlier topics in this chapter.
You can use one or both of the following methods to enable Teradata Database to recollect statistics only when a limit that you specify with a COLLECT STATISTICS request is exceeded. See “Optimal Times to Collect or Recollect Statistics” on page 192 and “Optimizing the Recollection of Statistics” on page 283 for details about how to do this.
For additional information, see “COLLECT STATISTICS (Optimizer Form)” in SQL Data Definition Language.
The THRESHOLD option enables you to instruct Teradata Database to skip recollecting statistics if the change in the data since the last statistics collection, or the age of the statistics, is below the corresponding threshold you specified when you initially collected the statistics.
See “Optimizing Statistics Recollection Using Threshold Options” on page 283 for further information.
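The following hypothetical request illustrates the general form of the THRESHOLD option. The table and column names and the limits are placeholders only; see “COLLECT STATISTICS (Optimizer Form)” in SQL Data Definition Language for the complete syntax.

     /* Hypothetical example: a change threshold of 10 percent and an age     */
     /* threshold of 30 days. Teradata Database skips the recollection when   */
     /* the specified thresholds are not exceeded.                            */
     COLLECT STATISTICS
       USING THRESHOLD 10 PERCENT AND THRESHOLD 30 DAYS
       COLUMN (o_orderdate)
       ON orders;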
Two methods exist for determining sampling defaults: Teradata Database can determine the sampling percentage for you, or you can specify it yourself.
When Teradata Database determines the sampling percentage, it downgrades statistics collection from a 100% sampling to a sampling rate it selects to optimize recollections for a particular column or index set. See “Optimizing Statistics Recollection Using Sampling” on page 285 for further information.
Alternatively, you can specify the sampling percentage for Teradata Database to use, as the sketch following this paragraph illustrates.
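The following hypothetical requests illustrate the two approaches. All names are placeholders, and the complete syntax is documented under “COLLECT STATISTICS (Optimizer Form)” in SQL Data Definition Language.

     /* Let Teradata Database choose the sampling percentage when it          */
     /* recollects statistics on this column (hypothetical names).            */
     COLLECT STATISTICS
       USING SYSTEM SAMPLE
       COLUMN (o_custkey)
       ON orders;

     /* Specify the sampling percentage explicitly.                           */
     COLLECT STATISTICS
       USING SAMPLE 10 PERCENT
       COLUMN (o_custkey)
       ON orders;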
The following text describes the four principal ways the cardinality estimation subsystem handles the problem of stale statistics.
The Optimizer uses table‑level SUMMARY statistics to determine the growth of a table. To do this, it compares the cardinality estimate from the current dynamic AMP sample against the cardinality from the residual table‑level statistics. The Optimizer then adds the estimated growth to the cardinality estimate from the SUMMARY statistics to estimate the current cardinality of the table.
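As a sketch of the inputs involved, you can refresh table-level SUMMARY statistics with a request such as the following hypothetical example (the table name is a placeholder). SUMMARY statistics record only table-level demographics such as the cardinality, so they are inexpensive to recollect.

     /* Refresh only the table-level (SUMMARY) statistics, including the row count. */
     COLLECT SUMMARY STATISTICS ON orders;

For example, if the SUMMARY statistics recorded a cardinality of 1,000,000 rows and the current dynamic AMP sample estimates 1,100,000 rows, the Optimizer treats the difference of 100,000 rows as growth and uses approximately 1,100,000 rows as the current cardinality of the table.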
In the case of soft uniqueness, the number of unique values is adjusted in the same way as the total row count in the total cardinality adjustment described above. Soft uniqueness is defined as the case where the interval histogram statistics indicate that the number of unique values is close to the total cardinality of the table.
The scaling operation assumes a constant number of rows per value. For example, assume that the table cardinality is estimated to be 100 rows from a dynamic AMP sample. If, for a column x1, the histogram indicates that the row count is 50 and the number of unique values is 25, meaning that the number of rows per value is 2, then the number of unique values is scaled up to 100/2 = 50.
Note that the value scaling, whether up or down, cannot be done unconditionally for all columns.
For example, columns like product_id, business_unit_id, and so on cannot have new adjusted values added to them because it makes no semantic sense. Their values are not only arbitrary, but also are not cardinal numbers, so it is meaningless to perform mathematical manipulations on them, particularly in this context.
When the parent base table inherits the statistics from a single‑table, non‑sparse join index or hash index, or vice versa, and statistics are available for a column in both the source and destination interval histograms, the system uses the more recent of the two sets of statistics, as determined by a timestamp comparison.
If the available histogram was created from sampled statistics, the system always lowers its confidence level.
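As a sketch of this scenario, consider a hypothetical single‑table, non‑sparse join index defined on a sales base table, with statistics collected on the same column in both structures. All object and column names here are placeholders.

     /* Hypothetical single-table, non-sparse join index on the sales base table. */
     CREATE JOIN INDEX sales_ji AS
       SELECT store_id, sale_date, sale_amount
       FROM sales;

     /* Statistics on the same column in both the base table and the join index. */
     /* When both histograms are available for store_id, the system uses the set */
     /* of statistics with the more recent collection timestamp.                 */
     COLLECT STATISTICS COLUMN (store_id) ON sales;
     COLLECT STATISTICS COLUMN (store_id) ON sales_ji;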
You should always keep the statistics for skewed columns and indexes current.
Teradata Database handles stale statistics for single‑table cardinality estimates using various extrapolation methods.