Stale Statistics | Optimizer Process | Teradata Vantage - 17.10 - Stale Statistics - Advanced SQL Engine - Teradata Database

Teradata Vantage™ - SQL Request and Transaction Processing

Product
Release Number
17.10
Release Date
July 2021
Content Type
Programming Reference
User Guide
Publication ID
B035-1142-171K
Language
English (United States)

Stale statistics are interval histogram statistics that no longer represent an accurate description of the column sets on which they were originally collected.

Interval histogram statistics are the initial source of column demographic information for the Optimizer to use in making its estimates of join cardinality, rows per value, skew adjustments, and several other pieces of information necessary for accurate query optimization.

The Optimizer also employs more dynamic methods of using statistics such as deriving column correlations using multicolumn statistics, adjusting and propagating the number of unique values after each binary join is made, and so on. The methods are developed and generalized in a consistent manner throughout the Optimizer.

Cost and cardinality estimation formulas and the strategies for deriving column correlations depend on the accuracy of the available statistics to produce optimal plans. The software might not produce reasonable plans if its statistical inputs are not accurate. The most common reason statistics are not accurate is because they are stale.

The process of detecting stale statistics and extrapolating data values (see Using Extrapolation to Replace Stale Statistics) when the Optimizer determines that the statistics it needs are stale is designed to reduce the frequency of collecting or recollecting statistics.

Detecting Stale Statistics

The estimated cardinality of a table is critical information that Vantage needs to detect stale histograms and to activate extrapolation.

Every histogram contains a cardinality value that is based on the number of rows in the table at the time the statistics were collected. The database compares the histogram row count with the newly calculated estimate of the table cardinality to determine whether a histogram is stale. If the Optimizer determines that a histogram is stale, it applies extrapolation techniques when doing selectivity estimations.

Using Derived Statistics to Compensate for Stale Statistics

You can use one or both of the following methods to enable the database to recollect statistics only when a limit that you specify with a COLLECT STATISTICS request is exceeded. For details, see Optimal Times to Collect or Recollect Statistics and Optimizing the Recollection of Statistics.

For further details, see the information about COLLECT STATISTICS (Optimizer Form) in Teradata Vantage™ - SQL Data Definition Language Syntax and Examples, B035-1144.
• Set a recollection threshold

The THRESHOLD option enables you to instruct the database to skip recollecting statistics if either the change in the data from the last statistics collection or the age of the statistics is below the thresholds you specified when you initially collected the statistics.

• Collect sampled, rather than full, statistics
The following methods exist for determining sampling defaults:
• You can specify the sampling percentage for the database to use.
• You can enable the database to determine the sampling percentage to use based on a set of criteria that it chooses.

When the database determines the sampling percentage, it uses a method of downgrading statistics collection from a 100% sampling to a sampling rate it selects to optimize recollections for a particular column or index set.

You can specify the sampling percentage for the database to use to determine sampling defaults.

There are four principal ways the cardinality estimation subsystem handles the problem of stale statistics:
The Optimizer uses table-level SUMMARY statistics to determine the growth of a table. To do this, it compares the cardinality estimate from the current dynamic AMP sample against the cardinality from the residual table-level statistics. The Optimizer then adds the estimated growth to the cardinality estimate from the SUMMARY statistics to estimate the current cardinality of the table.
• If the primary index is not skewed and if the deviation between the sample and the interval histogram row count is more than the default percentage, the sample row count is taken as the cardinality of the table. The skew can be identified from the high modal frequency value stored in the interval histogram.
• If no SUMMARY statistics are available, the sampled row count is taken to be the table cardinality as is done by the current Optimizer software.
2. Adjusting the number of unique values.

For indexes, the cardinality is estimated as the total number of unique values of the indexed columns. In this case, the number of unique values is adjusted in the same way as for adjusting total cardinality.

If the adjusted table row count deviates more than a defined percentage from the histogram row count on unindexed columns, then the number of unique values, which is the principal value input to the join cardinality and costing related formulas, is, assuming a uniform distribution, either scaled up or down for the following scenarios:
• The histogram is on a DATE column.
• The histogram is on multiple columns that have a DATE column component.
• The histogram is on a unique index.
• The histogram is on a soft unique column. Soft uniqueness is defined as the case where the interval histogram statistics indicate that the number of unique values is close to the total cardinality for the table.
The scaling operation assumes a constant number of rows per value. For example, assume that the table cardinality is estimated to be 100 rows from a dynamic AMP sample. If, for a column x1, the histogram indicates that the row count is 50 and the number of unique values is 25, meaning the number of Rows Per Value is 2, then the number of unique values is scaled up to .

Note that the value scaling, whether up or down, cannot be done unconditionally for all columns.

For example, columns like product_id, business_unit_id, and so on cannot have new adjusted values added to them because it makes no semantic sense. Their values are not only arbitrary, but also are not cardinal numbers, so it is meaningless to perform mathematical manipulations on them, particularly in this context.

3. Considering statistics collection timestamps

While inheriting the statistics from the single-table non-sparse join or hash index by the parent base table and vice versa, when statistics are available for a column in both the source and destination interval histograms, the system uses the most recent statistics of the 2 as determined by a timestamp comparison.