Stale Statistics | Optimizer Process | VantageCloud Lake

Teradata® VantageCloud Lake

Deployment: VantageCloud
Edition: Lake
Product: Teradata Vantage
Published: January 2023
Last updated: 2024-12-11

Stale statistics are interval histogram statistics that no longer accurately describe the column sets on which they were originally collected.

Interval histogram statistics are the initial source of column demographic information for the Optimizer to use in making its estimates of join cardinality, rows per value, skew adjustments, and other pieces of information necessary for accurate query optimization.

The Optimizer also uses more dynamic methods of applying statistics, such as deriving column correlations from multicolumn statistics and adjusting and propagating the number of unique values after each binary join. These methods are developed and generalized consistently throughout the Optimizer.

Cost and cardinality estimation formulas and the strategies for deriving column correlations depend on the accuracy of the available statistics to produce optimal plans. If statistical inputs are inaccurate, the software may not produce reasonable plans. The most common reason for inaccurate statistics is staleness.

The process of detecting stale statistics and extrapolating data values when the Optimizer determines that its needed statistics are stale (see Using Extrapolation to Replace Stale Statistics) is designed to reduce how often statistics must be collected or recollected.

Detecting Stale Statistics

The estimated cardinality of a table is critical information that Vantage needs to detect stale histograms and to activate extrapolation.

Every histogram contains a cardinality value that is based on the number of rows in the table at the time the statistics were collected. The database compares the histogram row count with the newly calculated estimate of the table cardinality to determine whether a histogram is stale. If a histogram is stale, the Optimizer applies extrapolation techniques when doing selectivity estimations.
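As a quick way to see which histograms may be stale, you can inspect when statistics were last collected and compare that with known table growth. A minimal sketch; the table name is illustrative:

```sql
-- Show when statistics were last collected and the summary
-- demographics recorded in the histograms (illustrative table name).
HELP STATISTICS sales.orders;

-- Show the COLLECT STATISTICS requests behind those histograms.
SHOW STATISTICS ON sales.orders;
```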

Using Derived Statistics to Compensate for Stale Statistics

You can use one or both of the following methods to enable the database to recollect statistics only when a limit that you specify with a COLLECT STATISTICS request is exceeded. For details, see Optimal Times to Collect or Recollect Statistics and Optimizing the Recollection of Statistics.

For further details, see COLLECT STATISTICS (Optimizer Form).
  • Set a recollection threshold

    The THRESHOLD option enables you to instruct the database to skip recollecting statistics if either the change in the data from the last statistics collection or the age of the statistics is below the thresholds you specified when you initially collected the statistics.

    For more information, see Optimizing Statistics Recollection Using Threshold Options.

  • Collect sampled, rather than full, statistics
    The following methods exist for determining sampling defaults:
    • You can specify the sampling percentage for the database to use.
    • You can enable the database to determine the sampling percentage.

      The database chooses the sampling percentage to optimize recollections for a column or index set.

      For more information, see Optimizing Statistics Recollection Using Sampling.
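The two methods above can be expressed directly in COLLECT STATISTICS requests. A hedged sketch; the table, column names, and limit values are illustrative, not recommendations:

```sql
-- Skip recollection unless roughly 10% of the data has changed
-- or the statistics are more than 30 days old.
COLLECT STATISTICS
  USING THRESHOLD 10 PERCENT AND THRESHOLD 30 DAYS
  COLUMN (order_date)
ON sales.orders;

-- Collect sampled statistics at a percentage you choose;
-- alternatively, USING SYSTEM SAMPLE lets the database choose
-- the sampling percentage for the column or index set.
COLLECT STATISTICS
  USING SAMPLE 5 PERCENT
  COLUMN (customer_id)
ON sales.orders;
```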

There are four principal ways the cardinality estimation subsystem handles the problem of stale statistics:
  1. Adjusting the total cardinality
    The Optimizer uses table-level SUMMARY statistics to determine the growth of a table by comparing the cardinality estimate from the current dynamic AMP sample against the cardinality from the residual table-level statistics. The Optimizer then adds the estimated growth to the cardinality estimate from the SUMMARY statistics to estimate the current cardinality of the table.
    • If the primary index is not skewed and if the deviation between the sample and the interval histogram row count is more than the default percentage, the sample row count is taken as the cardinality of the table. The skew can be identified from the high modal frequency value stored in the interval histogram.
    • If no SUMMARY statistics are available, the sampled row count is taken to be the table cardinality as is done by the current Optimizer software.
  2. Adjusting the number of unique values

    For indexes, the cardinality is estimated as the total number of unique values of the indexed columns. The number of unique values is adjusted in the same way as for adjusting total cardinality.

    For unindexed columns, if the adjusted table row count deviates from the histogram row count by more than a defined percentage, then the number of unique values, which is the principal input to the join cardinality and costing formulas, is scaled up or down, assuming a uniform distribution, in the following scenarios:
    • The histogram is on a DATE column.
    • The histogram is on multiple columns that have a DATE column component.
    • The histogram is on a unique index.
    • The histogram is on a soft unique column. Soft uniqueness is defined as the case where the interval histogram statistics indicate that the number of unique values is close to the total cardinality for the table.
    The scaling operation assumes a constant number of rows per value. For example, assume that the table cardinality is estimated to be 100 rows from a dynamic AMP sample. If, for a column x1, the histogram indicates that the row count is 50 and the number of unique values is 25, meaning the number of Rows Per Value is 2, then the number of unique values is scaled up to 100 / 2 = 50.

    The value scaling, whether up or down, cannot be done unconditionally for all columns.

    For example, adjusting the values of product_id and business_unit_id columns makes no semantic sense, because their values are not arbitrary quantities and are not cardinal numbers.

  3. Considering statistics collection timestamps

    When statistics are inherited between a single-table non-sparse join index and its parent base table (in either direction), and statistics are available for a column in both the source and destination interval histograms, the system uses the more recent of the two, as determined by a timestamp comparison.

  4. Adjusting the confidence level

    If the available histogram was created from sampled statistics, then the system lowers their confidence level.
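The table-level SUMMARY statistics used in step 1 can be refreshed independently of the column histograms, which keeps the growth comparison cheap. A minimal sketch, assuming an illustrative table name:

```sql
-- Refresh only the table-level demographics (for example, row count),
-- which the Optimizer compares against a dynamic AMP sample to
-- estimate table growth (illustrative table name).
COLLECT SUMMARY STATISTICS ON sales.orders;
```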

Keep the statistics for skewed columns and indexes current.

The handling of stale statistics for single-table cardinality estimates is done using extrapolation methods.