Stale statistics are interval histogram statistics that no longer accurately describe the column sets on which they were originally collected.
Interval histogram statistics are the initial source of column demographic information for the Optimizer to use in making its estimates of join cardinality, rows per value, skew adjustments, and other pieces of information necessary for accurate query optimization.
The Optimizer also uses statistics in more dynamic ways, such as deriving column correlations from multicolumn statistics and adjusting and propagating the number of unique values after each binary join. These methods are developed and generalized in a consistent manner throughout the Optimizer.
Cost and cardinality estimation formulas and the strategies for deriving column correlations depend on the accuracy of the available statistics to produce optimal plans. If statistical inputs are inaccurate, the software may not produce reasonable plans. The most common reason for inaccurate statistics is staleness.
The process of detecting stale statistics and extrapolating data values (see Using Extrapolation to Replace Stale Statistics) when the Optimizer determines that its needed statistics are stale is designed to reduce the frequency of collecting or recollecting statistics.
Detecting Stale Statistics
The estimated cardinality of a table is critical information that Vantage needs to detect stale histograms and to activate extrapolation.
Every histogram contains a cardinality value that is based on the number of rows in the table at the time the statistics were collected. The database compares the histogram row count with the newly calculated estimate of the table cardinality to determine whether a histogram is stale. If a histogram is stale, the Optimizer applies extrapolation techniques when doing selectivity estimations.
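The comparison described above can be sketched as follows. The function name, its inputs, and the 10% default deviation are illustrative assumptions, not actual Vantage internals:

```python
def is_histogram_stale(histogram_row_count, current_cardinality_estimate,
                       deviation_threshold=0.10):
    """Flag a histogram as stale when the current table-cardinality
    estimate deviates from the row count recorded in the histogram
    by more than the threshold (the 10% default is an assumption)."""
    if histogram_row_count == 0:
        return current_cardinality_estimate > 0
    deviation = (abs(current_cardinality_estimate - histogram_row_count)
                 / histogram_row_count)
    return deviation > deviation_threshold
```

When the function returns true for a histogram, the Optimizer would fall back to extrapolation rather than using the stored demographics directly.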
Using Derived Statistics to Compensate for Stale Statistics
You can use one or both of the following methods to enable the database to recollect statistics only when a limit that you specify with a COLLECT STATISTICS request is exceeded. For details, see Optimal Times to Collect or Recollect Statistics and Optimizing the Recollection of Statistics.
- Set a recollection threshold
The THRESHOLD option enables you to instruct the database to skip recollecting statistics if either the change in the data from the last statistics collection or the age of the statistics is below the thresholds you specified when you initially collected the statistics.
For more information, see Optimizing Statistics Recollection Using Threshold Options.
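Reading the THRESHOLD description literally, recollection is skipped while either measure remains below its threshold, so it proceeds only once both are reached. A minimal sketch under that assumption (all names are illustrative):

```python
def should_recollect(change_pct, stats_age_days,
                     change_threshold_pct, age_threshold_days):
    """Skip recollection if either the data change since the last
    collection or the age of the statistics is still below its
    threshold; recollect only when both thresholds are reached."""
    return (change_pct >= change_threshold_pct
            and stats_age_days >= age_threshold_days)
```

For example, with a 10 percent change threshold and a 30 day age threshold, a table that changed 15 percent but whose statistics are only 10 days old would not yet trigger recollection.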
- Collect sampled, rather than full, statistics
The following methods exist for determining sampling defaults:
- You can specify the sampling percentage for the database to use.
- You can enable the database to determine the sampling percentage.
The database chooses the sampling percentage to optimize recollections for a column or index set.
For more information, see Optimizing Statistics Recollection Using Sampling.
- Adjusting the total cardinality
The Optimizer uses table-level SUMMARY statistics to determine the growth of a table, comparing the cardinality estimate from the current dynamic AMP sample against the cardinality recorded in the table-level SUMMARY statistics. The Optimizer then adds the estimated growth to the cardinality estimate from the SUMMARY statistics to estimate the current cardinality of the table.
- If the primary index is not skewed and if the deviation between the sample and the interval histogram row count is more than the default percentage, the sample row count is taken as the cardinality of the table. The skew can be identified from the high modal frequency value stored in the interval histogram.
- If no SUMMARY statistics are available, the sampled row count is taken to be the table cardinality as is done by the current Optimizer software.
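The decision rules above can be sketched as follows. All names, the 10% default deviation, and the simplified skew test are illustrative assumptions:

```python
def estimate_current_cardinality(summary_cardinality, amp_sample_estimate,
                                 high_modal_frequency,
                                 deviation_default=0.10):
    """Choose a table-cardinality estimate from SUMMARY statistics and
    a dynamic AMP sample, per the rules described above."""
    if summary_cardinality is None:
        # No SUMMARY statistics: the sampled row count is the estimate.
        return amp_sample_estimate
    deviation = (abs(amp_sample_estimate - summary_cardinality)
                 / summary_cardinality)
    # Skew is inferred from the high modal frequency stored in the
    # histogram; a simple relative test stands in for the real check.
    pi_skewed = high_modal_frequency > 0.5 * summary_cardinality
    if not pi_skewed and deviation > deviation_default:
        # Unskewed primary index and a large deviation: take the
        # sample row count as the table cardinality.
        return amp_sample_estimate
    # Otherwise keep the cardinality recorded with the SUMMARY statistics.
    return summary_cardinality
```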
- Adjusting the number of unique values.
For indexes, the cardinality is estimated as the total number of unique values of the indexed columns. The number of unique values is adjusted in the same way as for adjusting total cardinality.
If the adjusted table row count deviates by more than a defined percentage from the histogram row count on unindexed columns, then the number of unique values, which is the principal input to the join cardinality and costing formulas, is scaled up or down, assuming a uniform distribution, in the following scenarios:
- The histogram is on a DATE column.
- The histogram is on multiple columns that have a DATE column component.
- The histogram is on a unique index.
- The histogram is on a soft unique column. Soft uniqueness is defined as the case where the interval histogram statistics indicate that the number of unique values is close to the total cardinality for the table.
The value scaling, whether up or down, cannot be applied unconditionally to all columns.
For example, scaling the values of product_id and business_unit_id columns makes no semantic sense, because their values are arbitrary identifiers, not cardinal numbers.
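The unique-value scaling under a uniform-distribution assumption can be sketched as follows; the `eligible` flag stands for the scenarios listed above (DATE column, unique index, soft-unique column, and so on), and all names and the 10% default are assumptions:

```python
def scale_unique_values(histogram_uniques, histogram_row_count,
                        adjusted_row_count, eligible,
                        deviation_default=0.10):
    """Scale the number of unique values in proportion to table
    growth, assuming values are uniformly distributed over rows.
    Only eligible histograms are scaled; others are left unchanged."""
    if not eligible or histogram_row_count == 0:
        return histogram_uniques
    deviation = (abs(adjusted_row_count - histogram_row_count)
                 / histogram_row_count)
    if deviation <= deviation_default:
        # Deviation within tolerance: no scaling needed.
        return histogram_uniques
    # Uniform distribution: uniques grow or shrink with the row count.
    return round(histogram_uniques * adjusted_row_count
                 / histogram_row_count)
```

For instance, a DATE column histogram with 365 unique values on a table that has doubled in size would be scaled to roughly 730 values, while a product_id column would be marked ineligible and left alone.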
- Considering statistics collection timestamps
When statistics are inherited between a single-table non-sparse join index and its parent base table (in either direction), and statistics are available for a column in both the source and destination interval histograms, the system uses the more recent of the two sets of statistics, as determined by a timestamp comparison.
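The timestamp comparison amounts to a simple selection; the `(collection_timestamp, histogram)` pair shape used here is an illustrative assumption:

```python
def pick_inherited_stats(base_stats, join_index_stats):
    """Choose between base-table and join-index statistics for the
    same column, keeping the more recently collected set. Each
    argument is a (collection_timestamp, histogram) pair or None."""
    if base_stats is None:
        return join_index_stats
    if join_index_stats is None:
        return base_stats
    return (base_stats if base_stats[0] >= join_index_stats[0]
            else join_index_stats)
```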
- Adjusting the confidence level
If the available histogram was created from sampled statistics, then the system lowers the confidence level associated with its estimates.
Keep the statistics for skewed columns and indexes current.
The handling of stale statistics for single-table cardinality estimates is done using extrapolation methods.