Extrapolating Statistics When Histogram Statistics Are Stale
Users often query their databases over date intervals for which one or both bounds on a predicate are dates for rows that have been inserted into a table since the last time statistics were collected for the date columns specified for that table in the predicate.
Extrapolating statistics enables you to submit, for example, a query specifying a date range predicate in which one or all of the specified dates is past the latest date stored in the statistical histograms for the DATE column set. To support such queries, the Optimizer applies an extrapolation technique to make a reasonable estimate for the rows that have been inserted after statistics were last collected without requiring recollection. Extrapolation methods for statistics can be viewed as a form of derived statistics (see Derived Statistics).
In this context, the definition of a future date applies only to dates occurring between the time statistics were last collected on a DATE column and the current date. That is, future is a narrowly defined, relative term, and extrapolation does not apply to data having true future dates that cannot exist at the time you submit your date-extrapolation-requiring queries. Rather, the term applies to dates that are otherwise in a statistical limbo.
- Table cardinality Vantage uses extrapolation from the following information to update table cardinalities.
- Histogram summary information
- System UDI counts if available and reliable
- History data
- Single-AMP or all-AMP sampling
- Stale histograms, both for the number of distinct values and for the maximum value of the histogram
Extrapolation logic estimates the data growth from the last time statistics were collected or refreshed to detect stale statistics and apply extrapolations. The Optimizer combines cardinality estimates derived from both UDI counts and extrapolation logic to make accurate cardinality estimates, and making stale statistics detection more robust. UDI counts also enable the Optimizer to perform interpolations because of the improved accuracy of data changes provided by the tracking of UDI counts.
Vantage also uses extrapolation to estimate cardinalities based on dynamic single- or all-AMP samples.
The ability of the Optimizer to extrapolate statistics does not remove the need to recollect statistics.
Relative Accuracy of Residual Statistics and Dynamic AMP Sampled Statistics for Static Columns
The relative accuracy of residual statistics (statistics not collected recently) for static columns with respect to a dynamic single-AMP sample, and whether residual statistics must be considered to be stale, depends on multiple factors. Residual statistics are not necessarily stale statistics.
The term residual statistics implies the likelihood that those statistics no longer provide an accurate picture of current column data. This is not necessarily a correct assumption.
You can protect yourself from recollecting statistics unnecessarily by specifying threshold options when you submit your initial COLLECT STATISTICS requests on an index or column set. See COLLECT STATISTICS (Optimizer Form) for more information about using threshold options to recollect statistics.
If the relative demographics for a column set do not change, then residual statistics are typically a reliable representation of the current demographics of an index or column set. The comparison made by the derived statistics framework between residual statistics, if any, and a dynamic AMP sample makes the final determination of whether the Optimizer considers the residual statistics to be stale or not.
The cardinality of a table may grow by more than 10%, but for the relative proportion of values in its rows not to change. This so-called Ten Percent Rule also applies at the partitioning level for row-partitioned tables. If the number of changed row partitions exceeds 10% of the total number of row partitions for the table (that is, if more than 10% of the rows are added to or deleted from a row partition), recollect statistics on the index. For row-partitioned tables and column-partitioned tables, any refreshment of statistics must include the system-derived PARTITION column.
As a result of these safeguards, even a change in the table demographics of this magnitude may not affect the query plan generated by the Optimizer.
When this is true, residual statistics can be more accurate than a newly collected dynamic AMP sample, though even a single-AMP dynamic sample, depending on whether the column being sampled is indexed, typically provides excellent cardinality estimates (see Dynamic AMP Sampling). The accuracy of the statistics collected from a dynamic AMP sample also depends to a relatively small degree on the number of AMPs sampled, which is determined by the setting of an internal field. The possibilities range through 1, 2, or 5 AMPs, all AMPs on a node, or all AMPs on a system.
Using Extrapolated Cardinality Estimates to Assess Table Growth
The Optimizer uses SUMMARY statistics to establish an estimate of the base cardinality of a table. Vantage then extrapolate the cardinality by comparing the saved dynamic sample from the SUMMARY statistics with the fresh sample. The system takes the difference between the new and old samples as the growth in cardinality of the table.
The Optimizer estimates table cardinalities from SUMMARY statistics. You can collect or recollect SUMMARY statistics with a COLLECT SUMMARY STATISTICS request or by collecting or recollecting statistics on any column set that indirectly updates the SUMMARY statistics, and dynamic AMP samples or collected PARTITION statistics (which can be gathered for all tables, partitioned or not). The collection of SUMMARY statistics takes advantage of optimizations that make the process run quickly. The ideal choice, is to provide accurate cardinality estimates to the Optimizer using the method with the least overhead.
The database compares the fresh dynamic AMP sample fetched at the time a request is compiled with the sample from the SUMMARY statistics, and the change in growth of the table is determined from that comparison. Even if the absolute estimate of the sample is not exact, the rate of growth is accurate if uniform across all of the AMPs on the system because the same AMP is used to collect a "dynamic" statistical sample each time a request is compiled (see Dynamic AMP Sampling for details).
Comparison Scenarios on the Usefulness of Derived Statistics
Suppose the Optimizer evaluates statistics to determine if NUSIs are to be used in the query plan. The standard evaluation criterion for determining the usefulness of a NUSI in a query plan is that the number of rows that qualify per data block must be less than 1.
Assume an average data block size of 50 KB and an average row size of 50 bytes. This produces an average of 1,000 rows per data block. Suppose the number of rows for a NUSI is 1 in every 2,000 rows, or one row in every 2 data blocks. The Optimizer determines that using the NUSI in the query plan causes fewer I/Os than doing a full-table scan.
Now, assume the table grows by 10%. The number of qualifying rows is now 1 in every 2,200 rows (a 10% increase in the cardinality of the table). For this case, the number of rows per data block is still less than 1, so the Optimizer does not need new statistics to produce a good query plan, and the derived statistics framework detects this.
However, consider a join scenario in which the Optimizer must estimate the number of rows that qualify for spooling. This can be critical, especially if the original estimate is near the cusp of the crossover point where a 10% increase in the number of rows makes the Optimizer change its selection from one join plan to another.
Without the derived statistics framework being able to detect whether the residual statistics are stale, working with the residual statistics may mean that the Optimizer chose a bad query plan instead of a new, faster plan. Or, worse still, the residual statistics could produce a new join plan that is much slower than the previous plan.