Extrapolating Statistics When Histogram Statistics Are Stale
Users often query their databases over date intervals for which one or both bounds on a predicate are dates for rows that have been inserted into a table since the last time statistics were collected for the date columns specified for that table in the predicate.
Extrapolating statistics enables you to submit, for example, a query specifying a date range predicate in which one or all of the specified dates is past the latest date stored in the statistical histograms for the DATE column set. To support such queries, the Optimizer applies an extrapolation technique to make a reasonable estimate for the rows that have been inserted after statistics were last collected without requiring them to be recollected. Extrapolation methods for statistics can be viewed as a form of derived statistics (see Derived Statistics).
In this context, the definition of a future date applies only to dates occurring between the time statistics were last collected on a DATE column and the current date. In other words, future is a narrowly defined, relative term, and extrapolation does not apply to data having true future dates that cannot exist at the time you submit your date-extrapolation-requiring queries. Rather, the term applies to dates that are otherwise in a statistical limbo.
- Table cardinality Vantage uses extrapolation from the following information to update table cardinalities.
- Histogram summary information
- System UDI counts if they are available, reliable, and not obsolete
- History data
- Single-AMP or all-AMP sampling
- Stale histograms, both for the number of distinct values and for the maximum value of the histogram
Extrapolation logic estimates the data growth from the last time statistics were collected or refreshed in order to detect stale statistics and apply extrapolations. The Optimizer combines cardinality estimates derived from both UDI counts and extrapolation logic to make accurate cardinality estimates, and making stale statistics detection more robust. UDI counts also enable the Optimizer to perform interpolations because of the improved accuracy of data changes provided by the tracking of UDI counts.
Vantage also uses extrapolation to estimate cardinalities based on dynamic single- or all-AMP samples.
The ability of the Optimizer to extrapolate statistics does not remove the need to recollect statistics.
Relative Accuracy of Residual Statistics Versus Dynamic AMP Sampled Statistics for Static Columns
The relative accuracy of residual (residual meaning existing statistics that were not collected recently) statistics for static columns with respect to a dynamic AMP sample (in this section, the term dynamic AMP sample refers to dynamic single-AMP sampling only.), and whether residual statistics should be considered to be stale, depends on several factors. Residual statistics are not necessarily stale statistics.
The term residual statistics implies the likelihood that those statistics no longer provide an accurate picture of current column data. This is not necessarily a correct assumption.
You can protect yourself from recollecting statistics unnecessarily by specifying various threshold options when you submit your initial COLLECT STATISTICS requests on an index or column set. See the information about COLLECT STATISTICS (Optimizer Form) in Teradata Vantage™ - SQL Data Definition Language Syntax and Examples, B035-1144 for more information about using threshold options to recollect statistics.
If the relative demographics for a column set do not change, then residual statistics are normally a reliable representation of the current demographics of an index or column set. The comparison made by the derived statistics framework between residual statistics, if they exist, and a dynamic AMP sample makes the final determination of whether the Optimizer considers the residual statistics to be stale or not.
It is possible for the cardinality of a table to grow by more than 10%, but for the relative proportion of particular values in its rows not to change. This so-called Ten Percent Rule also applies at the partitioning level for row-partitioned tables. If the number of changed row partitions exceeds 10% of the total number of row partitions for the table (in other words, if more than 10% of the rows are added to or deleted from a row partition), then you should recollect statistics on the index. For row-partitioned tables and column-partitioned tables, any refreshment of statistics should include the system-derived PARTITION column.
As a result of these safeguards, even a change in the table demographics of this magnitude might not affect the query plan generated by the Optimizer.
When this is true, residual statistics can still be more accurate than a newly collected dynamic AMP sample, though even a single-AMP dynamic AMP sample, depending to some degree on whether the column being sampled is indexed or not, typically provides excellent cardinality estimates (see, for example, Dynamic AMP Sampling). The accuracy of the statistics collected from a dynamic AMP sample also depends to a relatively small degree on the number of AMPs sampled, which is determined by the setting of an internal DBS Control field. The possibilities range through 1, 2, or 5 AMPs, all AMPs on a node, or all AMPs on a system. Consult your Teradata support representative for details.
Using Extrapolated Cardinality Estimates to Assess Table Growth
The Optimizer uses SUMMARY statistics to establish an estimate of the base cardinality of a table. Vantage then extrapolate the cardinality by comparing the saved dynamic sample from the SUMMARY statistics with the fresh sample. The system takes the difference between the new and old samples as the growth in cardinality of the table.
The Optimizer estimates table cardinalities from SUMMARY statistics, which you can collect or recollect explicitly by submitting an appropriate COLLECT SUMMARY STATISTICS request or by collecting or recollecting statistics on any column set that indirectly updates the SUMMARY statistics, and dynamic AMP samples or collected PARTITION statistics, which can be gathered for all tables, whether they are partitioned or not. The collection of SUMMARY statistics takes advantage of several optimizations that make the process run very quickly. The ideal choice, then, is to provide accurate cardinality estimates to the Optimizer using the method with the least overhead.
The database compares the fresh dynamic AMP sample fetched at the time a request is compiled with the sample from the SUMMARY statistics, and the change in growth of the table is determined from that comparison. Even if the absolute estimate of the sample is not exact, the rate of growth is accurate as long as it is uniform across all of the AMPs on the system because the same AMP is used to collect a “dynamic” statistical sample each time a request is compiled (see Dynamic AMP Sampling for details).
Some Comparison Scenarios on the Usefulness of Derived Statistics
Suppose the Optimizer evaluates statistics to determine if NUSIs are to be used in the query plan. The standard evaluation criterion for determining the usefulness of a NUSI in a query plan is that the number of rows that qualify per data block should be less than 1.
Assume an average data block size of 50 KB and an average row size of 50 bytes. This produces an average of 1,000 rows per data block. Suppose the number of rows for a particular NUSI is 1 in every 2,000 rows, or one row in every 2 data blocks. The Optimizer determines that using the NUSI will save reading some significant number of data blocks, so employing it in the query plan would result in fewer I/Os than doing a full-table scan.
Now, assume the table grows by 10%. The number of qualifying rows is now 1 in every 2,200 rows (a 10% increase in the cardinality of the table). For this particular case, the number of rows per data block is still less than 1, so the Optimizer does not need new statistics to produce a good query plan, and the derived statistics framework detects this.
On the other hand, consider a join scenario in which the Optimizer needs to estimate how many rows will qualify for spooling. This can be critical, especially if the original estimate is near the cusp of the crossover point where a 10% increase in the number of rows makes the Optimizer change its selection from one join plan to another.
Without the derived statistics framework being able to detect whether the residual statistics are stale or not, working with them could have meant that the Optimizer would have chosen a bad query plan instead of a new, faster plan. Or, worse still, the residual statistics could produce a new join plan that is much slower than the previous plan.