As was mentioned briefly in the topic Using Derived Statistics to Compensate for Stale Statistics, the COLLECT STATISTICS SQL statement supports two methods of restricting the database from recollecting statistics when there is no reason to do so (see Teradata Vantage™ - SQL Data Definition Language Syntax and Examples, B035-1144 for details about these methods).
To accomplish this, you can specify to collect sampled, rather than full, statistics.
- You can specify the sampling percentage for the database to use.
- You can enable the database to determine the sampling percentage to use based on a set of criteria that it chooses.
When the database determines the sampling percentage, it uses a method of downgrading statistics collection from a 100% sampling to a sampling rate it selects to optimize recollections for a particular column or index set.
- Set a recollection threshold
- Specify to collect sampled, rather than full, statistics Two methods exist for determining sampling defaults.
- You can specify the sampling percentage for the database to use.
- You can enable the database to determine the sampling percentage to use based on a set of criteria that it chooses.
When the database determines the sampling percentage, it uses a method of downgrading statistics collection from a 100% sampling to a sampling rate it selects to optimize recollections for a particular column or index set.
Optimizing Statistics Recollection Using Threshold Options
The COLLECT STATISTICS (Optimizer Form) statement supports several options to specify thresholds for recollecting statistics. These options enable you to specify various criteria that the database can use to determine whether it should ignore a request you submit to recollect statistics on a column or index set.
- Percentage of change in the data, or change-based recollection.
The criterion used is based on how much the data has changed since statistics were last collected, using the percentage of change since the last recollection of statistics for the specified column or index set as the threshold.
- Age, or time-based recollection.
The criterion used is based on the age of the statistics, using the number of days that have passed since the last recollection of statistics for the specified column or index set as the threshold.
When you submit a COLLECT STATISTICS request that specifies a threshold not to recollect statistics if the data change from the last statistics collection or if the age of the statistics is below the specified thresholds for the recollection of those statistics.
- Enabling the database to determine to determine appropriate thresholds.
When you specify a system-determined threshold option, the Optimizer automatically determine the appropriate thresholds to use based on previously collected statistics, column or index historical data, UDI counts, and other factors.
- Specifying your own user-determined thresholds.
When you specify a threshold, you do so by specifying a change percentage, an aging criterion based on a number of days, or both.
By specifying thresholds for recollecting statistics, you eliminate the responsibility and burden of determining when to refresh statistics. When you specify threshold options, you can then submit COLLECT STATISTICS requests at regular intervals, and the Optimizer can determine intelligently whether to skip recollecting specified statistics whose thresholds are not met, to recollect statistics for those columns or indexes that are over the threshold, or both.
For example, if either you or the database determines the threshold to be a 10% data change, and you submit a request to recollect statistics on a column or index set, the Optimizer does not recollect those statistics if the change in the data since the last collection of statistics is less than 10%.
If you decide that you want to recollect a set of statistics regardless of the specified thresholds, you can submit your COLLECT STATISTICS request and specify the keyword phrase FOR CURRENT with the option set you want to override. In this case, the database ignores the saved thresholds for the current recollection only and returns to the saved thresholds the next time you recollect statistics for the specified column or index set.
If you specify one or more thresholds the first time you collect statistics on a column or index set, the Optimizer collects the statistics and saves the threshold options that you specify. When you recollect statistics on a column or index set, the database uses the saved threshold options to determine whether the current statistics recollection request should be ignored or obeyed.
If you specify a new threshold for recollecting statistics, the new threshold overrides the previously existing threshold, and the database applies the specified threshold for the current request and stores it for future recollections of the specified statistics.
- Change-based methods
- Time-based methods
For change-based thresholds, the Optimizer consults the system-maintained UDI counts, random AMP samples, and any available column or index history to determine the extent of data change from the last statistics collection.
If you specify either the SYSTEM THRESHOLD or SYSTEM THRESHOLD PERCENT options for your COLLECT STATISTICS requests, the Optimizer considers column and index histories and extrapolations methods to determine the appropriate change threshold.
See Using Extrapolation to Replace Stale Statistics for information about the various extrapolations the Optimizer can use.
The Optimizer also evaluates information such as small table NUSI statistics to determine whether to recollect the requested statistics or not. For such small database objects, the Optimizer tends to collect full statistics because collecting full statistics provides up-to-date statistics for a small cost.
Time-based thresholds cause the Optimizer to evaluate the age of current statistics to determine whether they need to be refreshed or not.
Of course, you can also specify various combinations of change-based and time-based thresholds in addition to various sampling options to reduce the cost of recollecting statistics for a column or index set.
Using the thresholds approach enables you to submit requests to recollect statistics regularly without needing to determine whether it is too soon to do so because if the specified recollection thresholds are not met, the database avoids recollecting the specified statistics. By not recollecting unnecessary statistics, the database can avoid wasting costly CPU and I/O resources to recollect statistics unnecessarily. And as mentioned earlier, by using system-determined thresholds, the Optimizer can determine automatically whether a recollection of statistics is required or whether extrapolation is an adequate substitute method of estimating the statistics values for the specified column or index set.
You can query the Data Dictionary table column DBC.StatsTbl.LastCollectTimeStamp to determine whether your COLLECT STATISTICS requests to recollect statistics have been recognized or skipped. See Teradata Vantage™ - Data Dictionary, B035-1092 for more information about DBC.StatsTbl and the system views you can use to access the table.
Optimizing Statistics Recollection Using Sampling
You can collect sampled statistics when you submit a COLLECT STATISTICS request using either a system-determined sampling percentage or a user-specified sampling percentage.
If you specify a system-determined sampling percentage, the database uses a downgrade approach to determine the appropriate time to switch from collecting full statistics to sampling. With this approach, the Optimizer determines when to honor sampling and to what extend, up to collecting full statistics, without sacrificing the quality. The downgrade approach works as follows:
- When you first submit a COLLECT STATISTICS request that specifies sampling, the Optimizer initially collects full statistics. Collecting full statistics provides the Optimizer with enough information to determine when it can downgrade to a smaller percentage.
This is unlike first collecting a small sample and trying to determine if a larger sample would be more appropriate.
- On subsequent requests to recollect the same statistics, the Optimizer continues to collect full statistics until it has collected enough statistics to capture an adequate statistical history.
- Once it has collected an adequate statistics history, the Optimizer can recognize the nature of the column or index, whether it is skewed, rolling, or static, and so on.
The Optimizer then considers the column usage from either detailed buckets or merely summary data that it maintains in the DBC.StatsTbl.UsageType column and the user-specified number of intervals to determine the appropriate time to downgrade from collecting full statistics to collecting sampled statistics.
Vantage is more aggressive in downgrading from collecting full statistics to collecting sampled statistics for the histograms whose summary data is used but not the detailed intervals.
- The Optimizer then determines a sample percentage (2% to 100%) for recollecting statistics and automatically determines the appropriate formula to use for scaling based on the history and nature of the column.
- The recollected statistics are compared to the history for quality. The percentage is lowered only if the Optimizer can determine safely that a smaller percentage provides quality results.
This approach removes your responsibility and burden to make a decision on which columns to collect sampled statistics. Moreover, by collecting full statistics initially, it provides the Optimizer with information needed for making an intelligent decision about how much to sample in subsequent recollections.
For example, for small tables, skewed columns, column that is member of the partitioning expression column set and columns that require detailed buckets, the Optimizer never switches to sample statistics at less than 100%.
As another example, the Optimizer can detect very nonunique columns in the full statistics and, in subsequent recollects, it can intelligently switch to sampled statistics at a much lower percentage than 100%, perhaps even as low as 2%, apply the appropriate scaling formula applicable for nonunique columns to obtain the extrapolated full statistics, and obtain quality statistics much faster than with full statistics.
For rolling, unique, and near-unique columns as detected in the full statistics, the Optimizer can subsequently collect sample statistics at a much lower percentage and apply a linear scaling formula to produce the extrapolated full statistics.
After only a few sample statistics collections (the number is based on the demographic pattern and UDI counts of the data), the Optimizer can decide to recollect statistics using a 100% statistics collection to verify and adjust as needed the sample percentage and the extrapolation method used.
Sampled statistics are not supported for single-table views.
You can query the dictionary table column DBC.StatsTbl.SampleSizePct to determine the sample percentage used by the database to collect the statistics. For a user-specified sample percentage, the actual sample size used to build the histogram might be the same, slightly higher, or slightly lower than the specified sampling percentage.