Reasons to Use Sampled Statistics Instead of Full Statistics
There are occasions when it becomes difficult to expend the effort required to collect full-table statistics. For example, suppose you have a multibillion row table that requires several hours collection time, and a collection window that is too narrow to permit the operation. If you encounter situations like this when it is not possible to collect full-table statistics in a reasonable amount of time, you can opt to collect statistics on a sample of table rows instead.
You can specify several different statistics sampling mechanisms with the USING option of a COLLECT STATISTICS request, but you cannot collect sampled statistics with a SAMPLE USING option if a column that you specify is a component of the partitioning expression of a row-partitioned table or column-partitioned table.
You can collect statistics on such a column using the COLLECT STATISTICS USING option if the specified column is both a member of the partitioning expression column set and a member of an index column set.
Sampled statistics are different from dynamic AMP samples in that you can specify the percentage of rows you want to sample explicitly in a COLLECT STATISTICS (Optimizer Form) request to collect sampled statistics, while the number of AMPs from which dynamic AMP samples are collected and the time when those samples are collected is determined by Teradata Database, not by user choice. You can also specify that Teradata Database should determine the sampling percentage of COLLECT STATISTICS samples. Furthermore, sampled statistics produce a full set of collected statistics, while dynamic AMP samples collect only a subset of the statistics that are stored in interval histograms (see Dynamic AMP Sampling for details).
Statistical sampling is known to be a reliable method of gathering approximate statistical estimates when the appropriate preconditions are met.
When you use sampled statistics rather than full-table statistics, you are trading time in exchange for what are likely to be less accurate statistics. The underlying premise for using sampled statistics is usually that sampled statistics are better than no statistics.
Comparing the Accuracy of Sampled Statistics and Dynamic AMP Samples
Do not confuse statistical sampling with the dynamic AMP samples that the Optimizer collects when it has no statistics on which to base a query plan. Statistical samples taken across all AMPs are likely to be much more accurate than dynamic AMP samples.
The following table describes some of the differences between the 2 methods of collecting statistics:
|Statistical Sampling||Dynamic AMP Sampling|
|Collects statistics on a small sample of rows from all AMPs.
If the columns are not indexed, then the rows are organized randomly on each AMP, so Teradata Database just scans the first n percent of rows it finds, where the value of n is determined by the relative presence or absence of skew in the data. Conceivably, the entire sample could be taken from the first data block on each AMP, depending on the system configuration and cardinality of the table being sampled.
If the columns are indexed, then more sophisticated sampling is performed to take advantage of the hash-sequenced row ordering.
|Collects statistics on a small sample of rows from a single AMP.
This is the system default.
You can change the number of AMPs from which a dynamic AMP sample is taken by altering the value of an internal DBS Control field. Ask your Teradata Support Center representative for details.
|Collects full statistics and stores them in interval histograms in DBC.StatsTblStatsTbl.
See also Content and Storage of Histograms.
|Collects estimates for cardinality, number of distinct index values, and a few other statistics only and stores them in the data block descriptor.
Also collects and stores single-AMP and all-AMP cardinality estimates in interval histograms in DBC.StatsTblStatsTbl.
|Expands sample size dynamically to compensate for skew.||Sensitive to skew.|
|Provides fairly accurate estimates of all statistical parameters.||Provides fairly accurate estimates of base table and NUSI cardinality if the following conditions are met.
Other standard statistical parameters are less likely to be as accurate.
You cannot sample single-column PARTITION statistics at a level lower than 100%. You can submit a COLLECT STATISTICS request that specifies a lower percentage without the request aborting and without receiving an error message, but Teradata Database does not honor the specified percentage, and it automatically changes the sampling percentage to 100 internally (for details, see the documentation for “COLLECT STATISTICS (Optimizer Form)” in SQL Data Definition Language).
Recollecting Sampled Statistics
When you recollect statistics on a column or index, whether for the Optimizer or for a QCD analysis, Teradata Database uses the same method of collection to recollect the statistics as was used for the initial collection, either full-table scan or sampling. You can control the default sampling percentage for recollecting statistics using the cost profile constant StatsSysSampleOption or the DBS Control field SysSampleOption.
For more information, see Optimizer Cost Profiles or Utilities.
The only exception to this is the case where you specify USING SYSTEM SAMPLE when you first collect statistics, but do not specify the FOR CURRENT option for a recollection. The USING SYSTEM SAMPLE option initially collects statistics using a full-table scan, but then downgrades the initial 100% sampling to increasingly lower sampling percentages until it reaches a point where the sampling percentage is not reduced any further.
If you had collected statistics on the column as part of an index on a row-partitioned table, then Teradata Database follows the general rule and recollects sampled statistics on the index column.
If you would like to change the collection options, you must submit a new COLLECT STATISTICS request with the new collection options fully specified or change the default sampling percentage using the cost profile constant StatsSysSampleOption or the DBS Control field SysSampleOption. This does not apply for sampled COLUMN statistics on a component of the partitioning expression for a row-partitioned table or column-partitioned table, which are not valid, and which you cannot specify in a COLLECT STATISTICS request. Teradata Database always collects (and recollects) statistics on all the rows in the table for this particular case.
If you want to recollect statistics for a histogram, but override the options that have been saved for that histogram, you can specify the FOR CURRENT option. FOR CURRENT recollects statistics using whatever options you specify for the current COLLECT STATISTICS request only, and reverts to the saved options for subsequent recollections.
You can collect sampled statistics using either a system-determined sampling percentage or a user-specified sampling percentage. For a user-specified sample percentage, Teradata Database reads the specified percent of rows to collect statistics. All the existing rules apply if the sample percent is explicitly specified.
If you specify the USING SYSTEM SAMPLE option when you collect statistics, the Optimizer determines the appropriate time to switch from collecting full statistics to collecting sampled statistics. With this approach, the Optimizer decides both when to honor sampling and the degree of sampling, up to collecting full statistics, without sacrificing the quality of the collected demographics. The downgrade process works as follows.
- When you first submit a request to collect sampled statistics, Teradata Database collects full statistics, which is equivalent to specifying a 100% sample. This provides the Optimizer with data it can use to determine when to downgrade to a smaller percentage when recollecting statistics.
- On subsequent requests to recollect statistics, the Optimizer collects full statistics until it determines that it has captured enough historical data to enable it to know when to downgrade the sampling percentage for later recollections.
- With this historical data, the Optimizer can determine whether a column is skewed, rolling, or static.
- The Optimizer then considers the column usage (detailed or summary data) it maintains in the DBC.StatsTbl.UsageType column and evaluates the number of intervals you have specified to judge when to downgrade from collecting full statistics to collecting sampled statistics.
Teradata Database is more aggressive about downgrading to sampled statistics for those histograms whose summary data is used, but whose detailed data is not.
- The Optimizer next determines a sampling percentage to use for recollecting statistics, ranging from 2% to 100%.
It determines the best formula to use to scale the sampling percentage based on the history and nature of the column.
- In the final stage of the process, Teradata Database compares the recollected statistics to the column history for quality. The Optimizer only lowers the sampling percentage for subsequent recollections if it knows that a smaller sampling percentage provides sufficiently high quality results.
This process removes the burden of a determining which columns can accurately use sampled statistics and which cannot. By collecting full statistics initially, the process provides the Optimizer with the information it needs to make intelligent decisions about how much to sample in subsequent recollections of statistics.
For additional information, see the documentation for “COLLECT STATISTICS (Optimizer Form)” in SQL Data Definition Language.