Sampled Statistics | Optimizer Process | VantageCloud Lake - Sampled Statistics - Teradata Vantage

Teradata® VantageCloud Lake

Deployment
VantageCloud
Edition
Lake
Product
Teradata Vantage
Published
January 2023
Language
English (United States)
Last Update
2024-04-03
dita:mapPath
phg1621910019905.ditamap
dita:ditavalPath
pny1626732985837.ditaval
dita:id
phg1621910019905

Reasons to Use Sampled Statistics Instead of Full Statistics

Expending the effort required to collect full-table statistics may become difficult. For example, suppose you have a multibillion row table that requires hours of collection time, and a collection window too narrow for the operation. Instead, you can collect statistics on a sample of table rows.

You can specify multiple statistics sampling mechanisms with the USING option of a COLLECT STATISTICS request, but you cannot collect sampled statistics with a SAMPLE USING option if a column that you specify is a component of the partitioning expression of a row-partitioned table or column-partitioned table.

You can collect statistics on such a column using the COLLECT STATISTICS USING option if the specified column is both a member of the partitioning expression column set and a member of an index column set.

Sampled statistics are different from dynamic AMP samples in that you can specify the percentage of rows you want to sample explicitly in a COLLECT STATISTICS (Optimizer Form) request to collect sampled statistics, while the number of AMPs from which dynamic AMP samples are collected and the time when those samples are collected is determined by Vantage. You can also specify that Vantage must determine the sampling percentage of COLLECT STATISTICS samples. Furthermore, sampled statistics produce a full set of collected statistics, while dynamic AMP samples collect only a subset of the statistics that are stored in interval histograms (see Dynamic AMP Sampling for details).

Statistical sampling is known to be a reliable method of gathering approximate statistical estimates when the appropriate preconditions are met.

The quality of the statistics collected with full-table sampling is not guaranteed to be as good as the quality of statistics collected on an entire table without sampling. Do not think of sampled statistics as an alternative to collecting full-table statistics, but as an alternative to never, or rarely, collecting statistics.

When you use sampled statistics, statistics may be less accurate, but are better than no statistics.

Comparing Accuracy of Sampled Statistics and Dynamic AMP Samples

Do not confuse statistical sampling with the dynamic AMP samples that the Optimizer collects to create a query plan. Statistical samples taken across all AMPs are likely to be much more accurate than dynamic AMP samples.

The following table describes differences between the two methods of collecting statistics.

Statistical Sampling Dynamic AMP Sampling
Collects statistics on a small sample of rows from all AMPs.

If the columns are not indexed, the rows are organized randomly on each AMP, so Vantage only scans the first n percent of found rows, where the value of n is determined by the relative presence or absence of skew in the data. The entire sample can be taken from the first data block on each AMP, depending on the system configuration and cardinality of the table being sampled.

If the columns are indexed, then more sophisticated sampling is performed to take advantage of the hash-sequenced row ordering.

Collects statistics on a small sample of rows from a single AMP.

This is the system default.

Collects full statistics and stores them in interval histograms in DBC.StatsTblStatsTbl.

See also Interval Histograms.

Collects estimates for cardinality, number of distinct index values, and other (not all) statistics and stores the estimates in the data block descriptor.

Also collects and stores single-AMP and all-AMP cardinality estimates in interval histograms in DBC.StatsTblStatsTbl.

Expands sample size dynamically to compensate for skew. Sensitive to skew.
Provides close-to- accurate estimates of all statistical parameters. Provides close-to- accurate estimates of base table and NUSI cardinality if the following conditions are met.
  • The table is large
  • The distribution of values is not skewed
  • The data is not taken from an atypical sample

Other standard statistical parameters are less likely to be as accurate.

You cannot sample single-column PARTITION statistics at a level lower than 100%. You can submit a COLLECT STATISTICS request that specifies a lower percentage without the request ending and without getting an error message, but Vantage does not honor the specified percentage, and changes the sampling percentage to 100 internally. See COLLECT STATISTICS (Optimizer Form).

Recollecting Sampled Statistics

When you recollect statistics on a column or index, whether for the Optimizer or for a QCD analysis, Vantage uses the same method of collection to recollect the statistics as was used for the initial collection, either full-table scan or sampling.

The only exception is when you specify USING SYSTEM SAMPLE when you first collect statistics, but do not specify the FOR CURRENT option for a recollection. The USING SYSTEM SAMPLE option initially collects statistics using a full-table scan, but then downgrades the initial 100% sampling to increasingly lower sampling percentages until reaching a point where the sampling percentage is not reduced any further.

If you had collected statistics on the column as part of an index on a row-partitioned table, then Vantage follows the general rule and recollects sampled statistics on the index column.

To change the collection options, you must submit a new COLLECT STATISTICS request with the new collection options fully specified. This does not apply for sampled COLUMN statistics on a component of the partitioning expression for a row-partitioned table or column-partitioned table, which are not valid, and which you cannot specify in a COLLECT STATISTICS request. Vantage collects (and recollects) statistics on all the rows in the table for this case.

If you want to recollect statistics for a histogram, but override the options that have been saved for that histogram, you can specify the FOR CURRENT option. FOR CURRENT recollects statistics using whatever options you specify for the current COLLECT STATISTICS request only, and reverts to the saved options for subsequent recollections.

You can collect sampled statistics using either a system-determined sampling percentage or a user-specified sampling percentage. For a user-specified sample percentage, Vantage reads the specified percent of rows to collect statistics. All the existing rules apply if the sample percent is explicitly specified.

If you specify the USING SYSTEM SAMPLE option when you collect statistics, the Optimizer determines the appropriate time to switch from collecting full statistics to collecting sampled statistics. With this approach, the Optimizer decides both when to honor sampling and the degree of sampling, up to collecting full statistics, without sacrificing the quality of the collected demographics. The downgrade process works as follows.
  1. When you first submit a request to collect sampled statistics, Vantage collects full statistics, which is equivalent to specifying a 100% sample. This provides the Optimizer with data to determine when to downgrade to a smaller percentage when recollecting statistics.
  2. On subsequent requests to recollect statistics, the Optimizer collects full statistics until it has enough historical data to know when to downgrade the sampling percentage for later recollections.
  3. With this historical data, the Optimizer can determine whether a column is skewed, rolling, or static.
  4. The Optimizer then considers the column usage (detailed or summary data) in the DBC.StatsTbl.UsageType column and evaluates the number of intervals you have specified to judge when to downgrade from collecting full statistics to collecting sampled statistics.

    Vantage is more aggressive about downgrading to sampled statistics for those histograms whose summary data is used, but whose detailed data is not.

  5. The Optimizer next determines a sampling percentage to use for recollecting statistics, ranging from 2% to 100%.

    The Optimizer determines the best formula to use to scale the sampling percentage based on the history and nature of the column.

  6. In the final stage of the process, Vantage compares the recollected statistics to the column history for quality. The Optimizer only lowers the sampling percentage for subsequent recollections if a smaller sampling percentage provides sufficiently high quality results.

This process removes the burden of determining which columns can accurately use sampled statistics. By collecting full statistics initially, the process provides the Optimizer with the information needed to make intelligent decisions about how much to sample in subsequent recollections of statistics.

For further detail, see COLLECT STATISTICS (Optimizer Form).