Comparing the Accuracies of Methods of Collecting Statistics | Vantage - Comparing the Accuracies of Methods of Collecting Statistics - Advanced SQL Engine - Teradata Database

SQL Request and Transaction Processing

Product
Advanced SQL Engine
Teradata Database
Release Number
17.10
Published
July 2021
Language
English (United States)
Last Update
2021-07-28
dita:mapPath
uqf1592445067244.ditamap
dita:ditavalPath
uqf1592445067244.ditaval
dita:id
B035-1142
lifecycle
previous
Product Category
Teradata Vantageā„¢

Comparative Accuracy of Sampled and Population Statistics

The important thing to understand when considering the comparative a priori likelihoods of the accuracy of statistics collected by the various available methods is the consistently higher accuracy of population statistics over all forms of sampled statistics. It is always possible that statistics collected using a method with a lesser probability of accuracy will be as good as those collected at any given higher level of probable accuracy, but they will never be more accurate.

Although this is undeniable, it is not possible to know a priori whether it is necessary to collect a full set of new statistics in order to ensure that the Optimizer produces the best query plans.

Ranking the Relative Accuracies of the Various Methods

The best comparative estimates of the relative accuracies of the statistics collected by the various methods are described by the following ranked list.

  1. Dynamic AMP samples are better than residual statistics in the majority of cases.

    Dynamic AMP samples are also recollected each time a DBD is retrieved from disk, so they are nearly always more current than residual statistics.

  2. Dynamic all-AMPs samples are better than dynamic AMP samples in most cases.
  3. Full-table population statistics are usually better than any form of sampled statistics.

The following table provides some details to support these rankings. Each successively higher rank represents an increase in the accuracy of the statistics collected and a higher likelihood that the Optimizer will produce a better query plan because it has more accurate information to use for its estimates.

Collection Method Relative Elapsed Time to Collect Accuracy Rank (Higher Number = Higher Accuracy) Comments
None.

Use residual statistics

None. 1
  • Impossible to predict with certainty if residual statistics will produce a good query plan.

    At worst, can produce a very poor query plan.

    At best, can produce as good a query plan as freshly collected statistics.

  • Because statistics exist, Optimizer does not collect fresh statistics using a dynamic AMP sample.
  • Optimizer does use derived statistics, but their starting point is the existing statistics. The derived statistics subsystem has ways to compensate for stale statistics, but it is still better to begin with the freshest set of statistics that can be made available (see Using Derived Statistics to Compensate for Stale Statistics).
Dynamic AMP sample Almost none. 2
  • Data is collected from a subset of the rows on a single AMP and might not be representative of full table demographics.

    Collecting data from a subset of the rows on a single AMP is the default method. Depending on an internal DBS Control variable, the default number of AMPs sampled ranges over 1, 2, 5, all AMPs on a node, or all AMPs on a system. Consult your Teradata support representative for details.

  • Collects statistics for table cardinality, average cardinality per value, average cardinality per index, average size of each index on each AMP, and number of distinct index values only.
  • Because the sample size is small, there is a high bias in the statistics.
  • Accuracy is very sensitive to skew.
  • Provides fairly accurate estimates of base table and NUSI cardinality if table is large, distribution of values is not skewed, and data is not taken from an atypical sample.

    Other standard statistical parameters are less likely to be as accurate.

All-AMP sample Approximately 5% of time to perform a full-table scan.

When the data is skewed, this percentage is larger, depending on how much the system dynamically increases its sample size

3
  • Data is collected from a system-determined subset of the rows on all AMPs.
  • Collects identical statistics to full-table scan, including interval histogram creation.
  • Percentage of sampled rows is small.
  • Percentage of sampled rows is increased dynamically to enhance the accuracy of collected statistics if skew is detected in the samples.
Full-table scan Approximately 195% of the time to perform sampled statistics. 4
  • Data is collected from all rows on all AMPs so there is no sample bias.
  • Collects full statistics and creates interval histograms.
  • Skew is accounted for using high-biased intervals in the statistical histogram for a column or index.
  • Up to 100 histogram intervals are used, depending on the data.