Comparing Accuracies of Collecting Statistics | VantageCloud Lake - Comparing Accuracies of Methods of Collecting Statistics - Teradata Vantage

Teradata® VantageCloud Lake

Deployment
VantageCloud
Edition
Lake
Product
Teradata Vantage
Published
January 2023
Language
English (United States)
Last Update
2024-04-03
dita:mapPath
phg1621910019905.ditamap
dita:ditavalPath
pny1626732985837.ditaval
dita:id
phg1621910019905

Comparative Accuracy of Sampled and Population Statistics

The important thing to understand when considering the comparative likelihoods of the accuracy of statistics collected by the available methods is the consistently higher accuracy of population statistics over all forms of sampled statistics. Statistics collected using a method with a lesser probability of accuracy may be as good as those collected at any higher level of probable accuracy, but are never more accurate.

You cannot know whether you need to collect a full set of new statistics to make sure that the Optimizer produces the best query plans.

Ranking Relative Accuracies of Methods

The best comparative estimates of the relative accuracies of the statistics collected by the methods are described by the following ranked list.
  1. Dynamic AMP samples are better than residual statistics in the majority of cases.

    Dynamic AMP samples are also recollected each time a DBD is retrieved from disk, and therefore are typically more current than residual statistics.

  2. Dynamic all-AMPs samples are better than dynamic AMP samples in most cases.
  3. Full-table population statistics are typically better than sampled statistics.

The following table provides details to support these rankings. Each successively higher rank represents an increase in the accuracy of the statistics collected and a higher likelihood that the Optimizer produces a better query plan.

Collection Method Relative Elapsed Time to Collect Accuracy Rank (Higher Number = Higher Accuracy) Comments
None.

Use residual statistics

None. 1
  • Impossible to predict with certainty if residual statistics produce a good query plan.

    At worst, can produce a poor query plan.

    At best, can produce as good a query plan as freshly collected statistics.

  • Because statistics exist, Optimizer does not collect fresh statistics using a dynamic AMP sample.
  • Optimizer does use derived statistics, but their starting point is the existing statistics. The derived statistics subsystem has ways to compensate for stale statistics, but beginning with the freshest set of statistics that can be made available is better (see Using Derived Statistics to Compensate for Stale Statistics).
Dynamic AMP sample Almost none. 2
  • Data is collected from a subset of the rows on a single AMP and may not be representative of full table demographics.

    Collecting data from a subset of the rows on a single AMP is the default method. Depending on an internal variable, the default number of AMPs sampled ranges over 1, 2, 5, all AMPs on a node, or all AMPs on a system.

  • Collects statistics for table cardinality, average cardinality per value, average cardinality per index, average size of each index on each AMP, and number of distinct index values only.
  • Because the sample size is small, there is a high bias in the statistics.
  • Accuracy is sensitive to skew.
  • Provides acceptable estimates of base table and NUSI cardinality if table is large, distribution of values is not skewed, and data is not taken from an atypical sample.

    Other standard statistical parameters are less likely to be as accurate.

All-AMP sample Approximately 5% of time to perform a full-table scan.

When the data is skewed, this percentage is larger, depending on how much the system dynamically increases its sample size

3
  • Data is collected from a system-determined subset of the rows on all AMPs.
  • Collects identical statistics to full-table scan, including interval histogram creation.
  • Percentage of sampled rows is small.
  • Percentage of sampled rows is increased dynamically to enhance the accuracy of collected statistics if skew is detected in the samples.
Full-table scan Approximately 195% of the time to perform sampled statistics. 4
  • Data is collected from all rows on all AMPs so there is no sample bias.
  • Collects full statistics and creates interval histograms.
  • Skew is accounted for using high-biased intervals in the statistical histogram for a column or index.
  • Up to 100 histogram intervals are used, depending on the data.