Optimal Times to Collect or Recollect Statistics | Optimizer Process | Vantage - Optimal Times to Collect or Recollect Statistics - Advanced SQL Engine - Teradata Database

SQL Request and Transaction Processing

Product
Advanced SQL Engine
Teradata Database
Release Number
17.10
Published
July 2021
Language
English (United States)
Last Update
2021-07-28
dita:mapPath
uqf1592445067244.ditamap
dita:ditavalPath
uqf1592445067244.ditaval
dita:id
B035-1142
lifecycle
previous
Product Category
Teradata Vantage™

The choice of collecting full-table statistics, some form of sampled statistics, or no statistics is yours to make as long as you understand that the method that provides the best table statistics over the long run is collecting and recollecting full-table statistics frequently. You also have the choice of specifying a sampling percentage, an elapsed number of days threshold, or a data demographics change percentage threshold, which instruct Vantage not to recollect statistics unless the specified thresholds are met or exceeded. Also be aware that table cardinalities are updated whenever rows in a table are updated or deleted or when new rows are added to a table by the Object Use Count and UDI system, so the Optimizer has access to accurate cardinality information whether statistics are current or not. For details about these systems, see Object Use and UDI Counts.

Teradata recommends always collecting full-table statistics on a regular basis unless you specify USING clause sampling or threshold options. In this case, the Optimizer determines whether a request to recollect statistics for a column or index set should be honored. The responsibility for submitting the COLLECT STATISTICS request remains yours, but if you specify one or more of the sampling or threshold options, Vantage makes the decision whether to recollect the specified statistics based on various historical criteria for the relevant histogram. For more information, see Recollecting Sampled Statistics and the information about COLLECT STATISTICS (Optimizer Form) in Teradata Vantage™ - SQL Data Definition Language Syntax and Examples, B035-1144.

Just how frequently statistics should be recollected is contingent on several factors, and if you take advantage of the sampling and threshold options that are available to you when you initially collect statistics, you need not be concerned about determining when it is necessary to recollect statistics because Vantage makes that determination for you.

To avoid making this decision arbitrarily, you should submit your requests to collect and recollect statistics using one or more of the sampling or threshold options that are described briefly in this topic and more extensively in the documentation for the COLLECT STATISTICS (Optimizer Form) statement in Teradata Vantage™ - SQL Data Definition Language Syntax and Examples, B035-1144 . Then all you need to do is submit your requests to recollect statistics and Vantage makes the determination of whether statistics need to be recollected or not based on the threshold and sampling criteria you have established. If the system determines that statistics need not be collected, it does not honor your request to recollect them, and you do not need to worry about recollecting statistics unnecessarily.

Depending on the various qualitative and quantitative changes in your column and index demographics, residual statistics can be just as good as freshly collected statistics (see Relative Accuracy of Residual Statistics Versus Dynamic AMP Sampled Statistics for Static Columns for a description of some of the factors to consider when making this evaluation).

Of course, it is up to you to determine what methods work best in the various situations encountered in your production environment. You might decide that a single approach is not good enough for all your different tables. All-AMPs sampled statistics might provide sufficient accuracy for your very large tables, enabling you to avoid expending the system resources that collecting full-table statistics might consume.

Keep in mind that the operational definition of good statistics is those statistics that produce an optimal query plan. How and when you collect statistics on your table columns and indexes depends on your definition of an optimal query plan.

Note that with the exception of statistics obtained by a dynamic AMP sample, statistics are collected globally, so are not affected by reconfiguring your system. In other words, all things being equal, there is no need to recollect statistics after you reconfigure your system.

Using the Teradata Viewpoint Stats Manager

The Teradata Viewpoint Stats Manager portlet provides a method of determining when fresh statistics should be collected.

See Teradata Viewpoint Stats Manager and the Teradata® Viewpoint User Guide, B035-2206 for information about the uses and capabilities of this portlet.

Policies for Collecting Statistics

The following table lists some suggested policies for collecting statistics. Each policy is rated as recommended or strongly recommended.

Following such policies is not as critical if you collect statistics using thresholds because Vantage does not recollect statistics when the thresholds you specify are not met. You must still submit COLLECT STATISTICS requests, but statistics are not recollected unless the specified thresholds are met or exceeded. This prevents you from wasting valuable system resources to recollect unnecessary statistics.

Policy Required or Recommended
Recollect all statistics when you upgrade to a new database release. Strongly recommended.
By the Ten Percent Rule, you should recollect statistics whenever table or partition demographics change by 10% or more. This rule applies to both row partitioning and column partitioning.
  • For a nonpartitioned table, recollect statistics whenever the demographics of the table change by 10% or more.
  • For a row-partitioned table, recollect statistics whenever the demographics of the row partitions change by 10% or more.
  • For a column-partitioned table, recollect statistics whenever the demographics of the column partition change by 10% or more.

For high volumes of very nonunique values such as dates or timestamps, you should consider recollecting statistics when the population changes by as little as 7%.

Strongly recommended.
Specify appropriate sampling or threshold options or both when you collect and recollect statistics. This enables the Optimizer to determine whether fresh statistics need to be collected or not. See COLLECT STATISTICS (Optimizer Form) information in Teradata Vantage™ - SQL Data Definition Language Syntax and Examples, B035-1144 for information about how to specify sampling and threshold options in the USING clause of COLLECT STATISTICS requests. Strongly recommended.
Collect statistics on newly created, empty tables to create the synoptic data structures for subsequent collection of statistics. Recommended.
Recollect statistics whenever the number of rows per distinct value is less than 100. Recommended.

The following table provides more specific recommendations for collecting statistics. You should consider this set of recommendations to be both minimal and essential.

FOR this category of table … You should collect statistics on …
all all columns used in join conditions.
large all NUPIs.
small

In this context, a small table is defined as a table whose cardinality is less than 5 times the number of AMPs in the system. For a 20 AMP system, table cardinality would have to be less than 100 rows for the table to be considered small, for a 100 AMP system, less than 500 rows, and so on

the primary index.
To ensure the best query plans, you should consider collecting statistics on the following, more general set of table columns.
  • All indexes.
  • High-access join columns.
  • Non-indexed columns frequently referenced in WHERE clause predicates, particularly if those columns contain skewed data.
  • The partitioning column set of a row-partitioned table.
  • The system-defined PARTITION column for all partitioned (column or row) tables.

    The value for the column partition number of the system-derived column PARTITION for a column-partitioned table is always 1, so collected statistics on a PARTITION column equate to the number of rows in each combined row partition.