The choice of collecting full-table statistics, some form of sampled statistics, or no statistics is yours to make as long as you understand that the method that provides the best table statistics over the long run is collecting and recollecting full-table statistics frequently. You also have the choice of specifying a sampling percentage, an elapsed number of days threshold, or a data demographics change percentage threshold, which instruct Teradata Database not to recollect statistics unless the specified thresholds are met or exceeded. Also be aware that table cardinalities are updated whenever rows in a table are updated or deleted or when new rows are added to a table by the Object Use Count and UDI system, so the Optimizer has access to accurate cardinality information whether statistics are current or not. For details about these systems, see Object Use and UDI Counts.
Teradata recommends always collecting full-table statistics on a regular basis unless you specify USING clause sampling or threshold options. In this case, the Optimizer determines whether a request to recollect statistics for a column or index set should be honored. The responsibility for submitting the COLLECT STATISTICS request remains yours, but if you specify one or more of the sampling or threshold options, Teradata Database makes the decision whether to recollect the specified statistics based on various historical criteria for the relevant histogram. For more information, see Recollecting Sampled Statistics and “COLLECT STATISTICS (Optimizer Form)” in SQL Data Definition Language.
Just how frequently statistics should be recollected is contingent on several factors, and if you take advantage of the sampling and threshold options that are available to you when you initially collect statistics, you need not be concerned about determining when it is necessary to recollect statistics because Teradata Database makes that determination for you.
To avoid making this decision arbitrarily, you should submit your requests to collect and recollect statistics using one or more of the sampling or threshold options that are described briefly in this topic and more extensively in the documentation for the COLLECT STATISTICS (Optimizer Form) statement in SQL Data Definition Language . Then all you need to do is submit your requests to recollect statistics and Teradata Database makes the determination of whether statistics need to be recollected or not based on the threshold and sampling criteria you have established. If the system determines that statistics need not be collected, it does not honor your request to recollect them, and you do not need to worry about recollecting statistics unnecessarily.
Depending on the various qualitative and quantitative changes in your column and index demographics, residual statistics can be just as good as freshly collected statistics (see Relative Accuracy of Residual Statistics Versus Dynamic AMP Sampled Statistics for Static Columns for a description of some of the factors to consider when making this evaluation).
Of course, it is up to you to determine what methods work best in the various situations encountered in your production environment. You might decide that a single approach is not good enough for all your different tables. All-AMPs sampled statistics might provide sufficient accuracy for your very large tables, enabling you to avoid expending the system resources that collecting full-table statistics might consume.
Keep in mind that the operational definition of good statistics is those statistics that produce an optimal query plan. How and when you collect statistics on your table columns and indexes depends on your definition of an optimal query plan.
Note that with the exception of statistics obtained by a dynamic AMP sample, statistics are collected globally, so are not affected by reconfiguring your system. In other words, all things being equal, there is no need to recollect statistics after you reconfigure your system.
Using the Teradata Viewpoint Stats Manager
The Teradata Viewpoint Stats Manager portlet provides a method of determining when fresh statistics should be collected.
See Teradata Viewpoint Stats Manager and the Teradata Viewpoint User Guide for information about the uses and capabilities of this portlet.
Policies for Collecting Statistics
The following table lists some suggested policies for collecting statistics. Each policy is rated as recommended or strongly recommended.
Following such policies is not as critical if you collect statistics using thresholds because Teradata Database does not recollect statistics when the thresholds you specify are not met. You must still submit COLLECT STATISTICS requests, but statistics are not recollected unless the specified thresholds are met or exceeded. This prevents you from wasting valuable system resources to recollect unnecessary statistics.
|Policy||Required or Recommended|
|Recollect all statistics when you upgrade to a new Teradata Database release.||Strongly recommended.|
|By the Ten Percent Rule, you should recollect statistics whenever table or partition demographics change by 10% or more. This rule applies to both row partitioning and column partitioning.
For high volumes of very nonunique values such as dates or timestamps, you should consider recollecting statistics when the population changes by as little as 7%.
|Specify appropriate sampling or threshold options or both when you collect and recollect statistics. This enables the Optimizer to determine whether fresh statistics need to be collected or not. See “COLLECT STATISTICS (Optimizer Form)” in SQL Data Definition Language for information about how to specify sampling and threshold options in the USING clause of COLLECT STATISTICS requests.||Strongly recommended.|
|Collect statistics on newly created, empty tables to create the synoptic data structures for subsequent collection of statistics.||Recommended.|
|Recollect statistics whenever the number of rows per distinct value is less than 100.||Recommended.|
The following table provides more specific recommendations for collecting statistics. You should consider this set of recommendations to be both minimal and essential.
|FOR this category of table …||You should collect statistics on …|
|all||all columns used in join conditions.|
In this context, a small table is defined as a table whose cardinality is less than 5 times the number of AMPs in the system. For a 20 AMP system, table cardinality would have to be less than 100 rows for the table to be considered small, for a 100 AMP system, less than 500 rows, and so on
|the primary index.|
To ensure the best query plans, you should consider collecting statistics on the following, more general set of table columns.
- All indexes.
- High-access join columns.
- Non-indexed columns frequently referenced in WHERE clause predicates, particularly if those columns contain skewed data.
- The partitioning column set of a row-partitioned table.
- The system-defined PARTITION column for all partitioned (column or row) tables.
The value for the column partition number of the system-derived column PARTITION for a column-partitioned table is always 1, so collected statistics on a PARTITION column equate to the number of rows in each combined row partition.