The Optimizer uses statistics and demographics for several different purposes. Without full or all‑AMP sampled statistics, query optimization must rely on extrapolation (see “Using Extrapolation to Replace Stale Statistics” on page 288), object use counts (see “Object Use and UDI Counts” on page 276), or dynamic AMP sample estimates of table cardinality. Besides estimated cardinalities, dynamic AMP samples collect a few other statistics, but far fewer than a COLLECT STATISTICS request collects.
Statistics and demographics provide the Optimizer with information it uses to reformulate queries in ways that permit it to produce the least costly access and join plans. The critical issue you must evaluate when deciding whether to collect statistics is not whether query optimization can occur in the face of inaccurate statistics, but rather a pair of antagonistic questions about plan quality versus collection cost, which are presented in the optimization-level table later in this section.
It is extremely important to have reasonably accurate statistical and demographic data about your tables at all times.
For example, if the statistics for a table are stale, the Optimizer might estimate the cardinality after one processing step as 10,000 rows when the actual query returns only 15 rows at that step. The remaining steps in the query plan are all thrown off by the misestimate made in that step, and the performance of the request is suboptimal.
The Optimizer avoids many of the problems that histogram statistics have presented historically in two ways.
Because dynamic AMP statistics for a table are always collected from the same AMP set (see “Dynamic AMP Sampling” on page 181), they can also be used to detect table growth with a high degree of accuracy (see “Using Extrapolated Cardinality Estimates to Assess Table Growth” on page 290).
See “An Example of How Stale Statistics Can Produce a Poor Query Plan” on page 144 for a simple example of the extent to which a plan can be affected negatively by bad statistical data. Ioannidis and Christodoulakis (1991) have demonstrated that bad statistics propagate estimation errors in join planning at an exponential rate.
The following list is a very high‑level description of the most important purposes for column and index statistics and demographics:
The estimated cardinality of intermediate results is critical for the determination of both optimal join orders for tables and the kind of join method that should be used to make those joins.
For example, should 2 tables or spools be redistributed and then merge joined, or should one of the tables or spools be duplicated and then product joined with the other? Depending on how accurate the statistics are, the generated join plan can vary so greatly that the same query can take only seconds to complete using one join plan, but take hours to complete using another.
Teradata Database also uses PARTITION statistics for estimates when predicates are based on the PARTITION column, for example WHERE PARTITION IN (3,4).
Because the system usually can collect SUMMARY statistics very rapidly, you should collect these statistics for all tables when Teradata Database updates them after ETL jobs complete. You should also recollect PARTITION statistics for row-partitioned tables after ETL jobs complete or any time that you modify a partitioning expression using an ALTER TABLE request.
This is another reason that you should keep your statistics as current as you can.
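A minimal sketch of the post-ETL recollections described above, assuming a hypothetical row-partitioned table named sales_fact (verify the exact syntax for your release in SQL Data Definition Language):

```sql
-- Refresh the quickly collected table-level SUMMARY statistics.
COLLECT SUMMARY STATISTICS ON sales_fact;

-- Recollect PARTITION statistics for the row-partitioned table.
COLLECT STATISTICS COLUMN (PARTITION) ON sales_fact;
```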
For information about the Viewpoint Stats Manager, see the Teradata Viewpoint User Guide.
In some cases, dynamic all‑AMP (see “Sampled Statistics” on page 177) or dynamic single‑AMP sampled statistics (see “Dynamic AMP Sampling” on page 181) are not accurate enough for the Optimizer to generate an optimal, or even a good, join plan. However, it is often true that statistics collected by sampling small subpopulations of table rows can be as good as those collected using a full‑table scan.
The value of collecting full‑table statistics is that they provide the Optimizer with the most accurate information that can be gathered for making the best possible plan cost estimates. In most cases, the creation and propagation of derived statistics by the Optimizer can balance all but the most stale statistics (see “Derived Statistics” on page 210).
Statistical accuracy is fundamentally important for any plan because the effect of suboptimal access and join plans generated from inaccurate statistics, of which there can be many in the optimization of a complex query, is multiplicative. For example, see Ioannidis and Christodoulakis (1991), where a study of error propagation in join planning demonstrated that the error increases exponentially as a function of the number of joins in the queries being optimized. They write, “All types of error in an N‑way join grow exponentially with N.”
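That result can be illustrated with a simple compounding model (a sketch, not the authors' exact formulation): if each join step inflates the relative cardinality error by a roughly constant factor 1 + ε, the error after N joins is

```latex
\text{error}(N) \approx (1 + \epsilon)^{N}
```

so even a modest 30% per-join error (ε = 0.3) compounds to roughly a 13.8× error after a 10-way join.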
A plan generated using full‑table statistics is expected to be at least as good as a plan generated using any form of sampled statistics. There is a high probability that a plan based on full‑table statistics will be better, and sometimes significantly better, than a plan based on any form of sampled statistics.
The THRESHOLD options for the COLLECT STATISTICS statement enable you to set periodic recollection thresholds for recollecting statistics. Importantly, if the thresholds you set are not met, the existing statistics are considered not to be stale, so Teradata Database does not recollect them even if you submit an explicit COLLECT STATISTICS request. The metadata that Teradata Database uses to manage the automatic recollection of statistics is stored in several tables in the TDSTATS database.
Several features work with the COLLECT STATISTICS statement to manage the statistics that you collect, including the STATSUSAGE and USECOUNT options for the BEGIN QUERY LOGGING and REPLACE QUERY LOGGING statements and a number of APIs that can analyze database objects to identify situations where statistics management can be improved by the system. For documentation of these APIs, see Application Programming Reference.
When viewed in isolation, the decision between full‑table and all‑AMPs sampled statistics is a simple one: always collect full‑table statistics, because they provide the best opportunity for producing optimal query plans.
While statistics collected from a full‑table scan accurately represent the entire domain, an all‑AMPs sample estimates statistics from a small sample of the domain. A dynamic AMP sample is based on an even smaller sample that is more sensitive to cardinality and skew, and it also estimates fewer statistics.
Unfortunately, this decision is not so easily made in a production environment. Other factors must be accounted for, including the length of time required to collect the statistics and the resource consumption burden the collection of full‑table statistics incurs.
The optimal compromise is provided by collecting object use and UDI counts (see “Object Use and UDI Counts” on page 276). The Optimizer can make a very accurate estimate of current table cardinality by adding the current INSERT count to the current cardinality estimate from residual collected statistics and then subtracting the current DELETE count from that sum.
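The estimate described above can be written as follows, where C_stats is the cardinality recorded when statistics were last collected:

```latex
\hat{C} \;=\; C_{\text{stats}} \;+\; N_{\text{INSERT}} \;-\; N_{\text{DELETE}}
```

For example, with hypothetical counts, a recorded cardinality of 1,000,000 rows combined with 250,000 inserts and 50,000 deletes since the last collection yields an estimate of 1,200,000 rows.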
In a production environment running multiple heavy query workloads, the problem concerns multiple levels of optimization.
Bottom level (query optimization): If collecting full‑table statistics makes queries run faster, what reasons could there be for collecting less accurate statistical samples?

Middle level (workload optimization): If the act of collecting full‑table statistics makes the system run slower, is there any way to collect statistical samples of table populations that are reasonably accurate and that will produce reasonably good query plans?

Top level (optimization mix): What mix of query and workload optimization is best for overall system performance?
The following table compares the various characteristics of the three methods of collecting statistics and documents their respective most productive uses:
Full statistics

Sampled statistics
Best use: Experience suggests that sampled statistics are useful for very large tables, meaning tables with tens of billions of rows.

Dynamic AMP sample
Characteristics: A limited set of statistics is estimated for all columns, and a few additional statistics are estimated for nonunique secondary indexes only. Cardinality is not refreshed by individual INSERT or DELETE requests even if the sum of their updates exceeds the 10% threshold.
Best use: Dynamic AMP sampling provides a reasonable fallback mechanism for supporting the optimization of newly devised ad hoc queries until you understand where collected statistics are needed to support query plans for them. Teradata Database stores cardinality estimates from dynamic AMP samples in the interval histogram for estimating table growth even when complete, fresh statistics are available.
To avoid recollecting statistics when the existing statistics are still fresh, you should use one or more of the THRESHOLD options that you can specify for a COLLECT STATISTICS request. When you specify a threshold that must be met for statistics on an index or column set to be recollected, you enable Teradata Database to use various methods to determine whether the statistics it already has have become stale or not. If the existing statistics meet your criteria for freshness, Teradata Database ignores a request you submit to recollect those statistics and only recollects them when they meet your specified criteria. You can specify similar criteria for recollecting statistics using sampling rather than a full‑table scan. The Optimizer uses these directives together with object use count information to determine when unused statistics should be recollected or dropped. See “Object Use and UDI Counts” on page 276 and “COLLECT STATISTICS (Optimizer Form)” in SQL Data Definition Language for further information about using thresholds and sampling criteria with your recollections of index and column statistics.
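For example, the threshold criteria described above might be specified as follows (a sketch; the column and table names are hypothetical, and the exact USING syntax should be verified against SQL Data Definition Language for your release):

```sql
-- Recollect only if at least 10% of rows have changed
-- or 7 days have elapsed since the last collection.
COLLECT STATISTICS
  USING THRESHOLD 10 PERCENT AND THRESHOLD 7 DAYS
  COLUMN (order_date)
ON orders;
```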
The following flowchart diagrams the process by which the Optimizer looks for statistics when it is creating a query plan for a request:
The general process stages the Optimizer follows while performing this task are as follows:
1 On receiving a parsed and rewritten request, the Optimizer looks for the data block descriptor (DBD) in the dictionary cache.
IF the table DBD is …          THEN the Optimizer …
in the dictionary cache        extracts the dynamic AMP samples it needs.
not in the dictionary cache    reads the DBD from disk, then extracts the dynamic AMP samples it needs.
Each time a table header is read from disk, Teradata Database collects a fresh dynamic AMP sample of its statistics and places the sample in the appropriate DBD, which is then cached in the dictionary cache along with the table header. The new cardinality estimate is stored in the interval histogram for the table. (When Teradata Database caches histogram statistics, it writes them to the statistics cache rather than the dictionary cache.)
Remember that the statistics derived from a dynamic AMP sample are not as complete as those gathered by a COLLECT STATISTICS request and stored in interval histograms in the dictionary.
2 The Optimizer calls a routine that looks for collected statistics on each column set or index it needs whenever it requires statistics to build a query plan.
a Teradata Database searches the statistics cache for the relevant interval histogram.
b If the desired interval histogram is not cached, Teradata Database issues an express request that retrieves it from the appropriate dictionary tables on disk.
Teradata Database stores the histogram data in the DBC.StatsTbl dictionary table and stores metadata for every automated collection of statistics in various tables in the TDSTATS database, which you can access using the appropriate API procedures documented in Application Programming Reference. Automated collection of statistics is controlled by the THRESHOLD option of the COLLECT STATISTICS statement.
3 If no statistics have been collected for the desired column set, the Optimizer proceeds as outlined in the following table:
IF the desired histogram is …    THEN …
located                          place it in the statistics cache.
not located                      determine if the column set represents an index.
4 Teradata Database purges the statistics from the statistics cache when the Optimizer recollects statistics.
Because the statistics contained in the interval histograms for a NUSI subtable (see “Interval Histograms” on page 162) can be so stale that a dynamic AMP sample would produce more accurate cardinality estimates, the Optimizer bases its selection of which statistics to use on the following set of rules.
The Optimizer assumes that all of the distinct values it samples from a single AMP exist on all of the AMPs in the system.
This counters the problem of stale NUSI statistics.
The implication of this is that you need not refresh the statistics for a NUSI each time new data is loaded into its base table as long as the distribution of the primary index is fairly uniform.
Note: This does not relieve you of the responsibility of collecting or refreshing NUSI statistics for the following.
The Optimizer needs interval histograms to make accurate single‑table cardinality estimates.
The GeoGrid structure that Teradata Database uses to collect population statistics on both geospatial columns and geospatial NUSIs is stored in the DBC.StatsTbl table as part of the statistics histogram (for further information, see “GeoGrid Population Statistics” on page 158).
Teradata Database supports NUSIs on geospatial columns using GeoGrids for system‑wide population statistics (for information about GeoGrids and the statistics they maintain, see “GeoGrid Population Statistics” on page 158) and Hilbert R‑trees for summary statistics (see “Hilbert R‑Tree Summary Statistics” on page 160). The Hilbert R‑tree is an indexing structure for multidimensional objects that is organized as a tree with n levels. It is built from a multilevel hierarchy of n−1 directory levels stemming from a single root, or header, block and one leaf level. The hierarchy is constructed using directory blocks containing directory rows and leaf blocks containing leaf rows.
A system routine generates a linear value, called a Hilbert value, for each minimum bounding rectangle (MBR) in a Hilbert R‑tree. The Hilbert value for an MBR is associated with its center point. When an entry is inserted into a geospatial NUSI, the Hilbert R‑tree navigates by choosing the directory block row whose Hilbert value is closest to the Hilbert value of the entry being inserted.
The following graphic provides a simplified drawing of an R‑tree in the context of Teradata NUSIs.
The directory rows contain pointers to the child blocks in the level below them. Both the directory rows and the leaf rows have predetermined fixed footprints. The directory blocks and leaf blocks both have maximum limits on the number of rows that each block can contain. As more rows are inserted into the tree, the individual blocks that contain them must be split each time the maximum row count is exceeded.
R‑trees grow from the bottom up, meaning that at first leaf blocks are split; then additional rows must be added to the parent directory block to accommodate the split leaf blocks. The parent block directory must eventually be split, and so on.
The Teradata Database Hilbert R‑tree is built on top of the file system. Complex indexes, such as the Hilbert R‑tree, are typically realized as a collection of blocks, containing rows, which contain pointers to other blocks.
The elapsed time to collect statistics and the resources consumed in doing so are the principal factors that argue against collecting statistics on the entire population of a table and in favor of collecting what are probably less accurate statistics from a sampled subset of the full population, which, in turn, can result in less optimal query plans.
You cannot collect sampled single‑column PARTITION statistics. The system allows you to submit such a request, but does not honor it. Instead, Teradata Database sets the sampling percentage to 100%. Sampled collection is permitted for multicolumn PARTITION statistics.
The elapsed time required to collect statistics varies as a function of the following factors.
For a 32-node production system, collecting statistics on a single column of a 100 million row table might take an hour. If you collect statistics on multiple columns and indexes of the same table, the process can easily take 4 hours to complete. When you add the time required to collect statistics for the numerous smaller tables in the database, the total time required to collect statistics for all the tables can be surprisingly large. You might even decide that the necessary time for collection is excessive for your production environment, particularly if you have a narrow time window for collecting statistics.
Collecting full‑table statistics is not just time consuming; it also places a performance burden on the system, consuming CPU and disk I/O resources that would otherwise be devoted to query processing and other application workloads. You might decide that collecting statistics places too many burdens on system resources to justify recollecting statistics on a daily, weekly, or even monthly basis.
After examining all these considerations, you might even conclude that any recollecting of statistics is an unacceptable burden for your production environment.
Because collecting and recollecting statistics can be so time consuming, you should consider specifying one or more of the sampling or threshold options (or both) when you first collect statistics on a table.
These options enable the Optimizer to decide when statistics have become stale rather than forcing a database administrator into the position of making the determination. When these collection options are in place, the Optimizer does not honor requests to recollect statistics that are not stale, so you can submit COLLECT STATISTICS more frequently, knowing that if the Optimizer does not need to freshen the statistics it requires to optimize a request, it does not recollect those statistics needlessly. See “COLLECT STATISTICS (Optimizer Form)” in SQL Data Definition Language for more information about the USING clause sampling and threshold options.
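A sketch combining the sampling and threshold options on a first collection (the column and table names are hypothetical; verify the exact USING syntax against SQL Data Definition Language for your release):

```sql
-- Scan only about 10% of the rows, and skip recollection when fewer
-- than 15% of the rows have changed since the last collection.
COLLECT STATISTICS
  USING SAMPLE 10 PERCENT AND THRESHOLD 15 PERCENT
  COLUMN (customer_id)
ON orders;
```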
The optimal way to determine whether statistics that have been collected on a database object are stale is to use the THRESHOLD logic of the COLLECT STATISTICS (Optimizer Form) statement (see “COLLECT STATISTICS (Optimizer Form)” in SQL Data Definition Language).
The following example is admittedly extreme, but it is instructive in demonstrating how severely bad statistics can degrade the query plan the Optimizer generates.
Consider the following 2 tables.
Table Name    Statistics Collected?    Cardinality When Statistics Were Collected    Current Cardinality
A             Yes                      1,000                                         1,000,000
B             No                       unknown                                       75,000
If a product join between table A and table B is necessary for a given query, and one of the tables must be duplicated on all AMPs, then the Optimizer selects table A to be duplicated because, as far as it knows from the available statistics, only 1,000 rows must be duplicated, as opposed to the far greater 75,000 rows of table B.
In reality, table A currently has a cardinality that is three orders of magnitude larger than its cardinality at the time its statistics were collected: one million rows, not one thousand. The Optimizer therefore makes a very bad decision by duplicating the million rows of table A instead of the 75,000 rows of table B. As a result, the query runs much longer than necessary.
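In this situation, recollecting statistics on table A refreshes its histogram so that the Optimizer sees the current cardinality of one million rows and duplicates the smaller table B instead (a sketch; the join column name a1 is hypothetical):

```sql
COLLECT STATISTICS COLUMN (a1) ON A;
```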
There are 2 general circumstances under which statistics can be considered to be stale.
Sometimes you can infer this from the date and time the statistics were last collected, or by the nature of the column. For example, if the column in question stores transaction dates, and statistics on that column were last gathered a year ago, it is certain that the statistics for that column are stale.
The best way to handle this is to enable the Optimizer to determine when statistics are stale by specifying one or more threshold options when you collect or recollect statistics on a table or single‑table view. Collecting and recollecting statistics using one or more of the threshold options makes it possible for the Optimizer to reject recollecting statistics when it determines that the current statistics are not stale.
You can obtain the number of unique values for each statistic on a table, as well as the date and time the statistics were last gathered, using the following request.
HELP STATISTICS table_name;
For statistics on unique indexes, you can cross check the number of unique values reported by HELP STATISTICS against the row count returned by the following query.
SELECT COUNT(*)
FROM table_name;
For statistics on nonunique columns, you can cross check the HELP STATISTICS report by comparing it with the count returned by the following query.
SELECT COUNT(DISTINCT column_name)
FROM table_name;
The Stats Manager portlet allows you to manage Teradata Database statistics collection, which includes the ability to collect and analyze statistics, create and control jobs, and manage recommendations.
You can use Stats Manager to perform the following tasks:
See Teradata Viewpoint User Guide for information about the uses and capabilities of this portlet.