Functions of Optimizer Statistics and Demographics
The Optimizer uses statistics and demographics for several different purposes. Without full or all-AMP sampled statistics, query optimization must rely on extrapolation (see Using Extrapolation to Replace Stale Statistics), object use counts (see Object Use and UDI Counts), or dynamic AMP sample estimates of table cardinality, none of which provides all of the statistics that a COLLECT STATISTICS request collects.
When deciding how current your statistics must be, consider the following questions:
- How accurate must the available statistics be in order to generate the best possible query plan?
- How poor a query plan are you willing to accept?
It is extremely important to have reasonably accurate statistical and demographic data about your tables at all times.
For example, if the statistics for a table are stale, the Optimizer might estimate the cardinality after one processing step as 10,000 rows when the actual query returns only 15 rows at that step. The remaining steps in the query plan are all thrown off by the misestimate in that first step, and the performance of the request is suboptimal.
Vantage compensates for missing or stale statistics in the following ways:
- By collecting object use and UDI counts (see Object Use and UDI Counts).
- By saving the dynamic AMP sample statistics, PARTITION statistics, or both, as attributes in the interval histogram. PARTITION statistics can be collected more quickly than other statistics and can be collected for row- and column-partitioned tables and for nonpartitioned tables.
Because dynamic AMP statistics for a table are always collected from the same AMP set (see Dynamic AMP Sampling), they can also be used to detect table growth with a high degree of accuracy (see Using Extrapolated Cardinality Estimates to Assess Table Growth).
Purposes of Statistics and Demographics
- The Optimizer uses statistics and demographics to determine whether it should generate a query plan that uses an index instead of performing a full-table scan.
- The Optimizer uses statistics and demographics to estimate the cardinalities of intermediate spools based on the qualifying conditions specified by the request.
The estimated cardinality of intermediate results is critical for the determination of both optimal join orders for tables and the kind of join method that should be used to make those joins.
For example, should two tables or spools be redistributed and then merge joined, or should one of them be duplicated and then Product Joined with the other? Depending on how accurate the statistics are, the generated join plan can vary so greatly that the same query can take only seconds to complete using one join plan, but hours to complete using another.
- Statistics collected on the PARTITION system-derived column permit the Optimizer to better estimate costs and cardinalities. This is also true for nonpartitioned tables and column-partitioned tables.
Vantage also uses PARTITION statistics for estimates when predicates are based on the PARTITION column, for example WHERE PARTITION IN (3,4).
Because the system usually can collect SUMMARY statistics very rapidly, you should collect these statistics for all tables when Vantage updates them after ETL jobs complete. You should also recollect PARTITION statistics for row-partitioned tables after ETL jobs complete or any time that you modify a partitioning expression using an ALTER TABLE request.
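As a sketch of the post-ETL recollection pattern described above (the table name sales_fact is hypothetical), the load job's final step might submit requests such as:

```sql
-- Recollect the quick SUMMARY statistics after the ETL job completes.
COLLECT SUMMARY STATISTICS ON sales_fact;

-- Recollect PARTITION statistics for the row-partitioned table.
COLLECT STATISTICS COLUMN (PARTITION) ON sales_fact;
```

Because both forms read little or no row data, they are usually cheap enough to run at the end of every load window.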
This is another reason that you should keep your statistics as current as you can.
For information about the Viewpoint Stats Manager, see the Teradata® Viewpoint User Guide, B035-2206.
In some cases, dynamic all-AMP (see Sampled Statistics) or dynamic single-AMP sampled statistics (see Dynamic AMP Sampling) are not accurate enough for the Optimizer to generate an optimal, or even a good, join plan. However, it is often true that statistics collected by sampling small subpopulations of table rows can be as good as those collected using a full-table scan.
The value of collecting full-table statistics is that they provide the Optimizer with the most accurate information that can be gathered for making the best possible plan cost estimates. In most cases, the creation and propagation of derived statistics by the Optimizer can compensate for all but the most stale statistics (see Derived Statistics).
Statistical accuracy is fundamentally important for any plan because the effect of suboptimal access and join plans generated from inaccurate statistics, of which there can be many in the optimization of a complex query, is multiplicative.
A plan generated using full-table statistics is expected to be at least as good as a plan generated using any form of sampled statistics. There is a high probability that a plan based on full-table statistics will be better, and sometimes significantly better, than a plan based on any form of sampled statistics.
Automatic Collection and Recollection of Statistics
The THRESHOLD options for the COLLECT STATISTICS statement enable you to set periodic recollection thresholds for recollecting statistics. Importantly, if the thresholds you set are not met, the existing statistics are considered not to be stale, so Vantage does not recollect them even if you submit an explicit COLLECT STATISTICS request. The metadata that Vantage uses to manage the automatic recollection of statistics is stored in several tables in the TDSTATS database.
Several features work with the COLLECT STATISTICS statement to manage the statistics that you collect, including the STATSUSAGE and USECOUNT options for the BEGIN QUERY LOGGING and REPLACE QUERY LOGGING statements and a number of APIs that can analyze database objects to identify situations where statistics management can be improved by the system. For documentation of these APIs, see Teradata Vantage™ - Application Programming Reference, B035-1090.
Relative Benefits of Collecting Full-Table and Sampled Statistics
When viewed in isolation, the decision between full-table and all-AMPs sampled statistics is a simple one: always collect full-table statistics, because they provide the best opportunity for producing optimal query plans.
While statistics collected from a full-table scan accurately represent the entire domain, an all-AMPs sample estimates statistics from a small subset of the domain. A dynamic AMP sample is based on an even smaller sample that is more sensitive to cardinality and skew, and it estimates fewer statistics.
Unfortunately, this decision is not so easily made in a production environment. Other factors must be accounted for, including the length of time required to collect the statistics and the resource consumption burden the collection of full-table statistics incurs.
The optimal compromise is provided by collecting object use and UDI counts (see Object Use and UDI Counts). The Optimizer can make a very accurate estimate of current table cardinality by adding the current INSERT count to the current cardinality estimate from residual collected statistics and then subtracting the current DELETE count from that sum.
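The counts-based estimate described above can be illustrated with hypothetical numbers: a residual statistics cardinality of 1,000,000 rows, with 250,000 inserts and 50,000 deletes logged since the statistics were collected.

```sql
-- Estimated current cardinality =
--   residual cardinality + INSERT count - DELETE count
SELECT 1000000 + 250000 - 50000 AS estimated_cardinality;
-- 1,200,000 rows
```

In practice the Optimizer performs this adjustment internally from the UDI counts; the arithmetic here is purely illustrative.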
In a production environment running multiple heavy query workloads, the problem concerns multiple levels of optimization.
| Level | Type of Optimization | Considerations |
|---|---|---|
| Bottom | Query | If collecting full-table statistics makes queries run faster, what reasons could there be for collecting less accurate statistical samples? |
| Middle | Workload | If the act of collecting full-table statistics makes the system run slower, is there any way to collect statistical samples of table populations that are reasonably accurate and that will produce reasonably good query plans? |
| Top | Mix | What mix of query and workload optimization is best for overall system performance? |
The following table compares the various characteristics of the three methods of collecting statistics and documents their respective most productive uses:
| Method | Characteristics | Best Use |
|---|---|---|
| Full statistics | Reads every row in the table, so it produces the most accurate statistics, but it also consumes the most elapsed time, CPU, and disk I/O. | When plan quality is paramount and the collection time and resource cost are acceptable. |
| Sampled statistics | Reads a small subset of table rows, so collection is faster and cheaper, at some cost in accuracy. | Large tables for which full-table collection takes too long or consumes too many resources. |
| Dynamic AMP sample | Gathered on the fly from a small number of AMPs; based on the smallest, most cardinality- and skew-sensitive sample, and estimates fewer statistics. | A fallback estimate of table cardinality when no collected statistics are available. |
To avoid recollecting statistics that are still fresh, use one or more of the THRESHOLD options of the COLLECT STATISTICS request. When you specify a threshold that must be met before statistics on an index or column set are recollected, Vantage can use various methods to determine whether the existing statistics have become stale. If the existing statistics meet your criteria for freshness, Vantage ignores a request to recollect them and recollects only when your specified criteria are met. You can specify similar criteria for recollecting statistics using sampling rather than a full-table scan. The Optimizer uses these directives together with object use count information to determine when unused statistics should be recollected or dropped. See Object Use and UDI Counts and the information about COLLECT STATISTICS (Optimizer Form) in Teradata Vantage™ - SQL Data Definition Language Syntax and Examples, B035-1144 for further information about using thresholds and sampling criteria with your recollections of index and column statistics.
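For example, the following request (the orders table and order_date column are hypothetical) asks Vantage to treat the statistics as fresh, and therefore skip recollection, until the data-change and age thresholds are exceeded:

```sql
COLLECT STATISTICS
  USING THRESHOLD 10 PERCENT AND THRESHOLD 7 DAYS
  COLUMN (order_date)
ON orders;
```

With these options in place, routine COLLECT STATISTICS requests for this column become inexpensive no-ops whenever the existing histogram is still current.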
Using Interval Histogram Statistics or a Dynamic AMP Sample for NUSI Subtable Statistics
- The number of NUSI subtable index rows in an AMP is the number of distinct values in the index columns.
The Optimizer assumes that all of the distinct values it samples from a single AMP exist on all of the AMPs in the system.
- If a NUSI has an interval histogram, and that histogram has fewer distinct values than a dynamic AMP sample gathers, then the system increases the number of index values to the number of values obtained from the dynamic AMP sample, with the exception of the primary index.
This counters the problem of stale NUSI statistics.
- The system sets the number of distinct values for unique indexes to the table cardinality by overriding the values from either the interval histogram or from a dynamic AMP sample.
The implication of this is that you need not refresh the statistics for a NUSI each time new data is loaded into its base table as long as the distribution of the primary index is fairly uniform.
Collect statistics on a NUSI rather than relying on a dynamic AMP sample when either of the following is true:
- Tables have a high correlation between the primary index and NUSI column sets.
- Workloads contain frequently submitted queries with single-table predicates.
The Optimizer needs interval histograms to make accurate single-table cardinality estimates.
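Under those conditions, collecting an interval histogram for the NUSI column is the safer choice. A minimal sketch, assuming a hypothetical sales_fact table with a NUSI on store_id:

```sql
-- Build an interval histogram on the NUSI column so the Optimizer
-- does not have to fall back on a dynamic AMP sample.
COLLECT STATISTICS COLUMN (store_id) ON sales_fact;
```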
Time and Resource Consumption Factors in Deciding How to Collect Statistics
The elapsed time required to collect statistics, and the resources consumed in doing so, are the principal factors that weigh against collecting statistics on the entire population of a table and in favor of collecting probably less accurate statistics from a sampled subset of that population, which in turn can result in less optimal query plans.
You cannot collect sampled single-column PARTITION statistics. The system allows you to submit such a request, but does not honor it. Instead, Vantage sets the sampling percentage to 100%. Sampled collection is permitted for multicolumn PARTITION statistics.
The time required to collect statistics depends on several factors, including:
- Base table cardinality
- Number of distinct index or non-indexed column values
- System configuration
If you collect statistics on multiple columns and indexes of the same table, the process can easily take two to four times longer to complete than collecting statistics on a single column. When you add the time required to collect statistics for numerous smaller tables in the database to the mix, the time required to collect statistics for all the tables can be surprisingly large. You might even decide that the necessary time for collection is excessive for your production environment, particularly if you have a narrow time window for collecting statistics.
Collecting full-table statistics is not just time consuming; it also places a performance burden on the system, consuming CPU and disk I/O resources that would otherwise be devoted to query processing and other application workloads. You might decide that collecting statistics places too many burdens on system resources to justify recollecting statistics on a daily, weekly, or even monthly basis.
After examining all these considerations, you might even conclude that any recollecting of statistics is an unacceptable burden for your production environment.
Because collecting and recollecting statistics can be so time consuming, you should consider specifying one or more of the sampling or threshold options (or both) when you first collect statistics on a table.
These options enable the Optimizer to decide when statistics have become stale rather than forcing a database administrator into the position of making the determination. When these collection options are in place, the Optimizer does not honor requests to recollect statistics that are not stale, so you can submit COLLECT STATISTICS more frequently, knowing that if the Optimizer does not need to freshen the statistics it requires to optimize a request, it does not recollect those statistics needlessly. See the information about COLLECT STATISTICS (Optimizer Form) in Teradata Vantage™ - SQL Data Definition Language Syntax and Examples, B035-1144 for more information about the USING clause sampling and threshold options.
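For example, the following hypothetical request (the orders table and customer_id column are illustrative) combines sampling with a change threshold, so that recollections read only a fraction of the table and are skipped entirely while the statistics remain fresh:

```sql
COLLECT STATISTICS
  USING SAMPLE 10 PERCENT AND THRESHOLD 15 PERCENT
  COLUMN (customer_id)
ON orders;
```

The exact percentages are workload-dependent; larger samples cost more to collect but track skewed distributions more faithfully.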
An Example of How Stale Statistics Can Produce a Poor Query Plan
The optimal way to determine whether statistics that have been collected on a database object are stale is to use the THRESHOLD logic of the COLLECT STATISTICS (Optimizer Form) statement (see Teradata Vantage™ - SQL Data Definition Language Syntax and Examples, B035-1144).
The following example is extreme, but it demonstrates how badly stale statistics can skew the query plan the Optimizer generates.
Consider two tables, A and B:
| Table Name | Statistics Collected? | Cardinality When Statistics Were Collected | Current Cardinality |
|---|---|---|---|
| A | Yes | 1,000 | 1,000,000 |
| B | No | unknown | 75,000 |
If a Product Join between table A and table B is necessary for a given query, and one of the tables must be duplicated on all AMPs, then the Optimizer selects table A to be duplicated because, as far as it knows from the available statistics, only 1,000 rows must be duplicated, as opposed to the far greater 75,000 rows of table B.
In reality, table A currently has a cardinality three orders of magnitude larger than when its statistics were collected: one million rows, not one thousand. The Optimizer makes a very bad decision by duplicating the million rows of table A instead of the 75,000 rows of table B. As a result, the query runs much longer than necessary.
Statistics are generally considered stale when either of the following is true:
- The number of rows in the table has changed significantly.
- The range of values for an index or a column of a table for which statistics have been collected has changed significantly.
Sometimes you can infer this from the date and time the statistics were last collected, or by the nature of the column. For example, if the column in question stores transaction dates, and statistics on that column were last gathered a year ago, it is certain that the statistics for that column are stale.
The best way to handle this is to enable the Optimizer to determine when statistics are stale by specifying one or more threshold options when you collect or recollect statistics on a table or single-table view. Collecting and recollecting statistics using one or more of the threshold options makes it possible for the Optimizer to reject recollecting statistics when it determines that the current statistics are not stale.
You can obtain the number of unique values for each statistic on a table, as well as the date and time the statistics were last gathered, using the following request:
HELP STATISTICS table_name;
For statistics on unique indexes, you can cross check the values reported by HELP STATISTICS against the row count returned by the following query:
SELECT COUNT(*) FROM table_name;
For statistics on nonunique columns, you can cross check the HELP STATISTICS report by comparing it with the count returned by the following query:
SELECT COUNT(DISTINCT columnname) FROM table_name;
Teradata Viewpoint Stats Manager
The Stats Manager portlet allows you to manage statistics collection, which includes the ability to collect and analyze statistics, create and control jobs, and manage recommendations.
Use the Stats Manager portlet to:
- View statistics on your system
- Identify statistics to be collected and schedule statistic collection
- Identify and collect missing statistics
- Detect and refresh stale statistics
- Identify and discontinue collecting unused statistics
- View when statistics were last collected and are scheduled for collection again
See Teradata® Viewpoint User Guide, B035-2206 for information about the uses and capabilities of this portlet.