Optimizer Statistics and Demographics

Teradata Vantage™ - SQL Request and Transaction Processing

Product
Advanced SQL Engine
Teradata Database
Release Number
17.10
Release Date
July 2021
Content Type
Programming Reference
User Guide
Publication ID
B035-1142-171K
Language
English (United States)

Functions of Optimizer Statistics and Demographics

The Optimizer uses statistics and demographics for several different purposes. Without full or all-AMP sampled statistics, query optimization must rely on extrapolation (see Using Extrapolation to Replace Stale Statistics), object use counts (see Object Use and UDI Counts), or dynamic AMP sample estimates of table cardinality, which do not provide all of the statistics that a COLLECT STATISTICS request collects.

Statistics and demographics provide the Optimizer with information it uses to reformulate queries in ways that permit it to produce the least costly access and join plans. The critical issues you must evaluate when deciding whether to collect statistics are not whether query optimization can or cannot occur in the face of inaccurate statistics, but the following pair of antagonistic questions:
  • How accurate must the available statistics be in order to generate the best possible query plan?
  • How poor a query plan are you willing to accept?

It is extremely important to have reasonably accurate statistical and demographic data about your tables at all times.

For example, if the statistics for a table are stale, the Optimizer might estimate the cardinality after one processing step as 10,000 rows, when the actual query returns only 15 rows at that step. The remaining steps in the query plan are all thrown off by the step 1 misestimate, and the performance of the request is suboptimal.

The Optimizer avoids many of the problems that histogram statistics have presented historically in two ways.
  • By collecting object use and UDI counts (see Object Use and UDI Counts).
  • By saving the dynamic AMP sample statistics, PARTITION statistics, or both, as attributes in the interval histogram. PARTITION statistics can be collected more quickly than other statistics and can be collected for row- and column-partitioned tables and for nonpartitioned tables.

    Because dynamic AMP statistics for a table are always collected from the same AMP set (see Dynamic AMP Sampling), they can also be used to detect table growth with a high degree of accuracy (see Using Extrapolated Cardinality Estimates to Assess Table Growth).

Purposes of Statistics and Demographics

The following list is a very high-level description of the most important purposes for column and index statistics and demographics:
  • The Optimizer uses statistics and demographics to determine whether it should generate a query plan that uses an index instead of performing a full-table scan.
  • The Optimizer uses statistics and demographics to estimate the cardinalities of intermediate spools based on the qualifying conditions specified by the request.

    The estimated cardinality of intermediate results is critical for the determination of both optimal join orders for tables and the kind of join method that should be used to make those joins.

    For example, should two tables or spools be redistributed and then merge joined, or should one of them be duplicated and then Product Joined with the other? Depending on how accurate the statistics are, the generated join plan can vary so greatly that the same query can take only seconds to complete using one join plan, but hours to complete using another.

  • Statistics collected on the PARTITION system-derived column permit the Optimizer to better estimate costs and cardinalities. This is also true for nonpartitioned tables and column-partitioned tables.

    Vantage also uses PARTITION statistics for estimates when predicates are based on the PARTITION column, for example WHERE PARTITION IN (3,4).

    Because the system usually can collect SUMMARY statistics very rapidly, you should collect them for all tables after ETL jobs complete. You should also recollect PARTITION statistics for row-partitioned tables after ETL jobs complete or any time you modify a partitioning expression using an ALTER TABLE request.
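    For example, a post-ETL recollection might look like the following sketch (the table name Sales_Txn is hypothetical; see COLLECT STATISTICS (Optimizer Form) in Teradata Vantage™ - SQL Data Definition Language Syntax and Examples, B035-1144 for the full syntax):

```sql
-- Refresh table-level demographics quickly after an ETL job
-- (hypothetical table name).
COLLECT SUMMARY STATISTICS ON Sales_Txn;

-- Recollect PARTITION statistics for a row-partitioned table
-- after ETL completes or after ALTER TABLE changes the
-- partitioning expression.
COLLECT STATISTICS COLUMN (PARTITION) ON Sales_Txn;
```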

    This is another reason that you should keep your statistics as current as you can.

For information about the Viewpoint Stats Manager, see the Teradata® Viewpoint User Guide, B035-2206.

In some cases, dynamic all-AMP (see Sampled Statistics) or dynamic single-AMP sampled statistics (see Dynamic AMP Sampling) are not accurate enough for the Optimizer to generate an optimal, or even a good, join plan. However, it is often true that statistics collected by sampling small subpopulations of table rows can be as good as those collected using a full-table scan.

The value of collecting full-table statistics is that they provide the Optimizer with the most accurate information that can be gathered for making the best possible plan cost estimates. In most cases, the creation and propagation of derived statistics by the Optimizer can compensate for all but the most stale statistics (see Derived Statistics).

Statistical accuracy is fundamentally important for any plan because the effect of suboptimal access and join plans generated from inaccurate statistics, of which there can be many in the optimization of a complex query, is multiplicative.

A plan generated using full-table statistics is expected to be at least as good as a plan generated using any form of sampled statistics. There is a high probability that a plan based on full-table statistics will be better, and sometimes significantly better, than a plan based on any form of sampled statistics.

Automatic Collection and Recollection of Statistics

The THRESHOLD options for the COLLECT STATISTICS statement enable you to set periodic recollection thresholds for recollecting statistics. Importantly, if the thresholds you set are not met, the existing statistics are considered not to be stale, so Vantage does not recollect them even if you submit an explicit COLLECT STATISTICS request. The metadata that Vantage uses to manage the automatic recollection of statistics is stored in several tables in the TDSTATS database.
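As a sketch of how such thresholds are declared (table and column names are hypothetical), a single COLLECT STATISTICS request can specify both a change threshold and an age threshold:

```sql
-- Recollect only when more than 10% of the rows have changed
-- or the statistics are more than 7 days old; otherwise the
-- recollection request is skipped.
COLLECT STATISTICS
  USING THRESHOLD 10 PERCENT AND THRESHOLD 7 DAYS
  COLUMN (Cust_ID)
ON Sales_Txn;
```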

Several features work with the COLLECT STATISTICS statement to manage the statistics that you collect, including the STATSUSAGE and USECOUNT options for the BEGIN QUERY LOGGING and REPLACE QUERY LOGGING statements and a number of APIs that can analyze database objects to identify situations where statistics management can be improved by the system. For documentation of these APIs, see Teradata Vantage™ - Application Programming Reference, B035-1090.
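For instance, the logging options might be enabled as follows (SalesDB is a hypothetical database; see the BEGIN QUERY LOGGING documentation for the complete option list):

```sql
-- Log which statistics the Optimizer used or found missing:
BEGIN QUERY LOGGING WITH STATSUSAGE ON ALL;

-- Track object use and UDI counts for objects in a database:
BEGIN QUERY LOGGING WITH USECOUNT ON SalesDB;
```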

Relative Benefits of Collecting Full-Table and Sampled Statistics

When viewed in isolation, the decision between full-table and all-AMPs sampled statistics is a simple one: always collect full-table statistics, because they provide the best opportunity for producing optimal query plans.

While statistics collected from a full-table scan are an accurate representation of the entire domain, an all-AMPs sample estimates statistics from a small sample of the domain. A dynamic AMP sample is based on an even smaller, more cardinality- and skew-sensitive sample of the domain, and it also estimates fewer statistics.

Unfortunately, this decision is not so easily made in a production environment. Other factors must be accounted for, including the length of time required to collect the statistics and the resource consumption burden the collection of full-table statistics incurs.

The optimal compromise is provided by collecting object use and UDI counts (see Object Use and UDI Counts). The Optimizer can make a very accurate estimate of current table cardinality by adding the current INSERT count to the current cardinality estimate from residual collected statistics and then subtracting the current DELETE count from that sum.
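A sketch of that arithmetic, with invented numbers for illustration (the Optimizer performs this extrapolation internally; it is not a query you run):

```sql
-- Histogram cardinality at last collection: 1,000,000 rows
-- UDI counts since then: 250,000 inserts, 50,000 deletes
-- Extrapolated current cardinality:
--   1,000,000 + 250,000 - 50,000 = 1,200,000 rows
```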

In a production environment running multiple heavy query workloads, the problem concerns multiple levels of optimization.

  • Bottom (query optimization): If collecting full-table statistics makes queries run faster, what reasons could there be for collecting less accurate statistical samples?
  • Middle (workload optimization): If the act of collecting full-table statistics makes the system run slower, is there any way to collect statistical samples of table populations that are reasonably accurate and that will produce reasonably good query plans?
  • Top (optimization mix): What mix of query and workload optimization is best for overall system performance?

The following comparison describes the characteristics of the three methods of collecting statistics and documents their respective most productive uses:

Full statistics
  Characteristics:
  • Collects all statistics for the data.
  • Time consuming.
  • Most accurate of the three methods of collecting statistics.
  • Stored in interval histograms in the Data Dictionary.
  Best use:
  • Best choice for columns or indexes with highly skewed data values.
  • Recommended for tables with fewer than 1,000 rows per AMP.
  • Recommended for selection columns having a moderate to low number of distinct values.
  • Recommended for most NUSIs, PARTITION columns, and other selection columns because collection time on NUSIs is very fast.
  • Recommended for all column sets or indexes where full statistics add value and where sampling does not provide satisfactory statistical estimates.

Sampled statistics
  Characteristics:
  • Collects all statistics for the data, but without accessing all rows in the table.
  • Significantly faster collection time than full statistics.
  • Stored in interval histograms in the Data Dictionary.
  Best use:
  • Acceptable for columns or indexes that are highly singular, meaning that their number of distinct values approaches the cardinality of the table.
  • Recommended for unique columns, unique indexes, and other columns or indexes that are highly singular.

    Experience suggests that sampled statistics are useful for very large tables, meaning tables with tens of billions of rows.

  • Not recommended for tables whose cardinality is less than 1,000 times the number of AMPs in the system.

Dynamic AMP sample
  Characteristics:
  • Estimates fewer statistics than COLLECT STATISTICS does.

    The estimated statistics include cardinality and average rows per value for all columns.

    For nonunique secondary indexes only, the following additional statistics are estimated: average rows per index, average size of the index per AMP, and number of distinct values.

  • Collection time is so brief as to be undetectable.
  • Stored in the file system data block descriptor for the table, not in interval histograms in the Data Dictionary.
  • Occurs automatically and cannot be invoked by users.
  • Automatically refreshed when batch INSERT and DELETE operations exceed a threshold of 10% of table cardinality.

    Cardinality is not refreshed by individual INSERT or DELETE requests even if the sum of their updates exceeds the 10% threshold.

  • Cached with the data block descriptor.
  • Not used for nonindexed selection criteria or for indexed selection with nonequality conditions.
  Best use:
  • Good for cardinality estimates when there is little or no skew and the table has significantly more rows than the number of AMPs in the system.
  • Collects reliable statistics for NUSI columns when there is limited skew and the table has significantly more rows than the number of AMPs in the system.
  • Useful as a temporary fallback measure for columns and indexes on which you have not yet decided whether to collect statistics.

    Dynamic AMP sampling provides a reasonable fallback mechanism for supporting the optimization of newly devised ad hoc queries until you understand where collected statistics are needed to support their query plans.

    Vantage stores cardinality estimates from dynamic AMP samples in the interval histogram for estimating table growth even when complete, fresh statistics are available.

To avoid recollecting statistics that are still fresh, you should use one or more of the THRESHOLD options that you can specify for a COLLECT STATISTICS request. When you specify a threshold that must be met before statistics on an index or column set are recollected, you enable Vantage to use various methods to determine whether the statistics it already has have become stale. If the existing statistics are still fresh by your criteria, Vantage ignores a request you submit to recollect those statistics and recollects them only when your staleness criteria are met. You can specify similar criteria for recollecting statistics using sampling rather than a full-table scan. The Optimizer uses these directives together with object use count information to determine when unused statistics should be recollected or dropped. See Object Use and UDI Counts and the information about COLLECT STATISTICS (Optimizer Form) in Teradata Vantage™ - SQL Data Definition Language Syntax and Examples, B035-1144 for further information about using thresholds and sampling criteria with your recollections of index and column statistics.
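Sampling and threshold criteria can be combined in a single request. The following sketch (hypothetical table and column names) reads roughly 2% of the rows and recollects only when more than 10% of the rows have changed:

```sql
COLLECT STATISTICS
  USING SAMPLE 2 PERCENT AND THRESHOLD 10 PERCENT
  COLUMN (Order_Date)
ON Sales_Txn;
```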

Using Interval Histogram Statistics or a Dynamic AMP Sample for NUSI Subtable Statistics

Because the statistics contained in the interval histograms for a NUSI subtable (see Interval Histograms) can be so stale that a dynamic AMP sample would produce more accurate cardinality estimates, the Optimizer bases its selection of which statistics to use on the following set of rules.
  • The number of NUSI subtable index rows in an AMP is the number of distinct values in the index columns.

    The Optimizer assumes that all of the distinct values it samples from a single AMP exist on all of the AMPs in the system.

  • If a NUSI has an interval histogram, and if that histogram has fewer distinct values than a dynamic AMP sample gathers, then the system increases the number of index values to the number of values obtained from the dynamic AMP sample, with the exception of the primary index.

    This counters the problem of stale NUSI statistics.

  • The system sets the number of distinct values for unique indexes to the table cardinality by overriding the values from either the interval histogram or from a dynamic AMP sample.

The implication of this is that you need not refresh the statistics for a NUSI each time new data is loaded into its base table as long as the distribution of the primary index is fairly uniform.

You must collect or refresh NUSI statistics for the following situations:
  • Tables have a high correlation between the primary index and NUSI column sets.
  • Workloads contain frequently submitted queries with single-table predicates.

    The Optimizer needs interval histograms to make accurate single-table cardinality estimates.

Time and Resource Consumption Factors in Deciding How to Collect Statistics

The elapsed time to collect statistics and the resources consumed in doing so are the principal factors that argue against collecting statistics on the entire population of a table. The alternative, collecting probably less accurate statistics from a sampled subset of the full population, can in turn result in less optimal query plans.

You cannot collect sampled single-column PARTITION statistics. The system allows you to submit such a request, but does not honor it. Instead, Vantage sets the sampling percentage to 100%. Sampled collection is permitted for multicolumn PARTITION statistics.
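For example (hypothetical table and column names), the same sampling clause behaves differently for the two forms:

```sql
-- Single-column PARTITION statistics: the SAMPLE clause is
-- accepted but not honored; Vantage collects at 100 percent.
COLLECT STATISTICS USING SAMPLE 10 PERCENT
  COLUMN (PARTITION) ON Sales_Txn;

-- Multicolumn PARTITION statistics: sampling is honored.
COLLECT STATISTICS USING SAMPLE 10 PERCENT
  COLUMN (PARTITION, Store_ID) ON Sales_Txn;
```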

The elapsed time required to collect statistics varies as a function of the following factors.
  • Base table cardinality
  • Number of distinct index or non-indexed column values
  • System configuration

If you collect statistics on multiple columns and indexes of the same table, the process can easily take two to four times longer to complete than collecting statistics on a single column. When you add the time required to collect statistics for numerous smaller tables in the database, the time required to collect statistics for all the tables can be surprisingly large. You might even decide that the necessary time for collection is excessive for your production environment, particularly if you have a narrow time window for collecting statistics.

Collecting full-table statistics is not just time consuming; it also places a performance burden on the system, consuming CPU and disk I/O resources that would otherwise be devoted to query processing and other application workloads. You might decide that collecting statistics places too many burdens on system resources to justify recollecting statistics on a daily, weekly, or even monthly basis.

After examining all these considerations, you might even conclude that any recollecting of statistics is an unacceptable burden for your production environment.

Because collecting and recollecting statistics can be so time consuming, you should consider specifying one or more of the sampling or threshold options (or both) when you first collect statistics on a table.

These options enable the Optimizer to decide when statistics have become stale rather than forcing a database administrator into the position of making the determination. When these collection options are in place, the Optimizer does not honor requests to recollect statistics that are not stale, so you can submit COLLECT STATISTICS more frequently, knowing that if the Optimizer does not need to freshen the statistics it requires to optimize a request, it does not recollect those statistics needlessly. See the information about COLLECT STATISTICS (Optimizer Form) in Teradata Vantage™ - SQL Data Definition Language Syntax and Examples, B035-1144 for more information about the USING clause sampling and threshold options.

An Example of How Stale Statistics Can Produce a Poor Query Plan

The optimal way to determine whether statistics that have been collected on a database object are stale is to use the THRESHOLD logic of the COLLECT STATISTICS (Optimizer Form) statement (see Teradata Vantage™ - SQL Data Definition Language Syntax and Examples, B035-1144).

The following example is extreme, but is instructive in demonstrating how negatively bad statistics can affect the query plan the Optimizer generates.

Consider two tables, A and B:

Table Name   Statistics Collected?   Cardinality When Statistics Were Collected   Current Cardinality
A            Yes                     1,000                                        1,000,000
B            No                      unknown                                      75,000

If a Product Join between table A and table B is necessary for a given query, and one of the tables must be duplicated on all AMPs, then the Optimizer selects table A to be duplicated because, as far as it knows from the available statistics, only 1,000 rows must be duplicated, as opposed to the far greater 75,000 rows of table B.

In reality, table A currently has a cardinality three orders of magnitude larger than its cardinality at the time its statistics were collected: one million rows, not one thousand. The Optimizer makes a very bad decision by duplicating the million rows of table A instead of the 75,000 rows of table B. As a result, the query runs much longer than necessary.

There are two general circumstances under which statistics can be considered to be stale.
  • The number of rows in the table has changed significantly.
  • The range of values for an index or a column of a table for which statistics have been collected has changed significantly.

    Sometimes you can infer this from the date and time the statistics were last collected, or by the nature of the column. For example, if the column in question stores transaction dates, and statistics on that column were last gathered a year ago, it is certain that the statistics for that column are stale.

The best way to handle this is to enable the Optimizer to determine when statistics are stale by specifying one or more threshold options when you collect or recollect statistics on a table or single-table view. Collecting and recollecting statistics using one or more of the threshold options makes it possible for the Optimizer to reject recollecting statistics when it determines that the current statistics are not stale.

You can obtain the number of unique values for each statistic on a table, as well as the date and time the statistics were last gathered, using the following request:

HELP STATISTICS table_name;

For statistics on unique indexes, you can cross check the values reported by HELP STATISTICS against the row count returned by the following query:

SELECT COUNT(*)
FROM table_name;

For statistics on nonunique columns, you can cross check the HELP STATISTICS report by comparing it with the count returned by the following query:

SELECT COUNT(DISTINCT columnname)
FROM table_name;

Teradata Viewpoint Stats Manager

The Stats Manager portlet allows you to manage statistics collection, which includes the ability to collect and analyze statistics, create and control jobs, and manage recommendations.

You can use Stats Manager to perform the following tasks:
  • View statistics on your system
  • Identify statistics to be collected and schedule statistic collection
  • Identify and collect missing statistics
  • Detect and refresh stale statistics
  • Identify and discontinue collecting unused statistics
  • View when statistics were last collected and are scheduled for collection again

See Teradata® Viewpoint User Guide, B035-2206 for information about the uses and capabilities of this portlet.