16.10 - Optimizer Statistics and Demographics - Teradata Database

Teradata Database SQL Request and Transaction Processing

prodname
Teradata Database
vrm_release
16.10
created_date
June 2017
category
Programming Reference
User Guide
featnum
B035-1142-161K

Functions of Optimizer Statistics and Demographics

The Optimizer uses statistics and demographics for several different purposes. Without full or all-AMP sampled statistics, query optimization must rely on extrapolation (see Using Extrapolation to Replace Stale Statistics), object use counts (see Object Use and UDI Counts), or dynamic AMP sample estimates of table cardinality, which does not collect all of the statistics that a COLLECT STATISTICS request does.

Besides estimated cardinalities, dynamic AMP samples also collect a few other statistics, but far fewer than are collected by a COLLECT STATISTICS request.

Statistics and demographics provide the Optimizer with information it uses to reformulate queries in ways that permit it to produce the least costly access and join plans. The critical issues you must evaluate when deciding whether to collect statistics are not whether query optimization can or cannot occur in the face of inaccurate statistics, but the following pair of antagonistic questions:

  • How accurate must the available statistics be in order to generate the best possible query plan?
  • How poor a query plan are you willing to accept?

It is extremely important to have reasonably accurate statistical and demographic data about your tables at all times.

For example, if the statistics for a table are stale, the Optimizer might estimate the cardinality after one processing step as 10,000 rows, when the actual query returns only 15 rows at that step. The remaining steps in the query plan are all thrown off by the misestimation that led to the result of step 1, and the performance of the request is suboptimal.

The Optimizer avoids many of the problems that histogram statistics have presented historically in two ways.

Purposes of Statistics and Demographics

The following list is a very high-level description of the most important purposes for column and index statistics and demographics:

  • The Optimizer uses statistics and demographics to determine whether it should generate a query plan that uses an index instead of performing a full-table scan.
  • The Optimizer uses statistics and demographics to estimate the cardinalities of intermediate spools based on the qualifying conditions specified by the request.

    The estimated cardinality of intermediate results is critical for the determination of both optimal join orders for tables and the kind of join method that should be used to make those joins.

    For example, should 2 tables or spools be redistributed and then merge joined, or should one of the tables or spools be duplicated and then Product Joined with the other. Depending on how accurate the statistics are, the generated join plan can vary so greatly that the same query can take only seconds to complete using one join plan, but take hours to complete using another.

  • Statistics collected on the PARTITION system-derived column permit the Optimizer to better estimate costs and cardinalities. This is also true for nonpartitioned tables and column-partitioned tables.

    Teradata Database also uses PARTITION statistics for estimates when predicates are based on the PARTITION column, for example WHERE PARTITION IN (3,4).

    Because the system usually can collect SUMMARY statistics very rapidly, you should collect these statistics for all tables when Teradata Database updates them after ETL jobs complete. You should also recollect PARTITION statistics for row-partitioned tables after ETL jobs complete or any time that you modify a partitioning expression using an ALTER TABLE request.

    This is another reason that you should keep your statistics as current as you can.

For information about the Viewpoint Stats Manager, see the Teradata Viewpoint User Guide.

In some cases, dynamic all-AMP (see Sampled Statistics) or dynamic single-AMP sampled statistics (see Dynamic AMP Sampling) are not accurate enough for the Optimizer to generate an optimal, or even a good, join plan. However, it is often true that statistics collected by sampling small subpopulations of table rows can be as good as those collected using a full-table scan.

The value of collecting full-table statistics is that they provide the Optimizer with the most accurate information that can be gathered for making the best possible plan cost estimates. In most cases, the creation and propagation of derived statistics by the Optimizer can balance all but the most stale statistics (see Derived Statistics).

Statistical accuracy is fundamentally important for any plan because the effect of suboptimal access and join plans generated from inaccurate statistics, of which there can be many in the optimization of a complex query, is multiplicative.

A plan generated using full-table statistics is expected to be at least as good as a plan generated using any form of sampled statistics. There is a high probability that a plan based on full-table statistics will be better, and sometimes significantly better, than a plan based on any form of sampled statistics.

Automatic Collection and Recollection of Statistics

The THRESHOLD options for the COLLECT STATISTICS statement enable you to set periodic recollection thresholds for recollecting statistics. Importantly, if the thresholds you set are not met, the existing statistics are considered not to be stale, so Teradata Database does not recollect them even if you submit an explicit COLLECT STATISTICS request. The metadata that Teradata Database uses to manage the automatic recollection of statistics is stored in several tables in the TDSTATS database.

Several features work with the COLLECT STATISTICS statement to manage the statistics that you collect, including the STATSUSAGE and USECOUNT options for the BEGIN QUERY LOGGING and REPLACE QUERY LOGGING statements and a number of APIs that can analyze database objects to identify situations where statistics management can be improved by the system. For documentation of these APIs, see Application Programming Reference.

Relative Benefits of Collecting Full-Table and Sampled Statistics

When viewed in isolation, the decision between full-table and all-AMPs sampled statistics is a simple one: always collect full-table statistics, because they provide the best opportunity for producing optimal query plans.

While statistics collected from a full-table scan are an accurate representation of the entire domain, an all-AMPs sample estimates statistics based on a small sample of the domain, and a dynamic AMP sample is not only based on an even smaller, more cardinality- and skew-sensitive sample of the domain, it also estimates fewer statistics.

Unfortunately, this decision is not so easily made in a production environment. Other factors must be accounted for, including the length of time required to collect the statistics and the resource consumption burden the collection of full-table statistics incurs.

The optimal compromise is provided by collecting object use and UDI counts (see Object Use and UDI Counts). The Optimizer can make a very accurate estimate of current table cardinality by adding the current INSERT count to the current cardinality estimate from residual collected statistics and then subtracting the current DELETE count from that sum.

In a production environment running multiple heavy query workloads, the problem concerns multiple levels of optimization.

Level Type of Optimization Considerations
Bottom Query If collecting full-table statistics makes queries run faster, what reasons could there be for collecting less accurate statistical samples?
Middle Workload If the act of collecting full-table statistics makes the system run slower, is there any way to collect statistical samples of table populations that are reasonably accurate and that will produce reasonably good query plans?
Top Mix What mix of query and workload optimization is best for overall system performance?

The following table compares the various characteristics of the three methods of collecting statistics and documents their respective most productive uses:

Method Characteristics Best Use
Full statistics
  • Collects all statistics for the data.
  • Time consuming.
  • Most accurate of the three methods of collecting statistics.
  • Stored in interval histograms in the Data Dictionary.
  • Best choice for columns or indexes with highly skewed data values.
  • Recommended for tables with fewer than 1,000 rows per AMP.
  • Recommended for selection columns having a moderate to low number of distinct values.
  • Recommended for most NUSIs, PARTITION columns, and other selection columns because collection time on NUSIs is very fast.
  • Recommended for all column sets or indexes where full statistics add value, and where sampling does not provide satisfactory statistical estimates.
Sampled statistics
  • Collects all statistics for the data, but not by accessing all rows in the table.
  • Significantly faster collection time than full statistics.
  • Stored in interval histograms in the Data Dictionary.
  • Acceptable for columns or indexes that are highly singular; meaning that their number of distinct values approaches the cardinality of the table.
  • Recommended for unique columns, unique indexes, and for columns or indexes that are highly singular.

    Experience suggests that sampled statistics are useful for very large tables; meaning tables with tens of billions of rows.

  • Not recommended for tables whose cardinality is less than 20 times the number of AMPs in the system.
Dynamic AMP sample
  • Estimates fewer statistics than COLLECT STATISTICS does.

    Statistics estimated include Cardinalities and Average rows per value for all columns.

    For nonunique secondary indexes only, the following additional statistics are estimated: Average rows per index, Average size of the index per AMP ,and Number of distinct values.

  • Extremely fast collection time, so is not detectable.
  • Stored in the file system data block descriptor for the table, not in interval histograms in the Data Dictionary.
  • Occurs automatically. Cannot be invoked by user.
  • Automatically refreshed when batch table INSERT … DELETE operations exceed a threshold of 10% of table cardinality.

    Cardinality is not refreshed by individual INSERT or DELETE requests even if the sum of their updates exceed the 10% threshold.

  • Cached with the data block descriptor.
  • Not used for non-indexed selection criteria or indexed selection with non-equality conditions.
  • Good for cardinality estimates when there is little or no skew and the table has significantly more rows than the number of AMPs in the system.
  • Collects reliable statistics for NUSI columns when there is limited skew and the table has significantly more rows than the number of AMPs in the system.
  • Useful as a temporary fallback measure for columns and indexes on which you have not yet decided whether to collect statistics or not.

    Dynamic AMP sampling provides a reasonable fallback mechanism for supporting the optimization of newly devised ad hoc queries until you understand where collected statistics are needed to support query plans for them.

    Teradata Database stores cardinality estimates from dynamic AMP samples in the interval histogram for estimating table growth even when complete, fresh statistics are available.

To avoid recollecting statistics when the existing statistics are still fresh, you should use one or more of the THRESHOLD options that you can specify for a COLLECT STATISTICS request. When you specify a threshold that must be met for statistics on an index or column set to be recollected, you enable Teradata Database to use various methods to determine whether the statistics it already has have become stale or not. If the existing statistics meet your criteria for freshness, Teradata Database ignores a request you submit to recollect those statistics and only recollects them when they meet your specified criteria. You can specify similar criteria for recollecting statistics using sampling rather than a full-table scan. The Optimizer uses these directives together with object use count information to determine when unused statistics should be recollected or dropped. See Object Use and UDI Counts and “COLLECT STATISTICS (Optimizer Form)” in SQL Data Definition Language for further information about using thresholds and sampling criteria with your recollections of index and column statistics.

Where the Optimizer Looks for Statistics When It Is Building a Query Plan

The following flowchart diagrams the process by which the Optimizer looks for statistics when it is creating a query plan for a request:



The general process stages the Optimizer follows while performing this task are as follows:

  1. On receiving a parsed and rewritten request, the Optimizer looks for the data block descriptor (DBD) in the dictionary cache. If the table DBD is in the dictionary cache, then the Optimizer extracts the dynamic AMP samples it needs. If the table DBD is not in the dictionary cache, then the Optimizer reads the DBD from disk and then extracts the dynamic AMP samples it needs.

    Each time a table header is read from disk, Teradata Database collects a fresh dynamic AMP sample of its statistics, places it in the appropriate DBD, which is then cached in the dictionary cache along with the table header, and stores the new cardinality estimate in the interval histogram for the table (when Teradata Database caches histogram statistics, it writes them to the statistics cache rather than the dictionary cache).

    Remember that the statistics derived from a dynamic AMP sample are not as complete as those gathered by a COLLECT STATISTICS request and stored in interval histograms in the dictionary.

  2. The Optimizer calls a routine that looks for collected statistics on each column set or index it needs whenever it requires statistics to build a query plan.
    1. Teradata Database searches the statistics cache for the relevant interval histogram.
    2. If the desired interval histogram is not cached, Teradata Database issues an express request that retrieves it from the appropriate dictionary tables on disk.

      Teradata Database stores the histogram data in the DBC.StatsTbl dictionary table and stores metadata for every automated collection of statistics in various tables in the TDSTATS database, which you can access using the appropriate API procedures documented in Application Programming Reference. Automated collection of statistics is controlled by the THRESHOLD option of the COLLECT STATISTICS statement.

  3. If no statistics have been collected for the desired column set, the Optimizer proceeds as outlined in the following table:
    IF the desired histogram is … THEN …
    located place it in the statistics cache.
    not located determine if the column set represents an index.
    • If the column set is an index, then use the dynamic AMP sampled statistics from the DBD to estimate cardinalities.
    • If the column set is not an index, then use an appropriate heuristic to estimate cardinalities.
  4. Purge the statistics from the statistics cache when the Optimizer recollects statistics.

Using Interval Histogram Statistics or a Dynamic AMP Sample for NUSI Subtable Statistics

Because the statistics contained in the interval histograms for a NUSI subtable (see Interval Histograms) can be so stale that a dynamic AMP sample would produce more accurate cardinality estimates, the Optimizer bases its selection of which statistics to use based on the following set of rules.

  • The number of NUSI subtable index rows in an AMP is the number of distinct values in the index columns.

    The Optimizer assumes that all of the distinct values it samples from a single AMP exist on all of the AMPs in the system.

  • If a NUSI has an interval histogram, and if that histogram has fewer distinct values than a dynamic AMP sample gathers, then the system increases the number of index values to the number of values obtained from the dynamic AMP sample, with the exception of the primary index.

    This counters the problem of stale NUSI statistics.

  • The system sets the number of distinct values for unique indexes to the table cardinality by overriding the values from either the interval histogram or from a dynamic AMP sample.

The implication of this is that you need not refresh the statistics for a NUSI each time new data is loaded into its base table as long as the distribution of the primary index is fairly uniform.

This does not relieve you of the responsibility of collecting or refreshing NUSI statistics for the following situations, where:
  • Tables have a high correlation between the primary index and NUSI column sets.
  • Workloads contain frequently submitted queries with single-table predicates.

    The Optimizer needs interval histograms to make accurate single-table cardinality estimates.

The GeoGrid structure that Teradata Database uses to collect population statistics on both geospatial columns and geospatial NUSIs is stored in the DBC.StatsTbl table as part of the statistics histogram.

For further information, see GeoGrid Population Statistics.

Using Hilbert R-Trees for NUSI Subtable Summary Statistics on Geospatial Columns

Teradata Database supports NUSIs on geospatial columns using GeoGrids for system-wide population statistics (for information about GeoGrids and the statistics they maintain, see GeoGrid Population Statistics) and Hilbert R-trees for summary statistics. The Hilbert R-tree is a indexing structure for multidimensional objects that is organized as a tree with n levels. It is built from a multilevel hierarchy of n-1 directory levels stemming from a single root, or header, block and 1 leaf level. The hierarchy is constructed using directory blocks containing directory rows and leaf blocks containing leaf rows.

A system routine generates a linear value for each MBR in a Hilbert R-tree called a Hilbert value. The Hilbert value for an MBR is associated with its center point, and the Hilbert R-tree in which a Hilbert value is generated uses the value to make navigational decisions within the geospatial NUSI it defines based on which Hilbert value associated with a directory block row is closest to the Hilbert value associated with the entry being inserted into the index.

The following graphic provides a simplified drawing of an R-tree in the context of Teradata NUSIs:



The directory rows contain pointers to the child blocks in the level below them. Both the directory rows and the leaf rows have predetermined fixed footprints. The directory blocks and leaf blocks are both have maximum limits on the number of rows that each block can contain. As more and more rows are inserted into the tree, the individual blocks to contain them must be split each time the maximum row count is exceeded.

R-trees grow from the bottom up, meaning that at first leaf blocks are split; then additional rows must be added to the parent directory block to accommodate the split leaf blocks. The parent block directory must eventually be split, and so on.

The Teradata Database Hilbert R-tree is built on top of the file system. Complex indexes, such as the Hilbert R-tree, are typically realized as a collection of blocks, containing rows, which contain pointers to other blocks.

Time and Resource Consumption Factors in Deciding How to Collect Statistics

The elapsed time to collect statistics and the resources consumed in doing so are the principal factors that mitigate collecting statistics on the entire population of a table rather than collecting what are probably less accurate statistics from a sampled subset of the full population which, in turn, will possibly result in less optimal query plans.

You cannot collect sampled single-column PARTITION statistics. The system allows you to submit such a request, but does not honor it. Instead, Teradata Database sets the sampling percentage to 100%. Sampled collection is permitted for multicolumn PARTITION statistics.

The elapsed time required to collect statistics varies as a function of the following factors.

  • Base table cardinality
  • Number of distinct index or non-indexed column values
  • System configuration

For a 32-node production system, collecting statistics on a single column of a 100 million row table might take an hour. If you collect statistics on multiple columns and indexes of the same table, the process can easily take 4 hours to complete. When you add the time required to collect statistics for numerous smaller tables in the database to the mix, the time required to collect statistics for all the tables can be surprisingly large. You might even decide that the necessary time for collection is excessive for your production environment, particularly if you have a narrow time window for collecting statistics.

Collecting full-table statistics is not just time consuming; it also places a performance burden on the system, consuming CPU and disk I/O resources that would otherwise be devoted to query processing and other application workloads. You might decide that collecting statistics places too many burdens on system resources to justify recollecting statistics on a daily, weekly, or even monthly basis.

After examining all these considerations, you might even conclude that any recollecting of statistics is an unacceptable burden for your production environment.

Because collecting and recollecting statistics can be so time consuming, you should consider specifying one or more of the sampling or threshold options (or both) when you first collect statistics on a table.

These options enable the Optimizer to decide when statistics have become stale rather than forcing a database administrator into the position of making the determination. When these collection options are in place, the Optimizer does not honor requests to recollect statistics that are not stale, so you can submit COLLECT STATISTICS more frequently, knowing that if the Optimizer does not need to freshen the statistics it requires to optimize a request, it does not recollect those statistics needlessly. See “COLLECT STATISTICS (Optimizer Form)” in SQL Data Definition Language for more information about the USING clause sampling and threshold options.

An Example of How Stale Statistics Can Produce a Poor Query Plan

The optimal way to determine whether statistics that have been collected on a database object are stale is to use the THRESHOLD logic of the COLLECT STATISTICS (Optimizer Form) statement (see “COLLECT STATISTICS (Optimizer Form)” in SQL Data Definition Language).

The following example is extreme, but is instructive in demonstrating how negatively bad statistics can affect the query plan the Optimizer generates.

Consider two tables, A and B:

Table Name Statistics

Collected?

 Cardinality When Statistics Were Collected Current Cardinality
A Yes 1,000 1,000,000
B No unknown 75,000

If a Product Join between table A and table B is necessary for a given query, and one of the tables must be duplicated on all AMPs, then the Optimizer will select table A to be duplicated because, as far as it knows from the available statistics, only 1,000 rows must be redistributed, as opposed to the far greater 75,000 rows from table B.

In reality, table A currently has a cardinality that is three orders of magnitude larger than its cardinality at the time its statistics were collected: one million rows, not one thousand rows, and the Optimizer makes a very bad decision by duplicating the million rows of table A instead of the 75,000 rows of table B). As a result, the query runs much longer than necessary.

There are 2 general circumstances under which statistics can be considered to be stale.

  • The number of rows in the table has changed significantly.
  • The range of values for an index or a column of a table for which statistics have been collected has changed significantly.

    Sometimes you can infer this from the date and time the statistics were last collected, or by the nature of the column. For example, if the column in question stores transaction dates, and statistics on that column were last gathered a year ago, it is certain that the statistics for that column are stale.

The best way to handle this is to enable the Optimizer to determine when statistics are stale by specifying one or more threshold options when you collect or recollect statistics on a table or single-table view. Collecting and recollecting statistics using one or more of the threshold options makes it possible for the Optimizer to reject recollecting statistics when it determines that the current statistics are not stale.

You can obtain the number of unique values for each statistic on a table, as well as the date and time the statistics were last gathered, using the following request:

     HELP STATISTICS table_name;

For statistics on unique indexes, you can cross check values reported by HELP STATISTICS by comparing the row count returned by the following query:

     SELECT COUNT(*)
     FROM table_name;

For statistics on nonunique columns, you can cross check the HELP STATISTICS report by comparing it with the count returned by the following query:

     SELECT COUNT(DISTINCT columnname)
     FROM table_name;

Teradata Viewpoint Stats Manager

The Stats Manager portlet allows you to manage Teradata Database statistics collection, which includes the ability to collect and analyze statistics, create and control jobs, and manage recommendations.

You can use Stats Manager to perform the following tasks:

  • View statistics on your system
  • Identify statistics to be collected and schedule statistic collection
  • Identify and collect missing statistics
  • Detect and refresh stale statistics
  • Identify and discontinue collecting unused statistics
  • View when statistics were last collected and are scheduled for collection again

See Teradata Viewpoint User Guide for information about the uses and capabilities of this portlet.