16.10 - Derived Statistics - Teradata Database

Teradata Database SQL Request and Transaction Processing

Teradata Database
June 2017
Programming Reference
User Guide

The initial cardinality estimates based on interval histogram statistics, from which the Optimizer begins the process of query optimization, are adjusted dynamically throughout the optimization process to gain increasing accuracy by applying information derived from various database constraints, query predicates, and hash and join indexes.

Because of the way the system calculates these statistics, they are referred to as derived statistics. Derived statistics enable the Optimizer to achieve the same end result of dynamically reoptimizing a request in mid-query when it has been determined to have a suboptimal plan without having to reoptimize the entire query from the beginning.

Definition of Derived Statistics

Derived statistics are snapshots, derived from base table interval histogram statistics, that the Optimizer initializes and transforms while it optimizes a query. They are cardinality estimates that are transformed from various constraint sources, including query predicates, CHECK and referential integrity constraints, and hash and join indexes, and then adjusted dynamically at each stage of the query optimization process. In other words, derived statistics represent the demographics of column values after applying query predicates and demographic information derived from other sources.

Optimizing queries based on information derived from integrity constraints falls into the general category of semantic query optimization, which was initially developed in the field of deductive databases. Note that semantic query optimization methods can also add or subtract derived constraints from a request in order to better optimize it.

For example, the Join Planner needs to know the demographics of join columns after applying single-table predicates or after making a join. All this information is derived and propagated throughout query optimization using the derived statistics infrastructure, and it is used consistently in all cost and estimation formulas.

Derived statistics are propagated across optimization stages in a flat data structure associated with each relation accessed by the query. An entry in this structure is made for each base table statistic that is required in the query and for the information derived from other sources.

Each entry contains both static and dynamically adjusted information, such as the following:

  • Number of unique values
  • High modal frequency
  • Original interval histogram statistics
  • Uniqueness flag

At the first stage of query optimization, the information in the derived statistics is identical to the base table statistical information stored in the interval histogram. Teradata Database adjusts the initial state information and derives new information during the different stages of join planning, such as the following:

  • After applying single-table predicates
  • After doing a binary join
  • After doing aggregation

Because the original interval histogram statistics are also attached to every derived statistics entry, the Optimizer always has access to the original column or index demographics if it needs them.

All Optimizer estimation and costing logic obtains its demographic information from a single source: the derived statistics data structure. This enables all stages of query optimization to see a single version of the truth.

The derived statistics framework also employs several techniques for dealing with stale statistics, including comparing cardinality estimates obtained from a dynamic AMP sample with the statistics stored in the relevant interval histogram, and bidirectional inheritance of statistics between a base table and its supporting indexes, using whichever set has the more recent collection timestamp. See Statistical Inheritance by Hash and Join Indexes.

For more information about stale statistics, see Using Derived Statistics to Compensate for Stale Statistics.

Bidirectional inheritance is the term used to describe how base tables and their underlying indexes are able to inherit and use existing statistics from one another when either database object in an index-table pair has no existing interval histogram statistics.

If both database objects have existing interval histogram statistics, the Optimizer assumes that the more recently collected statistics are the more accurate, so it uses the set with the more recent collection timestamp.

Derived Statistics Processes

Rather than reverting to the original interval histogram statistics at the beginning of each stage of the optimization process, the Optimizer propagates all newly derived statistics to subsequent phases of optimization, refining its estimates as it goes. This applies to both single-table cardinality estimates and to join cardinality estimates. Such dynamic readjustment of cardinality estimates greatly reduces the multiplicative error propagation that otherwise produces increasingly, and often significantly less accurate, join cardinality estimates at each stage of join space analysis.

  • Derived statistics derive and adjust their estimates based on several criteria.
IF the statistics are for … THEN they are based on the predicates for this type of operation …
single tables single-table.
join operations joins.
aggregation operations grouping keys.

For example:

  • Deriving the number of unique values using single-table predicates.
         SELECT *
         FROM t1,t2
         WHERE d1=d2
         AND   d1 IN (1,2,3,4);

    It is possible to derive the number of unique values, 4, just from the single-table predicate of this query.

  • Deriving the number of unique values using single-table predicates and existing multicolumn statistics.

    Note that as a general rule, the system adjusts multicolumn demographics based on single-table predicates rather than vice versa. For example, suppose multicolumn statistics have been collected on (c1, d1). In the following request, the system adjusts the demographics of (c1, d1) based on the derived statistics for the single-table equality predicate d1=d2.

         SELECT *
         FROM t1,t2
         WHERE d1 = d2
         AND   c1 > 10;
  • Deriving the number of unique values using row partition elimination and multicolumn PARTITION statistics.

    In the following revision of the previous case, the second component of the predicate has been changed to an equality condition. If multicolumn statistics have been collected on (c1, d1), the system adjusts the derived demographics of the new entry based on the single-table predicate as well as deriving new demographics for d1.

         SELECT *
         FROM t1,t2
         WHERE d1 = d2
         AND   c1 = 10;
  • Deriving statistics on join columns using join predicates.
         SELECT x1, y1, COUNT(*)
         FROM t1,t2
         WHERE x1=x2
         AND   y1=y2;

    Assume for this case that the number of values for (x1,y1) = 100 and the number of values for (x2,y2)=50.

    The Optimizer derives the logically correct join cardinality of (x1,y1) as 50 and propagates that value to the join result for use in the next stage of the join ordering process.

  • Deriving the number of unique values from aggregate join index cardinality.

    The definition of the aggregate join index is as follows.

         CREATE JOIN INDEX aggji AS
           SELECT c1, d1, SUM(b1)
           FROM t1
           WHERE x1>10
           AND   y1=10
           GROUP BY c1, d1;

    The query is the following SELECT request.

         SELECT *
         FROM t1,t2
         WHERE t1.c1=t2.c2
         AND   x1>10
         AND   y1=10;

    The Optimizer takes the cardinality of (c1,d1) to be equal to the cardinality of the join index aggji.

  • Deriving the demographics of a base table from a sparse single-table join index defined on it.

    The definition of the single-table join index is as follows:

         CREATE JOIN INDEX stji AS
           SELECT c1,d1
           FROM t1
           WHERE x1 > 10
           AND   y1 = 10;

    The query is the following SELECT request:

         SELECT *
         FROM t1,t2
         WHERE c1 = c2
         AND   x1 > 10
         AND   y1 = 10;

    The Optimizer adjusts the demographics for t1.c1 based on the interval histogram statistics it reads for stji.c1.

  • Deriving hierarchical relationships from single- and multicolumn statistics.

    Hierarchical relationships between values are used to update and propagate demographic changes based on single-table predicates.

    Define the number of values in x as Nx.

    Define the number of values in y as Ny.

    Define the number of values in (x,y) as Nxy.

IF … THEN the value mapping between … AND the value mapping between …
Nx = Nxy Nx -->  Ny is derived as 1 -->  1 Ny -->  Nx is derived as 1 -->  Nxy/Ny
Ny = Nxy Ny -->  Nx is derived as 1 -->  1 Nx -->  Ny is derived as 1 -->  Nxy/Nx

Consider the following example.

     SELECT *
     FROM t1, t2
     WHERE d1 = d2
     AND   c1 = 10;

If the value mapping between c1 and d1 is 1 -->  5, then after applying the single-table condition, the derived statistics framework adjusts the number of unique values for d1 to 5.

  • Deriving the number of unique values using query predicates and column correlations.

    Assume a 1:20 mapping between c1 and d1. The Optimizer can use this information together with the predicates in the following query to estimate the number of unique values.

         SELECT *
         FROM d1,d2
         WHERE d1 = d2
         AND   c1 = 1;

    From the given information, the Optimizer derives the number of unique d1 values, which it determines to be 20.

  • The Optimizer captures and propagates session-level derived statistics across multiple requests on materialized global temporary and volatile tables if the following criteria are all true:
    • The multiple requests are all INSERT … SELECT statements.
    • No DELETE or UPDATE operations occur between successive INSERT … SELECT statements.
    • A SELECT query follows the succession of INSERT … SELECT statements without intervening DELETE or UPDATE operations against the target global temporary or volatile tables.

    Teradata Database automatically collects session-level derived statistics for materialized global temporary and volatile tables that are populated with data during the current session. The system uses session-level derived statistics in the same way it uses other derived statistics to make cardinality estimates for, and cost evaluations of, SQL requests. These session-level statistics are maintained in the global memory for a session and are dropped when the session logs off.

    For example, suppose you have a global temporary table gt1 and a volatile table vt1. The Optimizer retains and propagates the derived statistics from the successive INSERT … SELECT statements to the subsequent query made against gt1 and vt1.

         INSERT INTO gt1
         SELECT *
         FROM t1,t2
         WHERE a1=a2;
         INSERT INTO gt1
         SELECT *
         FROM t3,t4
         WHERE a3=a4;
         INSERT INTO vt1
         SELECT *
         FROM t5,t6
         WHERE a5=a6;
         SELECT *
         FROM gt1,vt1, orders, customers, …
         WHERE …;

    In this example, the Optimizer uses the statistics previously derived during the three preceding INSERT … SELECT operations for gt1 and vt1 in the succeeding query made against those tables.

    Note that these are all individual SQL requests, not component statements of a multistatement request.

    Also note the following counter indication: if there are existing statistics on a materialized global temporary table, then those statistics override any derived statistics.

    Global temporary and volatile tables populated in the following ways qualify for session-level derived statistics.

    • One or multiple INSERT … SELECT requests
    • CREATE TABLE AS … WITH DATA followed by one or multiple INSERT … SELECT requests.

    Session-level histograms are also inherited if the source table in an INSERT … SELECT or CREATE TABLE AS … WITH DATA request is a single table with no predicates. Note that the inherited statistics do not override collected statistics, and if a table is updated by DELETE or UPDATE requests after it has been populated, the system invalidates its session-level statistics.

    The Optimizer propagates derived statistics across the multiple SQL statements of a multistatement request under the following scenario.

         INSERT INTO t3
         SELECT a1,a2,b1,b2
         FROM t1,t2
         WHERE a1=a2
        ;SELECT *
         FROM t4,t3
         WHERE a4=a3;

    The system propagates the statistics derived for target table t3 from the INSERT … SELECT statement to t3 for the SELECT statement in the next part of this multistatement request.

  • The Optimizer acts proactively against stale statistics, using the following method to detect and adjust them during the process of adjusting derived statistics. This is in addition to the threshold logic used to determine whether statistics need be recollected or not.

    If the cardinality for a table or index differs by more than n percent from the cardinality determined by a dynamic-AMP sample, then that statistic is deemed to be stale.

    See Stale Statistics for further information.

    Teradata Database changes the existing cardinality count to the cardinality determined from a dynamic-AMP sample if all the following criteria are true.

    • The primary index for the table, hash index, or join index is not skewed.
    • The cardinality of the database object is greater than the number of AMPs for the system.
    • The sampled cardinality is greater than n percent of the estimated total cardinality for the database object.

      The value of n is set in your Cost Profile using the RASHistogramDeviationPct flag or RASHistAbsDeviation flags or both. For more information, see Optimizer Cost Profiles. The default value is 10 percent because the guideline on when statistics should be refreshed is to recollect when there is a change in the number of rows for a table or index subtable of 10 percent or more.

    When Teradata Database adjusts its cardinality estimate, it also adjusts its estimate of the number of unique values for rolling columns accordingly. For more information, see Definitions of Rolling, Static, and Hybrid Columns.

    When both join index and base table statistics are available, the Optimizer uses the more current estimates of the 2 based on their respective collection timestamps.

  • Single-table statistics, both on single columns and multiple columns, are inherited bidirectionally.
    • If a base table has statistics, but a non-sparse single-table join index or hash index subtable on that table has none, then the index inherits the base table statistics as its own.
    • If a non-sparse single-table join index or hash index subtable has statistics, but the base table on which the index is defined has none, then the base table inherits the index statistics as its own.
  • Single-level PARTITION statistics are inherited bidirectionally for single-column statistics only.

    Note that PARTITION statistics can also be collected on nonpartitioned and column-partitioned tables, and the Optimizer can use those statistics to estimate or extrapolate table cardinalities.

    The value for the column partition number of the system-derived column PARTITION for a column-partitioned table is always 1, so collected statistics on a PARTITION column equate to the number of rows in each combined row partition.

    Teradata Database does not support the bidirectional inheritance of multilevel PARTITION#L n statistics.

  • Interval histogram statistics are inherited from their supersets when available.

    For example, consider the following scenario:

    • An index is defined on column x1, but no statistics have been collected on it; therefore, there is no histogram for x1.
    • Multicolumn statistics have been collected on (x1,y1).
    • The query specifies a range predicate on x1.

    A dynamic AMP sample is not useful for making a range estimate; therefore, the Optimizer uses the histogram that is available for (x1,y1) to make its cardinality estimate, which it then stores in the x1 entry of the derived statistics data structure to be propagated to the next stage of the optimization process.

Derived Statistics Flow

The following table and join index definitions are used for the query that illustrates the flow of derived statistics usage by the Optimizer to generate more accurate cardinality estimates:

     CREATE TABLE t2 (
       b2 INTEGER,
       c2 CHARACTER(1) CHECK (c2 IN ('M', 'F')),
       d2 DATE);

     CREATE TABLE t3 (
       a3 INTEGER,
       b3 INTEGER,
       c3 CHARACTER(5),
       d3 INTEGER);

       SELECT a1, d1
       FROM t1
       WHERE b1 > 10
       AND   c1 = 'Teradata');

       SELECT a3, d3, COUNT(*)
       FROM t3
       WHERE b3 < 100
       GROUP BY 1,2);

Given these table and join index definitions, the flow chart that follows later in this topic shows how the Optimizer uses derived statistics to more accurately estimate cardinalities for the following query:

     SELECT *
     FROM t1, t2, t3
     WHERE b1 > 10
     AND   c1 = ‘Teradata’
     AND   b3 < 50
     AND   d1 = d2
     AND   a2 = a3
     AND   d2 = d3;

The stages of deriving various statistics for this query are as follows:

  1. Refine and derive base table statistics using join index statistics.
  2. Join tables t1 and t2, consuming the term d1=d2 and producing the interim join relation R1.

    d1 and d2 are then merged into an EquiSet (see Using Join Predicate Redundancy to Derive Column Demographics After Each Binary Join), which takes the smaller of the 2 unique values cardinalities as MIN(200,1500), or 200.

    The entries for d1 and d2 are then removed from the derived statistics set.

  3. Join table t3 with the interim join relation R1, consuming the terms a2=a3 and d2=d3 and producing the following set of derived statistics cardinalities for join relation R2.
            Column Set Number of Unique Values
(b1)                  1000
(c1)                  2000
(d1, d2, d3)                    200
(a2, a3)                    100
(a2, d2)                    600
(a3, d3)                    600


Column Heading Represents
Col the column set for which the number of unique values is derived by the derived statistics subsystem.
NUV the number of unique values for a column set as derived by the derived statistics subsystem.