Derived Statistics | Optimizer Process | VantageCloud Lake - Derived Statistics - Teradata Vantage

Teradata® VantageCloud Lake

Deployment
VantageCloud
Edition
Lake
Product
Teradata Vantage
Published
January 2023
ft:locale
en-US
ft:lastEdition
2024-12-11
dita:mapPath
phg1621910019905.ditamap
dita:ditavalPath
pny1626732985837.ditaval
dita:id
phg1621910019905

The initial cardinality estimates based on interval histogram statistics, from which the Optimizer begins the process of query optimization, are adjusted dynamically throughout the optimization process to gain increasing accuracy by applying information derived from database constraints, query predicates, and join indexes.

These statistics are called derived statistics. Derived statistics enable the Optimizer to reoptimize a request in mid-query, without having to reoptimize the entire query from the beginning.

Definition of Derived Statistics

Derived statistics are snapshots, derived from base table interval histogram statistics, that the Optimizer initializes and transforms while optimizing a query. Derived statistics are cardinality estimates that are transformed from constraint sources, including query predicates, CHECK and referential integrity constraints, and join indexes, and then adjusted dynamically at each stage of the query optimization process. That is, derived statistics represent the demographics of column values after applying query predicates and demographic information derived from other sources.

Optimizing queries based on information derived from integrity constraints falls into the general category of semantic query optimization, which was initially developed in the field of deductive databases. Semantic query optimization methods can also add or subtract derived constraints from a request to better optimize it.

For example, the Join Planner must know the demographics of join columns after applying single-table predicates or after making a join. This information is derived and propagated throughout query optimization using the derived statistics infrastructure, and is used consistently in all cost and estimation formulas.

The derived statistics framework also uses multiple techniques for dealing with stale statistics, including comparing cardinality estimates from a dynamic AMP sample with the statistics stored in the relevant interval histogram, and bidirectional inheritance of statistics between a base table and its supporting indexes, using whichever set has the more recent collection timestamp. See Statistical Inheritance by Join Indexes.

For more information about stale statistics, see Using Derived Statistics to Compensate for Stale Statistics.

Bidirectional inheritance describes how base tables and their underlying indexes can inherit and use existing statistics when either database object in an index-table pair has no existing interval histogram statistics.

If both database objects have interval histogram statistics, the Optimizer assumes that the more recently collected statistics are more accurate, and uses the set with the more recent collection timestamp.

Derived Statistics Flow

The following table and join index definitions are used for the query that shows the flow of derived statistics usage by the Optimizer to generate more accurate cardinality estimates:

CREATE TABLE t2 (
  a2 INTEGER PRIMARY KEY,
  b2 INTEGER,
  c2 CHARACTER(1) CHECK (c2 IN ('M', 'F')),
  d2 DATE);

CREATE TABLE t3 (
  a3 INTEGER,
  b3 INTEGER,
  c3 CHARACTER(5),
  d3 INTEGER);

CREATE JOIN INDEX ji_t1 AS
  SELECT a1, d1
  FROM t1
  WHERE b1 > 10
  AND   c1 = 'Teradata');

CREATE JOIN INDEX aji_t3 AS
  SELECT a3, d3, COUNT(*)
  FROM t3
  WHERE b3 < 100
  GROUP BY 1,2);

Given these table and join index definitions, the flow chart that follows later in this topic shows how the Optimizer uses derived statistics to more accurately estimate cardinalities for the following query:

SELECT *
FROM t1, t2, t3
WHERE b1 > 10
AND   c1 = ‘Teradata’
AND   b3 < 50
AND   d1 = d2
AND   a2 = a3
AND   d2 = d3;

The stages of deriving statistics for this query are as follows:

  1. Refine and derive base table statistics using join index statistics.
  2. Join tables t1 and t2, consuming the term d1=d2 and producing the interim join relation R1.

    d1 and d2 are then merged into an EquiSet (see Using Join Predicate Redundancy to Derive Column Demographics After Each Binary Join), which takes the smaller of the 2 unique values cardinalities as MIN(200,1500), or 200.

    The entries for d1 and d2 are then removed from the derived statistics set.

  3. Join table t3 with the interim join relation R1, consuming the terms a2=a3 and d2=d3 and producing the following set of derived statistics cardinalities for join relation R2.
R2  
            Column Set Number of Unique Values
(b1)                  1000
(c1)                  2000
(d1, d2, d3)                    200
(a2, a3)                    100
(a2, d2)                    600
(a3, d3)                    600

Table representation of derived statistics

Column Heading Represents
Col the column set for which the number of unique values is derived by the derived statistics subsystem.
NUV the number of unique values for a column set as derived by the derived statistics subsystem.