The initial cardinality estimates based on interval histogram statistics, from which the Optimizer begins the process of query optimization, are adjusted dynamically throughout the optimization process to gain increasing accuracy by applying information derived from various database constraints, query predicates, and hash and join indexes.
Because of the way the system calculates these statistics, they are referred to as derived statistics. Derived statistics enable the Optimizer to achieve the same end result of dynamically reoptimizing a request in mid‑query when it has been determined to have a suboptimal plan without having to reoptimize the entire query from the beginning (Kabra and DeWitt, 1998).
Derived statistics are snapshots, derived from base table interval histogram statistics, that the Optimizer initializes and transforms while it optimizes a query. They are cardinality estimates that are transformed from various constraint sources, including query predicates, CHECK and referential integrity constraints, and hash and join indexes, and then adjusted dynamically at each stage of the query optimization process. In other words, derived statistics represent the demographics of column values after applying query predicates and demographic information derived from other sources.
Optimizing queries based on information derived from integrity constraints falls into the general category of semantic query optimization (see, for example, Chakravarthy et al., 1988; Levy et al., 1994; Levy and Savig, 1995), which was initially developed in the field of deductive databases (see, for example, Chakravarthy et al., 1990; Minker, 1988). Note that semantic query optimization methods can also add or subtract derived constraints from a request in order to better optimize it.
For example, the Join Planner needs to know the demographics of join columns after applying single-table predicates or after making a join. All this information is derived and propagated throughout query optimization using the derived statistics infrastructure, and it is used consistently in all cost and estimation formulas.
Derived statistics are propagated across optimization stages in a flat data structure associated with each relation accessed by the query. An entry in this structure is made for each base table statistic that is required in the query and for the information derived from other sources.
Each entry contains both static and dynamically adjusted information such as.
At the first stage of query optimization, the information in the derived statistics is identical to the base table statistical information stored in the interval histogram. Teradata Database adjusts the initial state information and derives new information during the different stages of join planning such as the following.
Because the original interval histogram statistics are also attached to every derived statistics entry, the Optimizer always has access to the original column or index demographics if it needs them.
All Optimizer estimation and costing logic obtains its demographic information from a single source: the derived statistics data structure. This enables all stages of query optimization to see a single version of the truth.
The derived statistics framework also employs several techniques for dealing with stale statistics (see “Using Derived Statistics to Compensate for Stale Statistics” on page 273), including comparing cardinality estimates obtained from a dynamic AMP sample with the statistics stored in the relevant interval histogram, and bidirectional inheritance of statistics between a base table and its supporting indexes, using whichever set has the more recent collection timestamp (see “Statistical Inheritance by Hash and Join Indexes” on page 220).
Bidirectional inheritance is the term used to describe how base tables and their underlying indexes are able to inherit and use existing statistics from one another when either database object in an index‑table pair has no existing interval histogram statistics.
If both database objects have existing interval histogram statistics, the Optimizer assumes that the more recently collected statistics are the more accurate, so it uses the set with the more recent collection timestamp.
See Kabra and DeWitt (1998) for an earlier dynamic re‑optimization process that is similar to that followed by derived statistics.
Rather than reverting to the original interval histogram statistics at the beginning of each stage of the optimization process, the Optimizer propagates all newly derived statistics to subsequent phases of optimization, refining its estimates as it goes. This applies to both single‑table cardinality estimates and to join cardinality estimates. Such dynamic readjustment of cardinality estimates greatly reduces the multiplicative error propagation that otherwise produces increasingly, and often significantly less accurate, join cardinality estimates at each stage of join space analysis.
IF the statistics are for … |
THEN they are based on the predicates for this type of operation … |
single tables |
single‑table. |
join operations |
joins. |
aggregation operations |
grouping keys. |
For example,
SELECT *
FROM t1,t2
WHERE d1=d2
AND d1 IN (1,2,3,4);
It is possible to derive the number of unique values, 4, just from the single‑table predicate of this query.
Note that as a general rule, the system adjusts multicolumn demographics based on
single‑table predicates rather than vice versa. For example, suppose multicolumn statistics
have been collected on (c1, d1). In the following request, the system adjusts the demographics of (c1, d1) based on the derived statistics for the single‑table equality predicate d1=d2
.
SELECT *
FROM t1,t2
WHERE d1 = d2
AND c1 > 10;
In the following revision of the previous case, the second component of the predicate has been changed to an equality condition. If multicolumn statistics have been collected on (c1, d1), the system adjusts the derived demographics of the new entry based on the single‑table predicate as well as deriving new demographics for d1.
SELECT *
FROM t1,t2
WHERE d1 = d2
AND c1 = 10;
SELECT x1, y1, COUNT(*)
FROM t1,t2
WHERE x1=x2
AND y1=y2;
Assume for this case that the number of values for (x1,y1) = 100 and the number of values for (x2,y2)=50.
The Optimizer derives the logically correct join cardinality of (x1,y1) as 50 and propagates that value to the join result for use in the next stage of the join ordering process.
The definition of the aggregate join index is as follows.
CREATE JOIN INDEX aggji AS
SELECT c1, d1, SUM(b1)
FROM t1
WHERE x1>10
AND y1=10
GROUP BY c1, d1;
The query is the following SELECT request.
SELECT *
FROM t1,t2
WHERE t1.c1=t2.c2
AND x1>10
AND y1=10;
The Optimizer takes the cardinality of (c1,d1) to be equal to the cardinality of the join index aggji.
The definition of the single‑table join index is as follows.
CREATE JOIN INDEX stji AS
SELECT c1,d1
FROM t1
WHERE x1 > 10
AND y1 = 10;
The query is the following SELECT request.
SELECT *
FROM t1,t2
WHERE c1 = c2
AND x1 > 10
AND y1 = 10;
The Optimizer adjusts the demographics for t1.c1 based on the interval histogram statistics it reads for stji.c1.
Hierarchical relationships between values are used to update and propagate demographic changes based on single‑table predicates.
Define the number of values in x as Nx.
Define the number of values in y as Ny.
Define the number of values in (x,y) as Nxy.
IF … |
THEN the value mapping between … |
AND the value mapping between … |
Nx = Nxy |
Nx → Ny is derived as 1 → 1 |
Ny → Nx is derived as 1 → Nxy/Ny |
Ny = Nxy |
Ny → Nx is derived as 1 → 1 |
Nx → Ny is derived as 1 → Nxy/Nx |
Consider the following example.
SELECT *
FROM t1, t2
WHERE d1 = d2
AND c1 = 10;
If the value mapping between c1 and d1 is 1 → 5, then after applying the single‑table condition, the derived statistics framework adjusts the number of unique values for d1 to 5.
Assume a 1:20 mapping between c1 and d1. The Optimizer can use this information together with the predicates in the following query to estimate the number of unique values.
SELECT *
FROM d1,d2
WHERE d1 = d2
AND c1 = 1;
From the given information, the Optimizer derives the number of unique d1 values, which it determines to be 20.
Teradata Database automatically collects session‑level derived statistics for materialized global temporary and volatile tables that are populated with data during the current session. The system uses session‑level derived statistics in the same way it uses other derived statistics to make cardinality estimates for, and cost evaluations of, SQL requests. These session‑level statistics are maintained in the global memory for a session and are dropped when the session logs off.
For example, suppose you have a global temporary table gt1 and a volatile table vt1. The Optimizer retains and propagates the derived statistics from the successive INSERT … SELECT statements to the subsequent query made against gt1 and vt1.
INSERT INTO gt1
SELECT *
FROM t1,t2
WHERE a1=a2;
INSERT INTO gt1
SELECT *
FROM t3,t4
WHERE a3=a4;
INSERT INTO vt1
SELECT *
FROM t5,t6
WHERE a5=a6;
SELECT *
FROM gt1,vt1, orders, customers, …
WHERE …;
In this example, the Optimizer uses the statistics previously derived during the three preceding INSERT … SELECT operations for gt1 and vt1 in the succeeding query made against those tables.
Note that these are all individual SQL requests, not component statements of a multistatement request.
Also note the following counter indication: if there are existing statistics on a materialized global temporary table, then those statistics override any derived statistics.
Global temporary and volatile tables populated in the following ways qualify for session‑level derived statistics.
Session‑level histograms are also inherited if the source table in an INSERT … SELECT or CREATE TABLE AS … WITH DATA request is a single table with no predicates. Note that the inherited statistics do not override collected statistics, and if a table is updated by DELETE or UPDATE requests after it has been populated, the system invalidates its session‑level statistics.
The Optimizer propagates derived statistics across the multiple SQL statements of a multistatement request under the following scenario.
INSERT INTO t3
SELECT a1,a2,b1,b2
FROM t1,t2
WHERE a1=a2
;SELECT *
FROM t4,t3
WHERE a4=a3;
The system propagates the statistics derived for target table t3 from the INSERT … SELECT statement to t3 for the SELECT statement in the next part of this multistatement request.
If the cardinality for a table or index differs by more than n percent from the cardinality determined by a dynamic‑AMP sample, then that statistic is deemed to be stale.
See “Stale Statistics” on page 273 for further information.
Teradata Database changes the existing cardinality count to the cardinality determined from a dynamic‑AMP sample if all the following criteria are true.
The value of n is set in your Cost Profile using the RASHistogramDeviationPct flag or RASHistAbsDeviation flags or both (see “Optimizer Cost Profiles” on page 307). The default value is 10 percent because the guideline on when statistics should be refreshed is to recollect when there is a change in the number of rows for a table or index subtable of 10 percent or more.
When Teradata Database adjusts its cardinality estimate, it also adjusts its estimate of the number of unique values for rolling columns accordingly (see “Definitions of Rolling, Static, and Hybrid Columns” on page 293).
When both join index and base table statistics are available, the Optimizer uses the more current estimates of the 2 based on their respective collection timestamps.
Note that PARTITION statistics can also be collected on nonpartitioned and column‑partitioned tables, and the Optimizer can use those statistics to estimate or extrapolate table cardinalities.
The value for the column partition number of the system-derived column PARTITION for a column‑partitioned table is always 1, so collected statistics on a PARTITION column equate to the number of rows in each combined row partition.
Teradata Database does not support the bidirectional inheritance of multilevel PARTITION#Ln statistics.
For example, consider the following scenario.
A dynamic AMP sample is not useful for making a range estimate; therefore, the Optimizer uses the histogram that is available for (x1,y1) to make its cardinality estimate, which it then stores in the x1 entry of the derived statistics data structure to be propagated to the next stage of the optimization process.
The following table and join index definitions are used for the query that illustrates the flow of derived statistics usage by the Optimizer to generate more accurate cardinality estimates.
CREATE TABLE t1 (
a1 INTEGER,
b1 INTEGER,
c1 CHARACTER(5),
d1 DATE);
CREATE TABLE t2 (
a2 INTEGER PRIMARY KEY,
b2 INTEGER,
c2 CHARACTER(1) CHECK (c2 IN ('M', 'F')),
d2 DATE);
CREATE TABLE t3 (
a3 INTEGER,
b3 INTEGER,
c3 CHARACTER(5),
d3 INTEGER);
CREATE JOIN INDEX ji_t1 AS
SELECT a1, d1
FROM t1
WHERE b1 > 10
AND c1 = 'Teradata');
CREATE JOIN INDEX aji_t3 AS
SELECT a3, d3, COUNT(*)
FROM t3
WHERE b3 < 100
GROUP BY 1,2);
Given these table and join index definitions, the flow chart that follows later in this topic shows how the Optimizer uses derived statistics to more accurately estimate cardinalities for the following query.
SELECT *
FROM t1, t2, t3
WHERE b1 > 10
AND c1 = ‘Teradata’
AND b3 < 50
AND d1 = d2
AND a2 = a3
AND d2 = d3;
The stages of deriving various statistics for this query are as follows.
1 Refine and derive base table statistics using join index statistics.
2 Join tables t1 and t2, consuming the term d1=d2 and producing the interim join relation R1.
d1 and d2 are then merged into an EquiSet (see “Using Join Predicate Redundancy to Derive Column Demographics After Each Binary Join” on page 236), which takes the smaller of the 2 unique values cardinalities as MIN(200,1500), or 200.
The entries for d1 and d2 are then removed from the derived statistics set.
3 Join table t3 with the interim join relation R1, consuming the terms a2=a3 and d2=d3 and producing the following set of derived statistics cardinalities for join relation R2.
R2 |
|
Column Set |
Number of Unique Values |
(b1) |
1000 |
(c1) |
2000 |
(d1, d2, d3) |
200 |
(a2, a3) |
100 |
(a2, d2) |
600 |
(a3, d3) |
600 |
where:
Column heading … |
Represents the … |
Col |
column set for which the number of unique values is derived by the derived statistics subsystem. |
NUV |
number of unique values for a column set as derived by the derived statistics subsystem. |