The Optimizer acquires derived statistics in several different ways. The topics that follow introduce the more common methods of deriving statistics.
Note that the description of derived statistics continues with “Discovering Hash and Join Indexes for Derived Statistics” on page 240, a topic complex enough to be treated separately.
If there are single‑table sparse join indexes that either fully or partially cover the single‑table predicates of a relation, the Optimizer can use them to derive the demographics of the residual columns. This mechanism provides an automated way of deriving column correlations using single‑table join indexes.
For example, suppose you have defined the following join index.
CREATE JOIN INDEX ji AS
SELECT c1, d1
FROM t1
WHERE x1 > 10
AND y1 =10;
You then submit the following query.
SELECT t1.c1, COUNT (*)
FROM t1, t2
WHERE t1.d1 = t2.d2
AND t1.x1 > 10
AND t1.y1 = 10
GROUP BY 1;
Using the statistics on ji, the system can derive the demographics of the columns t1.c1 and t1.d1 after applying the single-table predicates t1.x1 > 10 and t1.y1 = 10. The newly derived demographics can then be used in join planning for the join column t1.d1 and in aggregate estimates to determine the number of groups for the grouping column t1.c1.
Note that partially covering join indexes can also be used to derive the demographics in all applicable cases. For example, suppose you define the following join index.
CREATE JOIN INDEX ji AS
SELECT c1, d1
FROM t1
WHERE x1 > 10;
You then submit the following query.
SELECT t1.c1, COUNT (*)
FROM t1, t2
WHERE t1.d1 = t2.d2
AND t1.x1 > 10
AND t1.y1 = 10
GROUP BY 1;
In this case, the demographic information from the partially covering join index is used for columns c1 and d1. If the demographic information is available from both the base table and the join index, then the join index information is given higher precedence because it also captures the correlation information with respect to some single‑table predicates.
The Optimizer can derive the unique values of grouping columns from aggregate join indexes. The derived values can then be used in join planning. This is another strategy to derive the column correlation information among the grouping column set automatically, and also after applying the single‑table predicates.
For example, suppose you have defined the following aggregate join index.
CREATE JOIN INDEX ji AS
SELECT c1, d1, SUM (b1)
FROM t1
WHERE x1 > 10
AND y1 = 10
GROUP BY c1, d1;
You then submit the following query.
SELECT *
FROM t1, t2
WHERE t1.c1 = t2.c2
AND t1.d1 = t2.d2
AND x1 > 10
AND y1 = 10;
The cardinality of the aggregate join index is the number of unique values of the columns (t1.c1, t1.d1). If the unique values for these columns are already available from base table statistics, the system updates them with the information from the aggregate join index; otherwise, it creates new derived statistics entries. The unique values can then be used in join planning for estimating join cardinality, rows per value, skew detection, and so on.
Note that partially covered aggregate join indexes are handled in the same way as non-aggregate single-table sparse join indexes, and that the statistics from aggregate join indexes are used automatically if that index replaces a derived table or view.
The Optimizer analyzes all queries for the use of an aggregate join index. In determining if an aggregate join index can be used to process a query, the Optimizer applies the following rules.
For example, the aggregate expression
SUM(CASE WHEN x1=1
         THEN 1
         ELSE 0
    END)
is converted to the equivalent expression
CASE WHEN x1=1
     THEN SUM(1)
     ELSE SUM(0)
END
This predicate condition …	Is converted to this equivalent expression …

expression >= 'YYYY0101'
    EXTRACT(YEAR FROM expression) >= YYYY

expression > 'YYYY1231'
    EXTRACT(YEAR FROM expression) > YYYY

expression < 'YYYY0101'
    EXTRACT(YEAR FROM expression) < YYYY

expression <= 'YYYY1231'
    EXTRACT(YEAR FROM expression) <= YYYY

expression >= 'YYYYMM01'
    EXTRACT(YEAR FROM expression) > YYYY
    OR (EXTRACT(YEAR FROM expression) = YYYY
        AND EXTRACT(MONTH FROM expression) >= MM)

expression > 'YYYYMMLD', where LD is the last day of the month specified by MM
    EXTRACT(YEAR FROM expression) > YYYY
    OR (EXTRACT(YEAR FROM expression) = YYYY
        AND EXTRACT(MONTH FROM expression) > MM)

expression <= 'YYYYMMLD', where LD is the last day of the month specified by MM
    EXTRACT(YEAR FROM expression) < YYYY
    OR (EXTRACT(YEAR FROM expression) = YYYY
        AND EXTRACT(MONTH FROM expression) <= MM)

expression < 'YYYYMM01'
    EXTRACT(YEAR FROM expression) < YYYY
    OR (EXTRACT(YEAR FROM expression) = YYYY
        AND EXTRACT(MONTH FROM expression) < MM)
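The boundary-date conversions above follow a simple pattern. The following Python fragment is an illustrative sketch of that pattern for the `>=` and `>` rows (the function name and the `expr` placeholder are invented for this example; this is not Teradata's implementation).

```python
# Illustrative sketch: rewrite a character-date range predicate
# ("expression <op> 'YYYYMMDD'") into the equivalent EXTRACT-based
# predicate, mirroring the first rows of the conversion table.

def to_extract_form(op, boundary):
    yyyy, mm, dd = boundary[:4], int(boundary[4:6]), int(boundary[6:8])
    if op == '>=' and mm == 1 and dd == 1:        # expr >= 'YYYY0101'
        return f"EXTRACT(YEAR FROM expr) >= {yyyy}"
    if op == '>' and mm == 12 and dd == 31:       # expr > 'YYYY1231'
        return f"EXTRACT(YEAR FROM expr) > {yyyy}"
    if op == '>=' and dd == 1:                    # expr >= 'YYYYMM01'
        return (f"EXTRACT(YEAR FROM expr) > {yyyy} OR "
                f"(EXTRACT(YEAR FROM expr) = {yyyy} AND "
                f"EXTRACT(MONTH FROM expr) >= {mm})")
    raise ValueError("boundary is not on a year or month edge")

print(to_extract_form('>=', '19990101'))   # -> EXTRACT(YEAR FROM expr) >= 1999
```

The remaining rows of the table (`<`, `<=`, and the last-day-of-month forms) are symmetric.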

In general, the Optimizer uses the following rules to decide whether and how an aggregate join index can be used to process a query.
For this case, aggregate expressions in the query are mapped to the corresponding aggregate columns in the aggregate join index and the request is rewritten to a simple retrieve from the index as in the following example.
CREATE TABLE t1 (
a1 INTEGER,
b1 INTEGER,
c1 INTEGER,
d1 INTEGER);
CREATE TABLE t2 (
a2 INTEGER,
b2 INTEGER,
c2 INTEGER,
d2 INTEGER);
CREATE JOIN INDEX aji AS
SELECT a1, b1, MIN(c1) AS m
FROM t1
GROUP BY a1, b1;
SELECT a1, MIN(c1)+b1
FROM t1
GROUP BY a1, b1;
is rewritten to
SELECT a1, m+b1
FROM aji;
For this case, the Optimizer rewrites the request to aggregate on the aggregate join index. For example, using the same table definitions as in the previous example, Teradata Database rewrites the following request as the request that follows it.
SELECT a1, MIN(c1)
FROM t1
GROUP BY a1;
SELECT a1, MIN(m)
FROM aji
GROUP BY a1;
For this case, Teradata Database rewrites the request to aggregate on the aggregate join index. For example, using the same table definitions as were used previously, the Optimizer rewrites the following request as the request that follows it.
SELECT a1, b1, MIN(c1)
FROM t1, t2
WHERE b1=b2
GROUP BY a1, b1;
SELECT a1, b1, MIN(m)
FROM aji, t2
WHERE b1=b2
GROUP BY a1, b1;
For this case, Teradata Database rewrites the request to an aggregate join query between the aggregate join index and the remaining table in the original request.
For example, the Optimizer rewrites the following request as the request that follows it.
SELECT a1, a2, MIN(c1)
FROM t1,t2
WHERE b1=b2
GROUP BY a1,a2;
SELECT a1, a2, MIN(m)
FROM aji, t2
WHERE b1=b2
GROUP BY a1, a2;
For example, suppose you define the following aggregate join index aji2.
CREATE JOIN INDEX aji2 AS
SELECT a1,b1, MIN(c1) AS m1, MIN(d1) AS m2
FROM t1
GROUP BY a1,b1;
The Optimizer can rewrite the following SELECT request as the request that follows it.
SELECT a1,b1, MIN(CASE WHEN a1=b1
THEN c1
ELSE d1
END)
FROM t1
GROUP BY 1,2;
SELECT a1,b1, CASE WHEN a1=b1
THEN m1
ELSE m2
END
FROM aji2;
Similarly, the Optimizer can rewrite the following SELECT request as the request that follows it.
SELECT a1, MIN(CASE WHEN a1=b1
THEN c1
ELSE d1
END)
FROM t1
GROUP BY 1;
SELECT a1, MIN(CASE WHEN a1=b1
THEN m1
ELSE m2
END)
FROM aji2
GROUP BY a1;
For example, the Optimizer rewrites the following SELECT request as the SELECT request that follows it.
SELECT a1, MIN(DISTINCT b1)
FROM t1
GROUP BY a1;
SELECT a1, MIN(b1)
FROM aji
GROUP BY a1;
Note: DISTINCT is redundant when used with MIN and MAX, but there is no optimization to process MIN/MAX DISTINCT with one aggregation step. Therefore, the Optimizer uses the aggregate join index to cover the first aggregation of a MIN/MAX DISTINCT operation.
For example, the Optimizer can also consider the aggregate join index for a request that specifies GROUP BY ROLLUP, such as the following.
SELECT a1, b1, MIN(c1)
FROM t1
GROUP BY ROLLUP (a1,b1);
For example, consider the following SELECT requests.
SELECT a1, b1, MIN(1)
FROM t1
GROUP BY a1, b1;
SELECT a1, b1, 1
FROM t1
GROUP BY a1, b1;
The Optimizer rewrites both of these requests to use the aggregate join index aji as follows.
SELECT a1, b1, 1
FROM aji;
CHECK and referential integrity constraints can also provide useful information to the Optimizer in the absence of statistics.
For example, consider a column called patient_sex, which can have only 2 values: M and F. Generally, users enforce the sanity of this column with a database CHECK constraint such as patient_sex IN ('M', 'F').
If there are no statistics on this column and a query specifies the predicate patient_sex = 'M', the system would normally assume by default that 10% of the rows qualify because of the equality condition; but because this column can have only 2 values, a more reasonable default is 50%.
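That reasoning can be written out as a minimal sketch (the 10% figure is the default cited above; the function and constant names are invented for illustration).

```python
# Sketch: estimate equality-predicate selectivity with no statistics.
# A CHECK ... IN constraint bounds the number of distinct values, so
# 1/N is a better estimate than the generic default.

DEFAULT_EQ_SELECTIVITY = 0.10          # assumed default for an equality condition

def eq_selectivity(check_in_values=None):
    if check_in_values:                # e.g. CHECK (patient_sex IN ('M', 'F'))
        return 1.0 / len(check_in_values)
    return DEFAULT_EQ_SELECTIVITY

print(eq_selectivity(('M', 'F')))      # patient_sex = 'M' -> 0.5, not 0.1
```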
The Optimizer also considers CHECK constraints such as open or closed ranges, and
IN lists when statistics are not available. Open ranges help in some scenarios when
a query predicate can close the range. For example, if a CHECK constraint specifies
the open range x1 > 10
, and a query specifies the predicate x1 < 20
, their combination produces a closed range that can then be used to derive the column
demographics.
Similarly, if there are no statistics on the child table in a Primary Key-Foreign Key (PK-FK) relationship, and statistics exist for the parent table PK column, many of its demographics, such as the number of unique values, can be derived and used for the FK column in the child table. This information can be used for single-table estimates and in join planning for join cardinality and rows per value estimates. These kinds of estimates can also be useful for large child tables in situations where collecting statistics is prohibitively expensive.
This mechanism can be effectively used in conjunction with non‑enforced referential integrity constraints (also known as soft referential integrity constraints) to provide child table demographic information to the Optimizer without collecting statistics.
Note that this mechanism assumes that the child table has all the values of the parent table. If there is only a subset of values, or if the value set in this join column is significantly skewed, then you should collect statistics on the child column to avoid skewed redistributions and underestimates of cardinalities.
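The inheritance step above can be sketched as follows (a hypothetical helper, illustrating the rule under the stated assumption that the child table holds all parent values).

```python
# Sketch: if no statistics exist on the child FK column, inherit the
# parent PK column's unique-value count through the RI constraint.
# Valid only if the child table contains all parent values and the join
# column is not significantly skewed (see the caveat above).

def fk_unique_values(child_unique_values, parent_pk_unique_values):
    if child_unique_values is None:        # no statistics collected on the child
        return parent_pk_unique_values     # derived from the RI constraint
    return child_unique_values             # collected statistics take precedence

print(fk_unique_values(None, 1000))        # -> 1000, inherited from the parent PK
```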
The derived statistics framework collects this kind of information from CHECK and referential constraints in the prejoin planning stage of query optimization and propagates them from there to the join planning stage.
The following table definitions are used for the query that illustrates the flow of derived statistics usage by the Optimizer to generate more accurate cardinality estimates.
CREATE TABLE t1 (
a1 INTEGER,
b1 INTEGER,
c1 CHARACTER(5),
d1 DATE);
CREATE TABLE t2 (
a2 INTEGER PRIMARY KEY,
b2 INTEGER,
c2 CHARACTER(1) CHECK (c2 IN ('M', 'F')),
d2 DATE);
CREATE TABLE t4 (
a4 INTEGER REFERENCES t2 (a2),
b4 INTEGER,
c4 CHARACTER(1),
d4 DATE );
Given these table definitions, the stages that follow show how the Optimizer uses derived statistics based on CHECK and referential integrity constraints to more accurately estimate cardinalities for the following query.
SELECT *
FROM t1, t2, t4
WHERE c1 = 'MSFT'
AND b1 > 30
AND b4 = b1
AND a1 = a4
AND c2 = c4;
The stages of deriving various statistics for this query are as follows.
1 Refine and derive base table statistics using CHECK and referential integrity constraints.
Using this RI constraint to infer that the unique cardinalities must be identical for the 2 columns, the system inserts the number of unique values for t2.a2 into the derived statistics for t4.a4 (see “Statistical Inheritance by Hash and Join Indexes” on page 220 and “Using CHECK Constraints and Referential Integrity Constraints to Derive Column Demographics” on page 227).
2 Join tables t1 and t4, consuming the terms b4=b1 and a1=a4 and producing the interim join relation R1.
3 Join interim relation R1 with t2, consuming the term c2=c4 and producing the join relation R2 with the following set of derived statistics cardinalities.
R2

Col	NUV
(b1)	1000
(c1)	2000
(a1, a4)	100
(b1, b4)	1000
(c2, c4)	2

where:

Column heading … | Represents the …
Col | column set for which the number of unique values is derived by the derived statistics subsystem.
NUV | number of unique values for a column set as derived by the derived statistics subsystem.
Single‑table estimation logic determines the probable number of qualified unique values and qualified high modal frequency, along with the number of qualified rows, for every predicate. The Optimizer can use this information for cases where a column specified in the single‑table predicate is also specified in subsequent operations such as joins and aggregations.
There are 2 ways to derive this information.
By using the base table statistics, Teradata Database can derive the column demographics using the single‑table predicate on the column specified in the predicate. However, the derived information might not be adjusted later based on other single‑table predicates. This is in line with the assumption of column independence in the absence of column correlation information.
For example, suppose you have the following query.
SELECT *
FROM t1, t2
WHERE t1.d1 = t2.d2
AND t1.d1 BETWEEN '19990101' AND '20000101'
AND t1.c1 > 10;
Using base table statistics, the qualified number of unique values and qualified high modal frequency are derived and saved in the derived statistics entry for column t1.d1.
On the other hand, by using the available single‑table join indexes, the Optimizer can get the demographics of the columns after applying all single‑table predicates.
For example, suppose you have defined the following join index to support the previous query.
CREATE JOIN INDEX ji AS
SELECT t1.d1
FROM t1
WHERE t1.d1 BETWEEN '19990101' AND '20000101'
AND t1.c1 > 10;
If an interval histogram is available for the join index column t1.d1, then it is given higher precedence and captured in the derived statistics because it reflects the demographics of the column t1.d1 after both single‑table predicates have been applied. See “Using Join Index Statistics to Estimate Single‑Table Expression Cardinalities” on page 260 for information about how the Optimizer can use single‑table join index statistics to estimate the cardinalities of complex expressions that reference base table columns.
The derived information is then reused in join planning stages such as join cardinality estimation, rows per value estimation, and skew detection.
In general, the Optimizer considers any predicate that can be used for cardinality estimates for a query plan with the exception of complicated predicates. The following table provides some examples of predicates the Optimizer does consider and the statistics it uses to make cardinality estimates for those predicates.
Predicate | Number of Unique Values | High Modal Frequency
IS NULL | Self-defined. | Self-defined.
IS NOT NULL | TotalRows - Number of Nulls |
Single equality condition | Self-defined. |
Index with equality conditions on all columns | Self-defined. | Self-defined.
NE | TotalValues - 1 | HighModeFreq of the distribution excluding the HighModeFreq for this value.
Range term | Values count derived from the qualified intervals. | HighModeFreq from the qualified intervals.
BETWEEN | LowValues + HighValues - TotalValues | HighModeFreq from the qualified range intervals.
IN list | The number of elements in the ORed list. | The maximum frequency of all the ORed elements.
NOT IN term | TotalValues - Number of elements in the NOT IN list | The HighMode after excluding all IN list elements.
An example of a term type that is not considered is ORed terms that reference multiple columns, such as x1=10 OR y1>20.
If single-table predicates such as EQ conditions, IN lists, ORed predicates, or closed ranges are specified in a query, Teradata Database can derive some information from them, such as the number of unique values, and the derived information can then be used later in join planning if column statistics are not available.
An example of where such statistics would be derived is the following query.
SELECT *
FROM t1, t2
WHERE t1.d1 = t2.d2
AND t1.d1 IN (10, 20);
Derived statistics captures and propagates the number of unique values information to the Join Planner based on certain single‑table predicates even when no statistics have been collected on this column.
The demographics of the join columns are required after applying the single‑table predicates to a join request. The Optimizer can use multicolumn statistics to derive the demographics of the join columns using some single‑table predicates. This is another automated way of deriving column correlations using multicolumn statistics.
For example, suppose you submit the following query.
SELECT *
FROM t1, t2
WHERE t1.d1 = t2.d2
AND t1.x1 > 10;
The Join Planner needs to know the demographics of the join column t1.d1 after applying the single-table predicate t1.x1 > 10. Derived statistics derives the demographics of t1.d1 if there are multicolumn statistics on (x1, d1) for t1. For PPI tables, if there is row partition elimination and multicolumn PARTITION statistics are available, then the demographics of the join columns are derived based on the qualified partitions.
Note that the order of the columns in multicolumn statistics is important for these estimates. The columns are reordered internally based on their ascending field IDs, irrespective of the order you specify for collecting multicolumn statistics. So if column d1 has a smaller field ID than column x1, the multicolumn statistics are ordered internally as (d1, x1). In this case, it is not possible to derive the demographics of d1 for a given predicate on x1. Note that for multicolumn PARTITION statistics, the PARTITION column is always the first column because it always has an internal field ID of 0.
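One way to picture the ordering rule is as a prefix test. The following is a sketch of the behavior described above, not Teradata internals; the function name is invented for this example.

```python
# Sketch: multicolumn statistics are stored with columns ordered by
# field ID.  Demographics of the remaining columns can be derived only
# when the predicate columns form a leading prefix of that ordering.

def can_derive(stat_columns_in_field_id_order, predicate_columns):
    k = len(predicate_columns)
    return set(stat_columns_in_field_id_order[:k]) == set(predicate_columns)

# Stats ordered (x1, d1): a predicate on x1 lets d1's demographics be derived.
print(can_derive(('x1', 'd1'), ('x1',)))   # -> True
# Stats ordered (d1, x1): a predicate on x1 alone does not.
print(can_derive(('d1', 'x1'), ('x1',)))   # -> False
```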
Some of the single-table predicates the Optimizer considers for deriving the column correlations are equality conditions (x1=10), IN lists (x1 IN (10, 20, 30)), simple ORed lists on the same column (x1=10 OR x1=20), range predicates (x1 BETWEEN 10 AND 30, x1 > 10), and so on.
Also see “Using Join Index Statistics to Estimate Single‑Table Expression Cardinalities” on page 260 for information about how the Optimizer can use single‑table join index statistics to estimate the cardinalities of complex expressions that reference base table columns.
The derived statistics framework discovers hierarchical relationships for denormalized tables using statistics. The relationships are saved as value mappings in the derived statistics. The only mappings used to adjust demographics are those with 1 → n relationships.
A combination of single and multicolumn statistics is required to detect these relationships.
A change to a single column cascades through the entire hierarchical chain. For example, if the value mapping for x → y is 1 → 5 and the value mapping for y → z is 1 → 10, then if one value is removed from x, 5 values are removed from y, and 5 × 10 = 50 values are removed from z.
More concretely, if the relationship between region → nation is discovered to be 1 → 5, then if one region is selected, the Optimizer detects that only 5 nations qualify.
In other words, if one region is disqualified by a single‑table predicate, the Optimizer removes 5 nations from the nation column.
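The cascade arithmetic can be sketched like this (an illustrative helper; the region → nation 1 → 5 ratio is the example used in the text).

```python
# Sketch: 1 -> n value mappings along a hierarchy chain multiply, so
# selecting k values at the top level qualifies k * n1 * n2 * ...
# values at each lower level.

def qualified_values(selected_at_top, ratios):
    counts = [selected_at_top]
    for n in ratios:                   # one 1 -> n link per hierarchy level
        counts.append(counts[-1] * n)
    return counts

# region -> nation is 1 -> 5: selecting one region qualifies 5 nations,
# selecting two regions qualifies 10.
print(qualified_values(1, [5]))        # -> [1, 5]
print(qualified_values(2, [5]))        # -> [2, 10]
```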
The following example illustrates this logic. Consider the denormalized dimension table customer_nation_region, which is defined as follows.
CREATE SET TABLE cust_nation_region (
c_custkey INTEGER,
c_name VARCHAR(25) CHARACTER SET LATIN CASESPECIFIC,
c_address VARCHAR(40) CHARACTER SET LATIN CASESPECIFIC,
c_nationkey INTEGER,
c_phone CHAR(15) CHARACTER SET LATIN CASESPECIFIC,
c_acctbal DECIMAL(15,2),
c_mktsegment CHAR(10) CHARACTER SET LATIN CASESPECIFIC,
c_comment VARCHAR(117) CHARACTER SET LATIN CASESPECIFIC,
c_maritalstatus CHAR(1) CHARACTER SET LATIN NOT CASESPECIFIC,
n_nationkey INTEGER,
n_name CHAR(25) CHARACTER SET LATIN CASESPECIFIC,
n_comment VARCHAR(152) CHARACTER SET LATIN CASESPECIFIC,
r_regionkey INTEGER,
r_name CHAR(25) CHARACTER SET LATIN CASESPECIFIC,
r_comment VARCHAR(152) CHARACTER SET LATIN CASESPECIFIC)
PRIMARY INDEX (c_custkey);
In this example, the relationships are derived from a combination of single-column and multicolumn statistics on the nation and region columns. Among the relationships discovered is the 1 → 5 mapping between region and nation.
You can see an example of this in the following query, which specifies a predicate that selects only one regionkey value, and its EXPLAIN text.
EXPLAIN SELECT n_nationkey
FROM cust_nation_region
WHERE r_regionkey = 1
GROUP BY 1;
Explanation
 1) First, we lock a distinct TPCD_OCES3."pseudo table" for read on a
    RowHash to prevent global deadlock for
    TPCD_OCES3.cust_nation_region.
 2) Next, we lock TPCD_OCES3.cust_nation_region for read.
 3) We do an all-AMPs SUM step to aggregate from
    TPCD_OCES3.cust_nation_region by way of an all-rows scan with a
    condition of ("TPCD_OCES3.cust_nation_region.R_REGIONKEY = 1"),
    grouping by field1 (TPCD_OCES3.cust_nation_region.N_NATIONKEY).
    Aggregate Intermediate Results are computed globally, then placed
    in Spool 3. The size of Spool 3 is estimated with low confidence
    to be 5 rows (525 bytes). The estimated time for this step is
    0.62 seconds.
 -> The total estimated time is 0.63 seconds.
From the known hierarchical relationship between region and nation, which is 1 → 5, the predicted number of nationkey rows is 5, which is verified by the spool size estimate of 5 rows in the EXPLAIN text. By generalizing, if you were to write a request that selected 2 regionkey values, there would be 10 nationkey rows in the result set, and so on.
If there are multiple join predicates, and the join columns have some degree of correlation, multicolumn statistics are necessary to more accurately estimate join cardinality, rows per value, skew detection, and related statistics. Without having multicolumn statistics, and assuming column independence, the Optimizer multiplies the individual numbers of unique values to get an estimate of the combined number of unique values. However, if multicolumn statistics exist that are a superset of the join columns, they might provide more useful information about the column correlations than the column independence assumption would provide.
For example, consider the following query.
SELECT *
FROM t1, t2
WHERE t1.d1 = t2.d2
AND t1.x1 = t2.x2;
The system requires multicolumn statistics on (t1.d1, t1.x1) and (t2.d2, t2.x2) to be able to make accurate cardinality and costing estimates. If there are individual statistics on t1.d1, with 100 unique values, and on t1.x1, with 50 unique values, then assuming column independence, the individual unique value counts are multiplied to calculate the combined maximum number of unique values, 100 * 50 = 5,000, with the total number of rows as an upper bound.
The derived statistics subsystem identifies such opportunities in the prejoin planning stage and derives dummy multicolumn statistics based on supersets. For example, if there is a multicolumn statistics collection that is a superset of the join columns, such as (x1, y1, d1), with only 2,000 unique values, the Optimizer can cap the estimated unique value cardinality of 5,000 to a more accurate 2,000 by creating a derived statistics entry (x1, d1) with 2,000 unique values, and then propagate the statistic to join planning.
Similarly, if there are individual entries and multicolumn statistics entries such as (x1), (d1), and (x1, d1), and single-table predicates such as x1 IN (1,2) AND d1 < 20, the Optimizer can update the demographics of the single-column entries while making its single-table estimates. Based on the updated information from the single-column entries, the multicolumn entry of derived statistics is updated and propagated.
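The capping arithmetic can be written out as a short sketch (a hypothetical helper; the 100 * 50 = 5,000 and 2,000 figures come from the example above).

```python
# Sketch: under column independence, the combined unique-value estimate
# is the product of the individual counts; a superset multicolumn
# statistic (and the table cardinality) caps the estimate.

def combined_unique_values(individual_counts, superset_count=None, total_rows=None):
    estimate = 1
    for v in individual_counts:
        estimate *= v                              # independence: 100 * 50 = 5000
    if superset_count is not None:
        estimate = min(estimate, superset_count)   # e.g. (x1, y1, d1) has 2000
    if total_rows is not None:
        estimate = min(estimate, total_rows)       # cannot exceed table rows
    return estimate

print(combined_unique_values([100, 50]))                       # -> 5000
print(combined_unique_values([100, 50], superset_count=2000))  # -> 2000
```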
The Optimizer assumes join uniformity when it estimates join cardinalities and costs. In other words, it assumes that every value from the left relation in the join finds a match in the right relation if the number of values in the left relation is less than the number of values in the right relation, and vice versa.
Based on this assumption, and given the join predicate t1.d1 = t2.d2, if t1.d1 has 100 unique values and t2.d2 has 50 unique values, then after the join, t1.d1 should have only 50 unique values. The Optimizer also considers multicolumn statistics for making adjustments. For example, given the predicate t1.x1=t2.x2 AND t1.y1=t2.y2, with (x1, y1) having 100 unique values and (x2, y2) having 50 unique values, the number of (x1, y1) unique values is adjusted to 50 after the join.
The derived statistics subsystem performs this kind of analysis and adjusts the demographics of the join columns such as the number of unique values and the high modal frequency after each binary join and then propagates the adjusted demographic information to the next stages of the join planning process.
If column correlation information is available, the Optimizer uses it to adjust the other column demographics based on the adjustments to the join column. Otherwise, the join column is assumed to be independent of the other columns.
Assume that t1.d1 has 100 unique values, t2.d2 has 50 unique values, and t3.d3 has 200 unique values.
Suppose you submit the following query.
SELECT t3.d3, COUNT (*)
FROM t1, t2, t3
WHERE t1.d1 = t2.d2
AND t1.d1 = t3.d3
GROUP BY 1;
Assuming the join order is (t1 x t2) x t3, after the first join, the derived statistics subsystem adjusts the number of unique values of t1.d1 to the minimum of the unique value counts of t1.d1 and t2.d2, which is MIN(100, 50) = 50.
After the second join, the derived statistics subsystem adjusts the number of unique values of t3.d3 to 50, which is then used in the final aggregate estimate because t3.d3 is specified as a grouping column.
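The two adjustments in this example can be sketched as follows (an invented helper that mirrors the minimum rule, not Teradata's implementation).

```python
# Sketch: after each binary join, the unique-value count of each equated
# join column is lowered to the minimum across the predicate's sides.

def adjust_after_join(stats, left_col, right_col):
    m = min(stats[left_col], stats[right_col])
    stats[left_col] = stats[right_col] = m
    return stats

stats = {'t1.d1': 100, 't2.d2': 50, 't3.d3': 200}
adjust_after_join(stats, 't1.d1', 't2.d2')   # join 1: t1.d1 = t2.d2
adjust_after_join(stats, 't1.d1', 't3.d3')   # join 2: t1.d1 = t3.d3
print(stats['t3.d3'])                        # -> 50, used for the GROUP BY estimate
```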
Some join predicates can become redundant after the system has made one or more binary joins. A predicate is redundant when it adds nothing to the overall selectivity for a query because its effect is equivalent to one or more other predicates that are also specified in that query.
The redundancy can result from either transitive closure or from user‑defined query conditions. If such redundant predicates are not properly identified and handled during cardinality estimates and costing, they can have a negative effect on the optimization of the query, possibly leading to non‑optimal query plans.
For example, suppose you submit the following query.
SELECT *
FROM t1, t2, t3
WHERE t1.d1 = t2.d2
AND t1.d1 = t3.d3;
Transitive closure (see “Predicate Simplification” on page 94) derives a new predicate t2.d2 = t3.d3 for this request. Assuming the join order (t1 x t2 → j1) x t3, the join predicates for the second join are j1.d1 = t3.d3 and j1.d2 = t3.d3.
While estimating the number of unique values of (d1,d2) for join j1, the Optimizer needs to be aware that these 2 columns had already been equated in the previous join, and it should not multiply their individual unique values to get the combined number of unique values. Instead, it should use MIN(j1.d1, j1.d2) as the combined number of unique values.
The derived statistics infrastructure builds the appropriate entries by combining all the connected joined columns into EquiSets after every binary join. This way, the subsequent joins seamlessly handle the redundancy.
In join planning, an EquiSet is a set of columns that was equated in predicates from a previous join operation in the same query. Propagating EquiSets to subsequent join operations for reuse in a query is a fundamental component of the derived statistics framework.
Continuing the preceding example, an EquiSet derived statistics entry (j1.d1, j1.d2) exists after join j1, with the minimum number of values of d1 and d2. When a subsequent join requests the unique values for the combination (j1.d1, j1.d2), the Optimizer automatically uses the existing EquiSet entry.
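An EquiSet can be modeled with a small union-find structure. The following is an illustrative sketch of the concept, not Teradata's internal representation; the class and method names are invented for this example.

```python
# Sketch: columns equated by join predicates are merged into one EquiSet
# whose unique-value count is the minimum over its members; distinct
# EquiSets are still combined under the independence assumption.

class EquiSets:
    def __init__(self, unique_values):
        self.values = dict(unique_values)          # column -> unique values
        self.parent = {c: c for c in self.values}

    def _find(self, c):
        while self.parent[c] != c:
            c = self.parent[c]
        return c

    def equate(self, a, b):                        # consume predicate a = b
        ra, rb = self._find(a), self._find(b)
        if ra != rb:
            self.parent[rb] = ra
            self.values[ra] = min(self.values[ra], self.values[rb])

    def unique_values(self, *cols):                # combined count for a column set
        result = 1
        for root in {self._find(c) for c in cols}:
            result *= self.values[root]
        return result

es = EquiSets({'d1': 100, 'd2': 80, 'd3': 200})
es.equate('d1', 'd2')                              # equated in join j1
print(es.unique_values('d1', 'd2'))                # -> 80: MIN(100, 80), not 100 * 80
```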
As demonstrated in the preceding topics, information such as the number of unique values and the high modal frequency can be taken or derived from multiple sources such as single‑table join indexes, multicolumn statistics, dynamic AMP samples, and interval histograms, among others.
It is not trivial to determine which of the available sources is likely to be the most useful or trustworthy. The following criteria all play an important role in making the determination.
For example, it is possible to have a join index producing non-correlated demographic information covering multiple single-table predicates, while at the same time there might be a multicolumn statistic covering one single-table predicate that produces highly correlated information for a join column.
Because of issues such as this, it is not possible to define precedence rules based on heuristics. To work around this restriction, the Optimizer quantifies the information and defines precedence dynamically based on the derived quantities.
This translates to using the information source that provides the fewest unique values.
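That dynamic precedence rule reduces to a one-line selection, sketched here with a hypothetical helper and example figures not taken from the text.

```python
# Sketch: among all sources that can supply a unique-value estimate for a
# column set, dynamic precedence chooses the source reporting the fewest
# unique values (the most selective, hence most informative, estimate).

def pick_source(estimates):
    """estimates: dict mapping source name -> unique-value count."""
    return min(estimates, key=estimates.get)

sources = {
    'base table histogram': 5000,      # example figures for illustration
    'single-table join index': 1200,
    'multicolumn statistics': 2000,
}
print(pick_source(sources))            # -> 'single-table join index'
```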
Column demographics are propagated to all temporary and committed joins using the derived statistics framework. As a result, demographics can be dynamically adjusted after each join, and the Join Planner does not need to reference base table interval histograms to retrieve the statistics.
If you specify complex predicates or outer joins in a query, they might become materialized into spool relations while the Optimizer is performing join planning within a query block.
Suppose you submit the following outer join query with a complex ON clause predicate.
SELECT *
FROM t1 INNER JOIN t2 ON (x1=x2)
LEFT OUTER JOIN t3 ON (x2 NOT IN (SELECT x4
FROM t4)
OR y3=10);
The ON clause predicate x2 NOT IN (SELECT x4 FROM t4) OR y3=10 in this query makes this a complex outer join (if a subquery is specified in an ON clause, the predicate is classified as either semicomplex or complex).
One of the requirements for processing a complex outer join is to have only one left relation, so the left relations (t1, t2) are materialized into a single spool relation before the system processes the complex join. After materializing the left relations, derived statistics propagates the projected column demographic information for the materialized instance to the subsequent stages of join planning.
To be able to make aggregate and ordered analytical cardinality estimates after a series of joins, the demographic information of the grouping columns must be propagated through the join operations until the final aggregations are done. The derived statistics subsystem adjusts the demographic information based on single-table and join predicates and then propagates it forward until final aggregation completes.
For example, suppose you submit the following query.
SELECT t1.x1, t2.x2, COUNT (*)
FROM t1, t2
WHERE t1.x1 =10
AND t1.y2 = t2.x2
GROUP BY 1, 2;
In the prejoin planning stage, derived statistics detects the single-table equality term on column x1 (t1.x1=10) and adjusts the number of unique values to 1. Also, while doing join planning, the number of unique values of t2.x2 is adjusted to the minimum number of values of (t1.y2, t2.x2). This improves the aggregate cardinality estimates, which can benefit the overall plan if this query block is a spooled derived table or a view.
Suppose you submit the following query.
SELECT *
FROM t1, (SELECT x2, x3, COUNT(*)
FROM t2, t3
WHERE x2=10
AND t2.x3=t3.x3
GROUP BY 1, 2) AS dt
WHERE t1.x1=dt.x3;
The derived statistics framework carries the column demographic information of the derived table dt for columns x2 and x3 by making adjustments based on the derived table predicates. The adjusted demographics are then propagated to the join planning for the outer block.