15.10 - Deriving Column Demographics - Teradata Database

Teradata Database SQL Request and Transaction Processing

prodname
vrm_release
15.10
category
Programming Reference
User Guide
featnum
B035-1142-151K

The Optimizer acquires derived statistics in several different ways. The next several topics introduce the more common acquisition methods of deriving statistics.

Note that the description of derived statistics continues with “Discovering Hash and Join Indexes for Derived Statistics” on page 240, which is complex enough to be segregated into its own topic.

If there are single‑table sparse join indexes that either fully or partially cover the single‑table predicates of a relation, the Optimizer can use them to derive the demographics of the residual columns. This mechanism provides an automated way of deriving column correlations using single‑table join indexes.

For example, suppose you have defined the following join index.

`     CREATE JOIN INDEX ji AS `
`      SELECT c1, d1 `
`      FROM t1 `
`      WHERE x1 > 10 `
`      AND   y1 =10;`

You then submit the following query.

`     SELECT t1.c1, COUNT (*)`
`     FROM t1, t2 `
`     WHERE t1.d1 = t2.d2 `
`     AND   t1.x1 > 10 `
`     AND   t1.y1 = 10 `
`     GROUP BY 1;`

Using the statistics on ji, the system can derive the demographics of the columns t1.c1 and t1.d1 after applying the single‑table predicates `t1.x1 > 10` and `t1.y1 = 10`. The newly derived demographics can then be used in join planning for the join column t1.d1 and in aggregate estimates to determine the number of groups for the grouping column t1.c1.

Note that partially covering join indexes can also used to derive the demographics in all applicable cases. For example, suppose you define the following join index.

`     CREATE JOIN INDEX ji AS`
`       SELECT c1, d1 `
`       FROM t1 `
`       WHERE x1 > 10; `

You then submit the following query.

`     SELECT t1.c1, COUNT (*) `
`     FROM t1, t2 `
`     WHERE t1.d1 = t2.d2 `
`     AND   t1.x1 > 10 `
`     AND   t1.y1 = 10 `
`     GROUP BY 1;`

In this case, the demographic information from the partially covering join index is used for columns c1 and d1. If the demographic information is available from both the base table and the join index, then the join index information is given higher precedence because it also captures the correlation information with respect to some single‑table predicates.

The Optimizer can derive the unique values of grouping columns from aggregate join indexes. The derived values can then be used in join planning. This is another strategy to derive the column correlation information among the grouping column set automatically, and also after applying the single‑table predicates.

For example, suppose you have defined the following aggregate join index.

`     CREATE JOIN INDEX ji AS`
`       SELECT a1, SUM (b1) `
`       FROM t1 `
`       WHERE x1 > 10 `
`       AND y1 = 10 `
`       GROUP BY c1, d1; `

You then submit the following query.

`     SELECT * `
`     FROM t1, t2 `
`     WHERE t1.c1 = t2.c2 `
`     AND   t1.d1 = t2.d2 `
`     AND   x1 > 10 `
`     AND   y1 = 10;`

The cardinality of the aggregate join index is the number of unique values of the columns (t1.c1, t2.d1). If the unique values for these columns are already available from base table statistics, the system updates them with the information from the aggregate join index; otherwise, it creates new derived statistics entries. The unique values can then be used in join planning for estimating join cardinality, rows per value, skew detection, and so on.

Note that partially covered aggregate join indexes are also handled in the same way as non‑aggregate single-table sparse join indexes, and that the statistics from aggregate join indexes are used automatically if that index replaces a derived table or view.

The Optimizer analyzes all queries for the use of an aggregate join index. In determining if an aggregate join index can be used to process a query, the Optimizer applies the following rules.

• An aggregate join index cannot be used to substitute for the base tables of a non-aggregate query.
• Like a non-aggregate join index, an aggregate join index qualifies for inclusion in the plan if it specifies either the same joins or a subset of the joins of the query.
• If an aggregate join index specifies a subset of joins of the query, then all the joined columns of the remaining join conditions on the tables of the aggregate join index must be specified in the GROUP BY clause and in the select list of the aggregate join index definition.
• Even if an aggregate join index specifies the same joins as the query, it does not qualify if its GROUP BY columns are neither the same nor a superset of the GROUP BY columns for the query.
• An aggregate join index must contain all the columns needed by the query for the tables that it replaces. When trying to match expressions from a query to the select list for a join index, the Optimizer applies the following rules.
• Addition and multiplication are commutative.
• COUNT(*) and COUNT(non-null_expression) are used interchangeably.
• A numeric expression from the join index that is converted to FLOAT can be used to match any expression of the same type defined with or without the conversion from the query.
• An AVERAGE aggregate expression in a query is converted to a SUM/COUNT expression when an aggregate join index substitutes for it.
• A SUM(0) expression in a query is converted to a constant having value 0.
• A SUM(constant) expression in a query matches with constant * COUNT(*) in a aggregate join index definition.
• The SUM or COUNT function from SUM(CASE expression) or COUNT(CASE expression) can be pushed to all the resulting expressions of the CASE expression in order to find a match. For example, the following expression can be converted to the expression following it in order to find a match.
• `      SUM(CASE WHEN x1=1 `
`               THEN 1 `
`               ELSE 0) `
`      `
`      CASE WHEN x1=1 `
`      THEN SUM(1) `
`      ELSE SUM(0)`
• Predicate conditions are converted according to the following conversion table.
•
 This predicate condition … Is converted to this equivalent expression … `expression >= ‘YYYY0101` `‘EXTRACT(YEAR FROM expression) >= YYYY’` `expression > ‘YYYY1231’` `‘EXTRACT(YEAR FROM expression) > YYYY’` `expression < ‘YYYY0101’` `‘EXTRACT(YEAR FROM expression) < YYYY’` `expression <= ‘YYYY1231’` `‘EXTRACT(YEAR FROM expression) <= YYYY’` `expression >= ‘YYYYMM01’` `‘EXTRACT(YEAR FROM expression) > YYYY ` `OR ` `(` `EXTRACT(YEAR FROM expression) = YYYY AND ` `EXTRACT(MONTH FROM expression) >= MM` `)’` `expression > ‘YYYYMMLD’` where LD is the last day of the month specified by MM `‘EXTRACT(YEAR FROM expression) > YYYY ` `OR ` `(` `EXTRACT(YEAR FROM expression) = YYYY ` `AND ` `EXTRACT(MONTH FROM expression) > MM` `)’` `expression <= ‘YYYYMMLD’` where LD is the last day of the month specified by MM `‘EXTRACT(YEAR FROM expression) < YYYY ` `OR ` `(` `EXTRACT(YEAR FROM expression) =YYYY AND ` `EXTRACT(MONTH FROM expression) <= MM` `)’` `expression < ‘YYYYMM01’` `‘EXTRACT(YEAR FROM expression) < YYYY ` `OR ` `(` `EXTRACT(YEAR FROM expression) = YYYY AND ` `EXTRACT(MONTH FROM expression) < MM` `)’`

In general, the Optimizer can use an aggregate join index to process a query if the following are true of the index.

• It satisfies the join index coverage rules.
• It contains the aggregate expressions specified by the request.
• It contains the column expressions specified in the GROUP BY clause for the request.
• The Optimizer uses the following general rules to decide whether and how an aggregate join index can be used to process a query.

• The aggregate join index has the same joins as the query or it is a broad aggregate join index that covers all the joins in the query.
• The GROUP BY clause of the aggregate join index specifies the same column expressions as the GROUP BY clause of the request.
• For this case, aggregate expressions in the query are mapped to the corresponding aggregate columns in the aggregate join index and the request is rewritten to a simple retrieve from the index as in the following example.

`     CREATE TABLE t1 (`
`       a1 INTEGER, `
`       b1 INTEGER, `
`       c1 INTEGER, `
`       d1 INTEGER);`
` `
`     CREATE TABLE t2 (`
`       a2 INTEGER, `
`       b2 INTEGER, `
`       c2 INTEGER, `
`       d2 INTEGER);`
` `
`     CREATE JOIN INDEX aji AS `
`       SELECT a1, b1, MIN(c1) AS m `
`       FROM t1`
`       GROUP BY a1, b1;`
` `
`     SELECT a1, MIN(c1)+b1 `
`     FROM t1 `
`     GROUP BY a1, b1;`

is rewritten to

`     SELECT a1, m+b1 `
`     FROM aji;`
• The GROUP BY clause of the aggregate join index specifies more column expressions than the GROUP BY clause of the request.
• For this case, the Optimizer rewrites the request to aggregate on the aggregate join index. For example, using the same table definitions as the previous bullet, Teradata Database rewrites the following request as the request that follows.

`     SELECT a1, MIN(c1) `
`     FROM t1 `
`     GROUP BY a1;`
` `
`     SELECT a1, MIN(m) `
`     FROM aji `
`     GROUP BY a1;`
• The aggregate join index has fewer joins than the query and contains the columns needed to join with the remaining tables in the request.
• The GROUP BY clause of the aggregate join index specifies the same column expressions as the GROUP BY clause of the request on the table covered by the index.
• For this case, Teradata Database rewrites the request to aggregate on the aggregate join index. For example, using the same table definitions as were used previously, the Optimizer rewrites the following request as the request that follows it.

`     SELECT a1, b1, MIN(c1) `
`     FROM t1, t2 `
`     WHERE b1=b2 `
`     GROUP BY a1, b1;`
` `
`     SELECT a1, b1, MIN(m) `
`     FROM aji, t2 `
`     WHERE b1=b2 `
`     GROUP BY a1, b1;`
• The GROUP BY clause of the aggregate join index specifies more column expressions than the GROUP BY clause of the request on the table covered by the aggregate join index.
• For this case, Teradata Database rewrites the request to an aggregate join query between the aggregate join index and the remaining table in the original request.

For example, the Optimizer rewrites the following request as the request that follows it.

`     SELECT a1, a2, MIN(c1) `
`     FROM t1,t2 `
`     WHERE b1=b2 `
`     GROUP BY a1,a2;`
` `
`     SELECT a1, a2, MIN(m) `
`     FROM aji, t2 `
`     WHERE b1=b2 `
`     GROUP BY a1, a2;`
• The Optimizer can use an aggregate expression in an aggregate join index to rewrite the aggregate of a CASE expression in a request.
• For example, suppose you define the following aggregate join index aji2.

`     CREATE JOIN INDEX aji2 AS`
`       SELECT a1,b1, MIN(c1) AS m1, MIN(d1) AS m2  `
`       FROM t1 `
`       GROUP BY a1,b1;`

The Optimizer can rewrite the following SELECT request as the request that follows it.

`     SELECT a1,b1, MIN(CASE WHEN a1=b1 `
`                            THEN c1 `
`                            ELSE d1 `
`                       END)`
`     FROM t1 `
`     GROUP BY 1,2;`
` `
`     SELECT a1,b1, CASE WHEN a1=b1 `
`                        THEN m1 `
`                        ELSE m2 `
`                   END `
`     FROM aji2;`

Similarly, the Optimizer can rewrite the following SELECT request as the request that follows it.

`     SELECT a1, MIN(CASE WHEN a1=b1 `
`                         THEN c1 `
`                         ELSE d1 `
`                    END) `
`     FROM t1 `
`     GROUP BY 1;`
` `
`     SELECT a1, MIN(CASE WHEN a1=b1`
`                         THEN m1 `
`                         ELSE m2 `
`                    END) `
`     FROM aji2 `
`     GROUP BY a1;`
• The aggregate join index can be used to process the first aggregation of COUNT/SUM/MIN/MAX(DISTINCT).
• For example, the Optimizer rewrites the following SELECT request as the SELECT request that follows it.

`     SELECT a1, MIN(DISTINCT b1) `
`     FROM t1 `
`     GROUP BY a1;`
` `
`     SELECT a1, MIN(b1) `
`     FROM aji `
`     GROUP BY a1;`

Note: DISTINCT is redundant when used with MIN and MAX, but there is no optimization to process MIN/MAX DISTINCT with one aggregation step. Therefore, the Optimizer uses the aggregate join index to cover the first aggregation of a MIN/MAX DISTINCT operation.

• The Optimizer can use an aggregate join index to process the first level of aggregation of a query with extended grouping. For example, the following SELECT request can use an aggregate join index to process the first level of aggregation (a1,b1) while the next level of aggregation (a1) and (b1) is processed on top of the result spool of the first level of aggregation.
• `     SELECT a1, b1, MIN(c1) `
`     FROM t1 `
`     GROUP BY ROLLUP (a1,b1);`
• The Optimizer can use an aggregate join index to process aggregations involving constants.
• For example, consider the following SELECT requests.

`     SELECT a1, b1, MIN(1) `
`     FROM t1 `
`     GROUP BY a1, b1;`
` `
`     SELECT a1, b1, 1 `
`     FROM t1 `
`     GROUP BY a1, b1;`

The Optimizer rewrites both of these requests to use the aggregate join index aji as follows.

`     SELECT a1, b1, 1 `
`     FROM aji;`

CHECK and referential integrity constraints can also provide useful information to the Optimizer in the absence of statistics.

For example, consider a column called patient_sex, which can have only 2 values: M and F. Generally, users enforce the sanity of this column with a database CHECK constraint such as `patient_sex IN ("M","F")`.

If there are no statistics on this column and a query specifies the predicate `patient_sex = 'M'`, the system would normally assume that 10% of the rows qualify by default because of the equality condition, but because this column can have only 2 values, a more reasonable default is 50%.

The Optimizer also considers CHECK constraints such as open or closed ranges, and IN lists when statistics are not available. Open ranges help in some scenarios when a query predicate can close the range. For example, if a CHECK constraint specifies the open range `x1 > 10`, and a query specifies the predicate `x1 < 20`, their combination produces a closed range that can then be used to derive the column demographics.

Similarly, if there are no statistics on the child table in a Primary Key-Foreign Key relationship, and statistics exist for the parent table PK column, many of its demographics, such as number of unique values, can be derived and used for the FK column in the child table. This information can be used for single‑table estimates and in join planning for join cardinality and rows per value estimates. These kinds of estimates can also be useful for large child tables in those situations where collecting statistics is prohibitively expensive.

This mechanism can be effectively used in conjunction with non‑enforced referential integrity constraints (also known as soft referential integrity constraints) to provide child table demographic information to the Optimizer without collecting statistics.

Note that this mechanism assumes that the child table has all the values of the parent table. If there is only a subset of values, or if the value set in this join column is significantly skewed, then you should collect statistics on the child column to avoid skewed redistributions and underestimates of cardinalities.

The derived statistics framework collects this kind of information from CHECK and referential constraints in the prejoin planning stage of query optimization and propagates them from there to the join planning stage.

The following table definitions are used for the query that illustrates the flow of derived statistics usage by the Optimizer to generate more accurate cardinality estimates.

`     CREATE TABLE t1 (`
`       a1 INTEGER, `
`       b1 INTEGER, `
`       c1 CHARACTER(5), `
`       d1 DATE);`
` `
`     CREATE TABLE t2 (`
`       a2 INTEGER PRIMARY KEY, `
`       b2 INTEGER, `
`       c2 CHARACTER(1) CHECK (c2 IN ('M', 'F')), `
`       d2 DATE);`
` `
`     CREATE TABLE t4 (`
`       a4 INTEGER REFERENCES t2.a2, `
`       b4 INTEGER, `
`       c4 CHARACTER(1), `
`       d4 DATE );`

Given these table definitions, the flow chart that follows shows how the Optimizer uses derived statistics based on CHECK and referential integrity constraints to more accurately estimate cardinalities for the following query using.

`     SELECT *`
`     FROM t1, t2, t4`
`     WHERE c1 = ‘MSFT’`
`     AND   b1 > 30`
`     AND   b4 = b1`
`     AND   a1 = a4`
`     AND   c2 = c4;`

The stages of deriving various statistics for this query are as follows.

1 Refine and derive base table statistics using CHECK and referential integrity constraints.

• No statistics have been collected on t4.a4, but statistics have been collected on t4.a2, and there is a referential integrity constraint from t4.a2 to t4.a4.
• Using this RI constraint to infer that the unique cardinalities must be identical for the 2 columns, the system inserts the number of unique values for t4.a2 into the derived statistics for t4.a4 (see “Statistical Inheritance by Hash and Join Indexes” on page 220 and “Using CHECK Constraints and Referential Integrity Constraints to Derive Column Demographics” on page 227).

• No statistics have been collected on t2.c2, but t2.c2 has a CHECK constraint that limits its possible values to only 2 (M or F), so the system inserts that value into the derived statistics for t2.c2 (see “Using CHECK Constraints and Referential Integrity Constraints to Derive Column Demographics” on page 227).
• 2 Join tables t1 and t4, consuming the terms b4=b1 and a1=14 and producing the interim join relation R1.

3 Join interim relation R1 with t2, consuming the term c2=c4 and producing the join relation R2 with the following set of derived statistics cardinalities.

 R2 Column Set Number of Unique Values (b1) 1000 (c1) 2000 (a1, a4) 100 (b1, b4) 1000 (c2, c4) 2

where:

 Column heading … Represents the … Col column set for which the number of unique values is derived by the derived statistics subsystem. NUV number of unique values for a column set as derived by the derived statistics subsystem.

Single‑table estimation logic determines the probable number of qualified unique values and qualified high modal frequency, along with the number of qualified rows, for every predicate. The Optimizer can use this information for cases where a column specified in the single‑table predicate is also specified in subsequent operations such as joins and aggregations.

There are 2 ways to derive this information.

• Using the base table statistics
• Using the statistics on a single‑table join index
• By using the base table statistics, Teradata Database can derive the column demographics using the single‑table predicate on the column specified in the predicate. However, the derived information might not be adjusted later based on other single‑table predicates. This is in line with the assumption of column independence in the absence of column correlation information.

For example, suppose you have the following query.

`     SELECT * `
`     FROM t1, t2 `
`     WHERE   t1.d1 = t2.d2 `
`     AND     t1.d1 BETWEEN '1999-01-01' AND '2000-01-01' `
`     AND     t1.c1 > 10;`

Using base table statistics, the qualified number of unique values and qualified high modal frequency are derived and saved in the derived statistics entry for column t1.d1.

On the other hand, by using the available single‑table join indexes, the Optimizer can get the demographics of the columns after applying all single‑table predicates.

For example, suppose you have defined the following join index to support the previous query.

`     CREATE JOIN INDEX ji AS`
`       SELECT t1.d1 `
`       FROM t1 `
`       WHERE   t1.d1 BETWEEN '1999-01-01' AND '2000-01-01' `
`       AND     t1.c1 > 10;`

If an interval histogram is available for the join index column t1.d1, then it is given higher precedence and captured in the derived statistics because it reflects the demographics of the column t1.d1 after both single‑table predicates have been applied. See “Using Join Index Statistics to Estimate Single‑Table Expression Cardinalities” on page 260 for information about how the Optimizer can use single‑table join index statistics to estimate the cardinalities of complex expressions that reference base table columns.

The derived information is then reused in join planning stages such as join cardinality estimation, rows per value estimation, and skew detection.

In general, the Optimizer considers any predicate that can be used for cardinality estimates for a query plan with the exception of complicated predicates. The following table provides some examples of predicates the Optimizer does consider and the statistics it uses to make cardinality estimates for those predicates.

 Predicate Definition IS NULL Self‑defined. Self‑defined. IS NOT NULL If nulls are present, the value is calculated as (TotalValues - 1). If no nulls are present, the value is just TotalValues. TotalRows - Number of Nulls Single equality condition Self‑defined. If Loner/Mode is in any interval, the value is its frequency. If Loner/Mode is not in any interval, the value is calculated as . Index with equality conditions on all columns Self‑defined. Self‑defined. NE TotalValues - 1 HighModeFreq of the distribution excluding the HighModeFreq for this value. LT LE GT GE Values count derived from qualified intervals. HighModeFreq from the qualified intervals. BETWEEN LowValues + HighValues - TotalValues HighModeFreq from the qualified range intervals. ORed term IN list The number of elements in the ORed list. The maximum frequency of all the ORed elements. NOT IN term TotalValues - Number of elements in NOT IN list The HighMode after excluding all IN list elements.

An example of a term type that is not considered is ORed terms that reference multiple columns such as `x1=10 OR y1>20`.

If single‑table predicates such as EQ, IN lists, ORed predicates or closed ranges, are specified in a query, Teradata Database can derive some information from them, such as the number of unique values, and the derived information can then be used later in join planning if column statistics are not available.

An example of where such statistics would be derived is the following query.

`     SELECT * `
`     FROM t1, t2 `
`     WHERE t1.d1 = t2.d2 `
`     AND   t1.d1 in (10, 20);`

Derived statistics captures and propagates the number of unique values information to the Join Planner based on certain single‑table predicates even when no statistics have been collected on this column.

The demographics of the join columns are required after applying the single‑table predicates to a join request. The Optimizer can use multicolumn statistics to derive the demographics of the join columns using some single‑table predicates. This is another automated way of deriving column correlations using multicolumn statistics.

For example, suppose you submit the following query.

`     SELECT * `
`     FROM t1, t2 `
`     WHERE t1.d1 = t2.d2 `
`     AND   t1.x1 > 10;`

The Join Planner needs to know the demographics of the join column t1.d1 after applying the single‑table predicate `t1.x1>10`. Derived statistics derives the demographics of t1.d1 if there are multicolumn statistics (x1, d1) on t1. For PPI tables, if there is row partition elimination and the multicolumn PARTITION statistics are available, then the demographics of the join columns are derived based on the qualified partitions.

Note that the order of the columns in multicolumn statistics is important for these estimates. The columns are reordered internally based on their ascending field IDs irrespective of the order you specify for collecting multicolumn statistics. So if column d1 has smaller field ID than column x1, the multicolumn statistics are ordered internally as (d1, x1). In this case, it is not possible to derive the demographics of d1 for a given predicate on x1. Note that for multicolumn PARTITION statistics the PARTITION column is always the first column because it always has an internal field ID of 0.

Some of the single‑table predicates the Optimizer considers for deriving the column correlations are equality conditions (`x1=10`), IN lists (`x1 IN (10, 20, 30)`), simple ORed lists on the same columns (`x1=10 OR x1=20`), and range predicates (`x1 BETWEEN 10 AND 30, x1>10`), and so on.

Also see “Using Join Index Statistics to Estimate Single‑Table Expression Cardinalities” on page 260 for information about how the Optimizer can use single‑table join index statistics to estimate the cardinalities of complex expressions that reference base table columns.

The derived statistics framework discovers hierarchical relationships for denormalized tables using statistics. The relationships are saved as value mappings in the derived statistics. The only mappings used to adjust demographics are those with 1   n relationships.

A combination of single and multicolumn statistics is required to detect these relationships.

A change to a single column cascades through the entire hierarchical chain. For example, if the value mapping for x   y is 1   5 and the value mapping for y  z is 1   10, then if one value is removed from x, 5 values are removed from y, and 10 values are removed from z.

More concretely, if the relationship between region   nation is discovered to be 1   5, then if one region is selected, the Optimizer detects that only 5 nations qualify.

In other words, if one region is disqualified by a single‑table predicate, the Optimizer removes 5 nations from the nation column.

The following example illustrates this logic. Consider the denormalized dimension table customer_nation_region, which is defined as follows.

`     CREATE SET TABLE cust_nation_region (`
`       c_custkey       INTEGER,`
`       c_name          VARCHAR(25) CHARACTER SET LATIN CASESPECIFIC,`
`       c_address       VARCHAR(40) CHARACTER SET LATIN CASESPECIFIC,`
`       c_nationkey     INTEGER,`
`       c_phone         CHAR(15) CHARACTER SET LATIN CASESPECIFIC,`
`       c_acctbal       DECIMAL(15,2),`
`       c_mktsegment    CHAR(10) CHARACTER SET LATIN CASESPECIFIC,`
`       c_comment       VARCHAR(117) CHARACTER SET LATIN CASESPECIFIC,`
`       c_maritalstatus CHAR(1) CHARACTER SET LATIN NOT CASESPECIFIC,`
`       n_name          CHAR(25) CHARACTER SET LATIN CASESPECIFIC,`
`       n_comment       VARCHAR(152) CHARACTER SET LATIN CASESPECIFIC,`
`       r_regionkey     INTEGER,`
`       r_name          CHAR(25) CHARACTER SET LATIN CASESPECIFIC,`
`       r_comment       VARCHAR(152) CHARACTER SET LATIN CASESPECIFIC)`
`     PRIMARY INDEX (c_custkey);`

In this example, the relationships are derived from a combination of single‑ and multicolumn statistics. The following statistics are required to discover the relationships.

• Single‑column on r_regionkey
• Single‑column on r_regionkey
• Single‑column on r_regionkey
• Multicolumn on (r_regionkey, n_nationkey)
• Multicolumn on (n_nationkey, c_custkey)
• The relationships discovered are.

• region   nation is 1   5
• nation   customer is 1   24,000
• You can see an example of this in the following query, which specifies a predicate that selects only one regionkey value, and its EXPLAIN text.

`     EXPLAIN SELECT n_nationkey `
`     FROM cust_nation_region `
`     WHERE r_regionkey = 1 `
`     GROUP BY 1;`
` `
`     Explanation`
`     -----------------------------------------------------------------------`
`     1) First, we lock a distinct TPCD_OCES3."pseudo table" for read on a`
`        RowHash to prevent global deadlock for TPCD_OCES3.cust_nation_region.`
`     2) Next, we lock TPCD_OCES3.cust_nation_region for read.3) We do an all-AMPs SUM         step to aggregate from TPCD_OCES3.cust_nation_region by way of an all-rows`
`        scan with a condition of ("TPCD_OCES3.cust_nation_region.R_REGIONKEY = 1"),`
`        grouping by field1 ( TPCD_OCES3.cust_nation_region.N_NATIONKEY).`
`        Aggregate Intermediate Results are computed globally, then placed`
`        in Spool 3.  The size of Spool 3 is estimated with low confidence`
`        to be 5 rows (525 bytes).  The estimated time for this step is`
`        0.62 seconds.`
`     -> The total estimated time is 0.63 seconds. `

From the known hierarchical relationship between region and nation, which is 1   5, the predicted number of nationkey rows is 5, which is verified by the EXPLAIN text highlighted in boldface type. By generalizing, if you were to write a request that selected 2 regionkey values, there would be 10 nationkey rows in the result set, and so on.

If there are multiple join predicates, and the join columns have some degree of correlation, multicolumn statistics are necessary to more accurately estimate join cardinality, rows per value, skew detection, and related statistics. Without having multicolumn statistics, and assuming column independence, the Optimizer multiplies the individual numbers of unique values to get an estimate of the combined number of unique values. However, if multicolumn statistics exist that are a superset of the join columns, they might provide more useful information about the column correlations than the column independence assumption would provide.

For example, consider the following query.

`     SELECT * `
`     FROM t1, t2 `
`     WHERE t1.d1 = t2.d2 `
`     AND   t1.x1 = t2.x2;`

The system requires multicolumn statistics on (t1.d1, t1.x1) and (t2.d2, t2.x2) to be able to make accurate cardinality and costing estimates. If there are individual statistics on t1.d1, with 100 unique values, and t1.x1, with 50 unique values, and assuming column independence, the unique values are multiplied to calculate the combined maximum number of unique values, `100 * 50 = 5,000`, and having an upward bound of the total number of rows.

The derived statistics subsystem identifies such opportunities in the prejoin planning stage and derives dummy multicolumn statistics based on super sets. For example, if there is a multicolumn statistics collection that is a superset, such as (x1, y1, d1), with only 2,000 unique values, the Optimizer can cap the estimated unique value cardinality of 5,000 to a more accurate 2,000 by deriving a derived statistics entry (x1, y1) with 2,000 unique values, and then propagate the statistic to Join Planning.

Similarly, if there are individual entries and multicolumn statistics entries such as (x1), (d1) and (x1, d1) and single‑table predicates such as `x1 IN (1,2) AND d1 < 20`, the Optimizer can update the demographics of the single‑column entries while making its single‑table estimates. Based on the updated information from the single‑column entries, the multicolumn entry of derived statistics is updated and propagated.

The Optimizer assumes join uniformity when it estimates join cardinalities and costs. In other words, it assumes that every value from the left relation in the join finds a match in the right relation if the number of values in the left relation is less than the number of right values in the right relation, and vice versa.

Based on this assumption, and given the join predicate `t1.d1 = t2.d2`, if t1.d1 has 100 unique values and t2.d2 has 50 unique values, then after the join, t1.d1 should have only 50 unique values. The Optimizer also considers multicolumn statistics for making adjustments. For example, given the following predicate and (x1, y1) with 100 unique values and (x2, y2) with 50 unique values, after the join, the number of (x1, y1) unique values is adjusted to 50.

`     t1.x1=t2.x2 AND t1.y1=t2.y2`

The derived statistics subsystem performs this kind of analysis and adjusts the demographics of the join columns such as the number of unique values and the high modal frequency after each binary join and then propagates the adjusted demographic information to the next stages of the join planning process.

If column correlation information is available, the Optimizer uses it to adjust the other column demographics based on the adjustments to the join column. Otherwise, the join column is assumed to be independent of the other columns.

Assume that t1.d1 has 100 unique values, t2.d2 has 50 unique values, and t3.d3 has 200 unique values.

Suppose you submit the following query.

`     SELECT t3.d3, COUNT (*) `
`     FROM t1, t2, t3 `
`     WHERE t1.d1 = t2.d2 `
`     AND   t1.d1 = t3.d3 `
`     GROUP BY 1;`

Assuming the join order is (t1 x t2) x t3, after the first join, the derived statistics subsystem adjusts the number of unique values of t1.d1 to the minimum number of unique values (t1.d1, t2.d2) = 50.

After the second join, the derived statistics subsystem adjusts the number of unique values of t3.d3 to 50, which is then used in the final aggregate estimate because t3.d3 is specified as a grouping column.

Some join predicates can become redundant after the system has made one or more binary joins. A predicate is redundant when it adds nothing to the overall selectivity for a query because its effect is equivalent to one or more other predicates that are also specified in that query.

The redundancy can result from either transitive closure or from user‑defined query conditions. If such redundant predicates are not properly identified and handled during cardinality estimates and costing, they can have a negative effect on the optimization of the query, possibly leading to non‑optimal query plans.

For example, suppose you submit the following query.

`     SELECT * `
`     FROM t1, t2, t3 `
`     WHERE t1.d1 = t2.d2 `
`     AND   t1.d1 = t3.d3;`

Transitive closure (see “Predicate Simplification” on page 94) derives a new predicate `t2.d2 = t3.d3` for this request. Assuming the join order (t1 x t2  j1) x t3, the join predicates for the second join are `j1.d1 = t3.d3` and `j1.d2 = t3.d3`.

While estimating the number of unique values of (d1,d2) for join j1, the Optimizer needs to be aware that these 2 columns had already been equated in the previous join, and it should not multiply their individual unique values to get the combined number of unique values. Instead, it should use MIN(j1.d1, j1.d2) as the combined number of unique values.

The derived statistics infrastructure builds the appropriate entries by combining all the connected joined columns into EquiSets after every binary join. This way, the subsequent joins seamlessly handle the redundancy.

In join planning, an EquiSet is a set of columns that was equated in predicates from a previous join operation in the same query. Propagating EquiSets to subsequent join operations for reuse in a query is a fundamental component of the derived statistics framework.

For example, in the preceding example, an EquiSet derived statistics entry (j1.d1, j1.d2) exists after join j1, with the minimum number of values of d1 and d2. When the subsequent join requests the unique values for the combination (j1.d1, j1.d2) the Optimizer automatically uses the existing EquiSet entry.

As demonstrated in the preceding topics, information such as the number of unique values and the high modal frequency can be taken or derived from multiple sources such as single‑table join indexes, multicolumn statistics, dynamic AMP samples, and interval histograms, among others.

It is not trivial to determine which of the available sources is likely to be the most useful or trustworthy. The following criteria all play an important role in making the determination.

• The information source captures the highest correlation.
• The information source covers the greatest number of single-table predicates.
• The information source is the most current (or, phrased another way, the least stale) source of information available.
• For example, it is possible to have a join index producing non‑correlated demographic information covering multiple single-table predicates, while at the same time there might be a multicolumn statistic covering one single-table predicate that produces highly correlated information for a join column.

Because of issues such as this, it is not possible to define precedence rules based on heuristics. To work around this restriction, the Optimizer quantifies the information and defines precedence dynamically based on the derived quantities.

• The first 2 criteria, highest correlation and greatest number of single‑table predicates, can be quantified by using the number of unique values.
• This translates to using the information source that provides the fewest unique values.

• The third criterion, least stale statistics, is also considered and explained in “Stale Statistics” on page 273.
• Column demographics are propagated to all temporary and committed joins using the derived statistics framework. As a result, demographics can be dynamically adjusted after each join, and the Join Planner does not need to reference base table interval histograms to retrieve the statistics.

If you specify complex predicates or outer joins in a query, they might become materialized into spool relations while the Optimizer is performing join planning within a query block.

Suppose you submit the following outer join query with a complex ON clause predicate.

`     SELECT * `
`     FROM t1 INNER JOIN t2 ON (x1=x2) `
`             LEFT OUTER JOIN t3 ON (x2 NOT IN (SELECT x4 `
`                                               FROM t4) `
`                                           OR y3=10);`

The ON clause predicate `x2 NOT IN (SELECT x4 FROM t4) OR y3=10` in this query makes this a complex outer join (if a subquery is specified in an ON clause, the predicate is classified as either semicomplex or complex).

One of the requirements for processing a complex outer join is to have only one left relation, so the left relations (t1, t2) are materialized into a single spool relation before the system processes the complex join. After materializing the left relations, derived statistics propagates the projected column demographic information for the materialized instance to the subsequent stages of join planning.

To be able to do aggregate and ordered-analytic cardinality estimates after a series of joins, the demographic information of the grouping columns needs to be propagated through the join operations until final aggregations are done. The derived statistics subsystem adjusts the demographic information based on single‑table and join predicates and then propagates it forward until final aggregation completes.

For example, suppose you submit the following query.

`     SELECT t1.x1, t2.x2, COUNT (*) `
`     FROM t1, t2 `
`     WHERE t1.x1 =10 `
`     AND   t1.y2 = t2.x2 `
`     GROUP BY 1, 2;`

In the prejoin planning stage, derived statistics detects the single‑table equality term on column x1 (`t1.x1=10`) and adjusts the number of unique values to 1. Also while doing join planning, the number of unique values of t2.x2 is adjusted to the minimum number of values of (t1.y2, t2.x2). This improves the aggregate cardinality estimates, which can benefit the overall plan if this query block is a spooled derived table or a view.

Suppose you submit the following query.

`     SELECT * `
`     FROM t1, (SELECT x2, x3, COUNT(*) `
`               FROM t2, t3 `
`               WHERE x2=10 `
`               AND   t2.x3=t3.x3 `
`               GROUP BY 1, 2) AS dt `
`     WHERE t1.x1=dt.x3;`

The derived statistics framework carries the column demographic information of the derived table dt for columns x2 and x3 by making adjustments based on the derived table predicates. The adjusted demographics are then propagated to the join planning for the outer block.