Deriving Column Demographics | Optimizer Process | Teradata Vantage - Deriving Column Demographics - Advanced SQL Engine - Teradata Database

SQL Request and Transaction Processing

Product
Advanced SQL Engine
Teradata Database
Release Number
17.10
Published
July 2021
Language
English (United States)
Last Update
2021-07-28
dita:mapPath
uqf1592445067244.ditamap
dita:ditavalPath
uqf1592445067244.ditaval
dita:id
B035-1142
lifecycle
previous
Product Category
Teradata Vantage™

The Optimizer acquires derived statistics in several different ways. The next several topics introduce the more common acquisition methods of deriving statistics.

Using Single-Table Sparse Join Indexes to Derive Column Demographics

If there are single-table sparse join indexes that either fully or partially cover the single-table predicates of a relation, the Optimizer can use them to derive the demographics of the residual columns. This mechanism provides an automated way of deriving column correlations using single-table join indexes.

For example, suppose you have defined the following join index:

CREATE JOIN INDEX ji AS
 SELECT c1, d1
 FROM t1
 WHERE x1 > 10
 AND   y1 =10;

You then submit the following query:

SELECT t1.c1, COUNT (*)
FROM t1, t2
WHERE t1.d1 = t2.d2
AND   t1.x1 > 10
AND   t1.y1 = 10
GROUP BY 1;

Using the statistics on ji, the system can derive the demographics of the columns t1.c1 and t1.d1 after applying the single-table predicates t1.x1 > 10 and t1.y1 = 10. The newly derived demographics can then be used in join planning for the join column t1.d1 and in aggregate estimates to determine the number of groups for the grouping column t1.c1.

Note that partially covering join indexes can also used to derive the demographics in all applicable cases. For example, suppose you define the following join index.

CREATE JOIN INDEX ji AS
  SELECT c1, d1
  FROM t1
  WHERE x1 > 10;

You then submit the following query:

SELECT t1.c1, COUNT (*)
FROM t1, t2
WHERE t1.d1 = t2.d2
AND   t1.x1 > 10
AND   t1.y1 = 10
GROUP BY 1;

In this case, the demographic information from the partially covering join index is used for columns c1 and d1. If the demographic information is available from both the base table and the join index, then the join index information is given higher precedence because it also captures the correlation information with respect to some single-table predicates.

Using Aggregate Join Indexes to Derive Column Demographics

The Optimizer can derive the unique values of grouping columns from aggregate join indexes. The derived values can then be used in join planning. This is another strategy to derive the column correlation information among the grouping column set automatically, and also after applying the single-table predicates.

For example, suppose you have defined the following aggregate join index:

CREATE JOIN INDEX ji AS
  SELECT a1, SUM (b1)
  FROM t1
  WHERE x1 > 10
  AND y1 = 10
  GROUP BY c1, d1;

You then submit the following query:

SELECT *
FROM t1, t2
WHERE t1.c1 = t2.c2
AND   t1.d1 = t2.d2
AND   x1 > 10
AND   y1 = 10;

The cardinality of the aggregate join index is the number of unique values of the columns (t1.c1, t2.d1). If the unique values for these columns are already available from base table statistics, the system updates them with the information from the aggregate join index; otherwise, it creates new derived statistics entries. The unique values can then be used in join planning for estimating join cardinality, rows per value, skew detection, and so on.

Note that partially covered aggregate join indexes are also handled in the same way as non-aggregate single-table sparse join indexes, and that the statistics from aggregate join indexes are used automatically if that index replaces a derived table or view.

Using CHECK Constraints and Referential Integrity Constraints to Derive Column Demographics

CHECK and referential integrity constraints can also provide useful information to the Optimizer in the absence of statistics. For example, consider a column called patient_sex, which can have only 2 values: M and F. Generally, users enforce the sanity of this column with a database CHECK constraint such as patient_sex IN ("M","F").

If there are no statistics on this column and a query specifies the predicate patient_sex = 'M', the system would normally assume that 10% of the rows qualify by default because of the equality condition, but because this column can have only 2 values, a more reasonable default is 50%.


Total qualifying rows equation

The Optimizer also considers CHECK constraints such as open or closed ranges, and IN lists when statistics are not available. Open ranges help in some scenarios when a query predicate can close the range. For example, if a CHECK constraint specifies the open range x1 > 10, and a query specifies the predicate x1 < 20, their combination produces a closed range that can then be used to derive the column demographics.

Similarly, if there are no statistics on the child table in a Primary Key-Foreign Key relationship, and statistics exist for the parent table PK column, many of its demographics, such as number of unique values, can be derived and used for the FK column in the child table. This information can be used for single-table estimates and in join planning for join cardinality and rows per value estimates. These kinds of estimates can also be useful for large child tables in those situations where collecting statistics is prohibitively expensive.

This mechanism can be effectively used in conjunction with non-enforced referential integrity constraints (also known as soft referential integrity constraints) to provide child table demographic information to the Optimizer without collecting statistics.

Note that this mechanism assumes that the child table has all the values of the parent table. If there is only a subset of values, or if the value set in this join column is significantly skewed, then you should collect statistics on the child column to avoid skewed redistributions and underestimates of cardinalities.

The derived statistics framework collects this kind of information from CHECK and referential constraints in the prejoin planning stage of query optimization and propagates them from there to the join planning stage.

Using Single-Table Predicates to Derive Column Demographics

Single-table estimation logic determines the probable number of qualified unique values and qualified high modal frequency, along with the number of qualified rows, for every predicate. The Optimizer can use this information for cases where a column specified in the single-table predicate is also specified in subsequent operations such as joins and aggregations.

The system can derive this information in the following ways:
  • Using the base table statistics
  • Using the statistics on a single-table join index

By using the base table statistics, Vantage can derive the column demographics using the single-table predicate on the column specified in the predicate. However, the derived information might not be adjusted later based on other single-table predicates. This is in line with the assumption of column independence in the absence of column correlation information.

For example, suppose you have the following query:

SELECT *
FROM t1, t2
WHERE   t1.d1 = t2.d2
AND     t1.d1 BETWEEN '1999-01-01' AND '2000-01-01'
AND     t1.c1 > 10;

Using base table statistics, the qualified number of unique values and qualified high modal frequency are derived and saved in the derived statistics entry for column t1.d1.

On the other hand, by using the available single-table join indexes, the Optimizer can get the demographics of the columns after applying all single-table predicates.

For example, suppose you have defined the following join index to support the previous query:

CREATE JOIN INDEX ji AS
  SELECT t1.d1
  FROM t1
  WHERE   t1.d1 BETWEEN '1999-01-01' AND '2000-01-01'
  AND     t1.c1 > 10;

If an interval histogram is available for the join index column t1.d1, then it is given higher precedence and captured in the derived statistics because it reflects the demographics of the column t1.d1 after both single-table predicates have been applied. See Using Join Index Statistics to Estimate Single-Table Expression Cardinalities for information about how the Optimizer can use single-table join index statistics to estimate the cardinalities of complex expressions that reference base table columns.

The derived information is then reused in join planning stages such as join cardinality estimation, rows per value estimation, and skew detection.

In general, the Optimizer considers any predicate that can be used for cardinality estimates for a query plan with the exception of complicated predicates. The following table provides some examples of predicates the Optimizer does consider and the statistics it uses to make cardinality estimates for those predicates.

Predicate Definition
IS NULL Self-defined.
Self-defined.
IS NOT NULL
  • If nulls are present, the value is calculated as (TotalValues - 1).
  • If no nulls are present, the value is just TotalValues.
TotalRows - Number of Nulls
Single equality condition Self-defined.
  • If Loner/Mode is in any interval, the value is its frequency.
  • If Loner/Mode is not in any interval, the value is calculated as .
Index with equality conditions on all columns Self-defined.
Self-defined.
NE TotalValues - 1
HighModeFreq of the distribution excluding the HighModeFreq for this value.
  • LT
  • LE
  • GT
  • GE
Values count derived from qualified intervals.
HighModeFreq from the qualified intervals.
BETWEEN LowValues + HighValues - TotalValues
HighModeFreq from the qualified range intervals.
  • ORed term
  • IN list
The number of elements in the ORed list.
The maximum frequency of all the ORed elements.
NOT IN term TotalValues - Number of elements in NOT IN list
The HighMode after excluding all IN list elements.

Examples of a term type that is not considered are ORed terms that reference multiple columns such as x1=10 OR y1>20.

Using Single-Table Predicates to Derive Column Demographics in the Absence of Statistics

If single-table predicates such as EQ, IN lists, ORed predicates or closed ranges, are specified in a query, Vantage can derive some information from them, such as the number of unique values, and the derived information can then be used later in join planning if column statistics are not available.

An example of where such statistics would be derived is the following query:

SELECT *
FROM t1, t2
WHERE t1.d1 = t2.d2
AND   t1.d1 in (10, 20);

Derived statistics captures and propagates the number of unique values information to the Join Planner based on certain single-table predicates even when no statistics have been collected on this column.

Using Single-Table Predicates and Multicolumn (PARTITION) Statistics to Derive Column Demographics

The demographics of the join columns are required after applying the single-table predicates to a join request. The Optimizer can use multicolumn statistics to derive the demographics of the join columns using some single-table predicates. This is another automated way of deriving column correlations using multicolumn statistics.

For example, suppose you submit the following query:

SELECT *
FROM t1, t2
WHERE t1.d1 = t2.d2
AND   t1.x1 > 10;

The Join Planner needs to know the demographics of the join column t1.d1 after applying the single-table predicate t1.x1>10. Derived statistics derives the demographics of t1.d1 if there are multicolumn statistics (x1, d1) on t1. For PPI tables, if there is row partition elimination and the multicolumn PARTITION statistics are available, then the demographics of the join columns are derived based on the qualified partitions.

Note that the order of the columns in multicolumn statistics is important for these estimates. The columns are reordered internally based on their ascending field IDs irrespective of the order you specify for collecting multicolumn statistics. So if column d1 has smaller field ID than column x1, the multicolumn statistics are ordered internally as (d1, x1). In this case, it is not possible to derive the demographics of d1 for a given predicate on x1. Note that for multicolumn PARTITION statistics the PARTITION column is always the first column because it always has an internal field ID of 0.

Some of the single-table predicates the Optimizer considers for deriving the column correlations are equality conditions (x1=10), IN lists (x1 IN (10, 20, 30)), simple ORed lists on the same columns (x1=10 OR x1=20), and range predicates (x1 BETWEEN 10 AND 30, x1>10), and so on.

Also see Using Join Index Statistics to Estimate Single-Table Expression Cardinalities for information about how the Optimizer can use single-table join index statistics to estimate the cardinalities of complex expressions that reference base table columns.

Using Single-Column and Multicolumn Statistics to Discover Hierarchical Relationships in Denormalized Schemas

The derived statistics framework discovers hierarchical relationships for denormalized tables using statistics. The relationships are saved as value mappings in the derived statistics. The only mappings used to adjust demographics are those with 1 -->  n relationships.

A combination of single and multicolumn statistics is required to detect these relationships.

A change to a single column cascades through the entire hierarchical chain. For example, if the value mapping for x --> y is 1 --> 5 and the value mapping for y-->  z is 1 --> 10, then if one value is removed from x, 5 values are removed from y, and 10 values are removed from z.

More concretely, if the relationship between region --> nation is discovered to be 1 -->  5, then if one region is selected, the Optimizer detects that only 5 nations qualify.

In other words, if one region is disqualified by a single-table predicate, the Optimizer removes 5 nations from the nation column.

The following example illustrates this logic. Consider the denormalized dimension table customer_nation_region, which is defined as follows:

CREATE SET TABLE cust_nation_region (
  c_custkey       INTEGER,
  c_name          VARCHAR(25) CHARACTER SET LATIN CASESPECIFIC,
  c_address       VARCHAR(40) CHARACTER SET LATIN CASESPECIFIC,
  c_nationkey     INTEGER,
  c_phone         CHAR(15) CHARACTER SET LATIN CASESPECIFIC,
  c_acctbal       DECIMAL(15,2),
  c_mktsegment    CHAR(10) CHARACTER SET LATIN CASESPECIFIC,
  c_comment       VARCHAR(117) CHARACTER SET LATIN CASESPECIFIC,
  c_maritalstatus CHAR(1) CHARACTER SET LATIN NOT CASESPECIFIC,
  n_name          CHAR(25) CHARACTER SET LATIN CASESPECIFIC,
  n_comment       VARCHAR(152) CHARACTER SET LATIN CASESPECIFIC,
  r_regionkey     INTEGER,
  r_name          CHAR(25) CHARACTER SET LATIN CASESPECIFIC,
  r_comment       VARCHAR(152) CHARACTER SET LATIN CASESPECIFIC)
  PRIMARY INDEX (c_custkey);
In this example, the relationships are derived from a combination of single- and multicolumn statistics. The following statistics are required to discover the relationships:
  • Single-column on r_regionkey
  • Single-column on r_regionkey
  • Single-column on r_regionkey
  • Multicolumn on (r_regionkey, n_nationkey)
  • Multicolumn on (n_nationkey, c_custkey)
The relationships discovered are.
  • region -->  nation is 1 --> 5
  • nation -->  customer is 1 -->  24,000
You can see an example of this in the following query, which specifies a predicate that selects only one regionkey value, and its EXPLAIN text:
EXPLAIN SELECT n_nationkey
FROM cust_nation_region
WHERE r_regionkey = 1
GROUP BY 1;
The following is a portion of the EXPLAIN output:
...
 2) Next, we lock TPCD_OCES3.cust_nation_region for read.3) We do an all-AMPs SUM         step to aggregate from TPCD_OCES3.cust_nation_region by way of an all-rows
        scan with a condition of ("TPCD_OCES3.cust_nation_region.R_REGIONKEY = 1"),
        grouping by field1 ( TPCD_OCES3.cust_nation_region.N_NATIONKEY).
        Aggregate Intermediate Results are computed globally, then placed
        in Spool 3.  The size of Spool 3 is estimated with low confidence 
        to be 5 rows  (525 bytes). The estimated time for this step is
        0.62 seconds.
     -> The total estimated time is 0.63 seconds.

From the known hierarchical relationship between region and nation, which is 1 -->  5, the predicted number of nationkey rows is 5, which is verified by the EXPLAIN text highlighted in boldface type. By generalizing, if you were to write a request that selected 2 regionkey values, there would be 10 nationkey rows in the result set, and so on.

Using Subset and Superset Statistics to Derive Column Demographics

If there are multiple join predicates, and the join columns have some degree of correlation, multicolumn statistics are necessary to more accurately estimate join cardinality, rows per value, skew detection, and related statistics. Without having multicolumn statistics, and assuming column independence, the Optimizer multiplies the individual numbers of unique values to get an estimate of the combined number of unique values. However, if multicolumn statistics exist that are a superset of the join columns, they might provide more useful information about the column correlations than the column independence assumption would provide.

For example, consider the following query:

SELECT *
FROM t1, t2
WHERE t1.d1 = t2.d2
AND   t1.x1 = t2.x2;

The system requires multicolumn statistics on (t1.d1, t1.x1) and (t2.d2, t2.x2) to be able to make accurate cardinality and costing estimates. If there are individual statistics on t1.d1, with 100 unique values, and t1.x1, with 50 unique values, and assuming column independence, the unique values are multiplied to calculate the combined maximum number of unique values, 100 * 50 = 5,000, and having an upward bound of the total number of rows.

The derived statistics subsystem identifies such opportunities in the prejoin planning stage and derives dummy multicolumn statistics based on super sets. For example, if there is a multicolumn statistics collection that is a superset, such as (x1, y1, d1), with only 2,000 unique values, the Optimizer can cap the estimated unique value cardinality of 5,000 to a more accurate 2,000 by deriving a derived statistics entry (x1, y1) with 2,000 unique values, and then propagate the statistic to Join Planning.

Similarly, if there are individual entries and multicolumn statistics entries such as (x1), (d1) and (x1, d1) and single-table predicates such as x1 IN (1,2) AND d1 < 20, the Optimizer can update the demographics of the single-column entries while making its single-table estimates. Based on the updated information from the single-column entries, the multicolumn entry of derived statistics is updated and propagated.

Using Join Predicates to Derive Column Demographics

The Optimizer assumes join uniformity when it estimates join cardinalities and costs. In other words, it assumes that every value from the left relation in the join finds a match in the right relation if the number of values in the left relation is less than the number of right values in the right relation, and vice versa.

Based on this assumption, and given the join predicate t1.d1 = t2.d2, if t1.d1 has 100 unique values and t2.d2 has 50 unique values, then after the join, t1.d1 should have only 50 unique values. The Optimizer also considers multicolumn statistics for making adjustments. For example, given the following predicate and (x1, y1) with 100 unique values and (x2, y2) with 50 unique values, after the join, the number of (x1, y1) unique values is adjusted to 50.

t1.x1=t2.x2 AND t1.y1=t2.y2

The derived statistics subsystem performs this kind of analysis and adjusts the demographics of the join columns such as the number of unique values and the high modal frequency after each binary join and then propagates the adjusted demographic information to the next stages of the join planning process.

If column correlation information is available, the Optimizer uses it to adjust the other column demographics based on the adjustments to the join column. Otherwise, the join column is assumed to be independent of the other columns.

Assume that t1.d1 has 100 unique values, t2.d2 has 50 unique values, and t3.d3 has 200 unique values.

Suppose you submit the following query:
SELECT t3.d3, COUNT (*)
FROM t1, t2, t3
WHERE t1.d1 = t2.d2
AND   t1.d1 = t3.d3
GROUP BY 1;

Assuming the join order is (t1 x t2) x t3, after the first join, the derived statistics subsystem adjusts the number of unique values of t1.d1 to the minimum number of unique values (t1.d1, t2.d2) = 50.

After the second join, the derived statistics subsystem adjusts the number of unique values of t3.d3 to 50, which is then used in the final aggregate estimate because t3.d3 is specified as a grouping column.

Using Join Predicate Redundancy to Derive Column Demographics After Each Binary Join

Some join predicates can become redundant after the system has made one or more binary joins. A predicate is redundant when it adds nothing to the overall selectivity for a query because its effect is equivalent to one or more other predicates that are also specified in that query.

The redundancy can result from either transitive closure or from user-defined query conditions. If such redundant predicates are not properly identified and handled during cardinality estimates and costing, they can have a negative effect on the optimization of the query, possibly leading to non-optimal query plans.

For example, suppose you submit the following query:

SELECT *
FROM t1, t2, t3
WHERE t1.d1 = t2.d2
AND   t1.d1 = t3.d3;

Transitive closure derives a new predicate t2.d2 = t3.d3 for this request. Assuming the join order (t1 x t2 -->   j1) x t3, the join predicates for the second join are j1.d1 = t3.d3 and j1.d2 = t3.d3.

For more information about transitive closure, see Predicate Simplification.

While estimating the number of unique values of (d1,d2) for join j1, the Optimizer needs to be aware that these 2 columns had already been equated in the previous join, and it should not multiply their individual unique values to get the combined number of unique values. Instead, it should use MIN(j1.d1, j1.d2) as the combined number of unique values.

The derived statistics infrastructure builds the appropriate entries by combining all the connected joined columns into EquiSets after every binary join. This way, the subsequent joins seamlessly handle the redundancy.

In join planning, an EquiSet is a set of columns that was equated in predicates from a previous join operation in the same query. Propagating EquiSets to subsequent join operations for reuse in a query is a fundamental component of the derived statistics framework.

For example, in the preceding example, an EquiSet derived statistics entry (j1.d1, j1.d2) exists after join j1, with the minimum number of values of d1 and d2. When the subsequent join requests the unique values for the combination (j1.d1, j1.d2) the Optimizer automatically uses the existing EquiSet entry.

Handling Multiple Sources of Information to Derive Column Demographics

As demonstrated in the preceding topics, information such as the number of unique values and the high modal frequency can be taken or derived from multiple sources such as single-table join indexes, multicolumn statistics, dynamic AMP samples, and interval histograms, among others.

It is not trivial to determine which of the available sources is likely to be the most useful or trustworthy. The following criteria all play an important role in making the determination:
  • The information source captures the highest correlation.
  • The information source covers the greatest number of single-table predicates.
  • The information source is the most current (or, phrased another way, the least stale) source of information available.

For example, it is possible to have a join index producing non-correlated demographic information covering multiple single-table predicates, while at the same time there might be a multicolumn statistic covering one single-table predicate that produces highly correlated information for a join column.

Because of issues such as this, it is not possible to define precedence rules based on heuristics. To work around this restriction, the Optimizer quantifies the information and defines precedence dynamically based on the derived quantities.
  • The first two criteria, highest correlation and greatest number of single-table predicates, can be quantified by using the number of unique values.

    This translates to using the information source that provides the fewest unique values.

  • The third criterion, least stale statistics, is also considered and explained in Stale Statistics.

Propagating Column Demographics To All Temporary and Committed Joins

Column demographics are propagated to all temporary and committed joins using the derived statistics framework. As a result, demographics can be dynamically adjusted after each join, and the Join Planner does not need to reference base table interval histograms to retrieve the statistics.

Propagating Column Demographics For Materialized Instances Within a Query Block

If you specify complex predicates or outer joins in a query, they might become materialize into spool relations while the Optimizer is performing join planning within a query block.

Suppose you submit the following outer join query with a complex ON clause predicate:

SELECT *
FROM t1 INNER JOIN t2 ON (x1=x2)
        LEFT OUTER JOIN t3 ON (x2 NOT IN (SELECT x4
                                          FROM t4)
                                      OR y3=10);

The ON clause predicate x2 NOT IN (SELECT x4 FROM t4) OR y3=10 in this query makes this a complex outer join (if a subquery is specified in an ON clause, the predicate is classified as either semicomplex or complex).

One of the requirements for processing a complex outer join is to have only one left relation, so the left relations (t1, t2) are materialized into a single spool relation before the system processes the complex join. After materializing the left relations, derived statistics propagates the projected column demographic information for the materialized instance to the subsequent stages of join planning.

Propagating Column Demographics For Aggregate and Ordered-Analytic Function Estimates

To be able to do aggregate and ordered-analytic cardinality estimates after a series of joins, the demographic information of the grouping columns needs to be propagated through the join operations until final aggregations are done. The derived statistics subsystem adjusts the demographic information based on single-table and join predicates and then propagates it forward until final aggregation completes. For example, suppose you submit the following query:

SELECT t1.x1, t2.x2, COUNT (*)
FROM t1, t2
WHERE t1.x1 =10
AND   t1.y2 = t2.x2
GROUP BY 1, 2;

In the prejoin planning stage, derived statistics detects the single-table equality term on column x1 (t1.x1=10) and adjusts the number of unique values to 1. Also while doing join planning, the number of unique values of t2.x2 is adjusted to the minimum number of values of (t1.y2, t2.x2). This improves the aggregate cardinality estimates, which can benefit the overall plan if this query block is a spooled derived table or a view.

Propagating Column Demographics Across Query Blocks

Suppose you submit the following query:

SELECT *
FROM t1, (SELECT x2, x3, COUNT(*)
          FROM t2, t3
          WHERE x2=10
          AND   t2.x3=t3.x3
          GROUP BY 1, 2) AS dt
WHERE t1.x1=dt.x3;

The derived statistics framework carries the column demographic information of the derived table dt for columns x2 and x3 by making adjustments based on the derived table predicates. The adjusted demographics are then propagated to the join planning for the outer block.

Using Histograms to Estimate Cardinalities

An expression that can use histograms for cardinality estimation can contain either a unary or a binary operator. One of the operands must be either a column or a built-in functions or SQL operator from the following lists:
  • UPPER
  • LOWER
  • NULLIFZERO
  • ZEROIFNULL
  • SUBSTR
  • MOD
  • CONCAT on columns of same table
  • Implicit or explicit data type conversions on a column
The supported operators are the following:
  • =
  • >
  • <
  • <>
  • IS NULL
  • IS NOT NULL
  • LIKE
  • NOT LIKE
  • BETWEEN
  • IN
For binary operators, the other operand can be a simple expression involving any of the following:
  • Constants whose value can be calculated by the Parser.
  • System USING request modifier data or built-in function data such as a CURRENT_DATE value.
  • Simple expressions involving another column from the same table.

    For example, the selectivity of the single-table predicate t1.x1 = t1.y1 can be estimated more reasonably by considering the overlapping values of those 2 columns than by using a default selectivity formula.

Selectivity estimates for LIKE predicates is limited to "abc%" patterns and very conservative estimation formulas.

The Optimizer can use a multicolumn histogram for a group of predicates when the following statements are all true:
  • The predicates are specified on the first n fields of the multicolumn histogram.

    This rule exists because there is an ordering dependency of the fields of a multicolumn histogram.

  • The predicates must specify an equality condition except for the first field of the multicolumn histogram.
  • If the predicate on the first column of a multicolumn is a non-equality condition, then the Optimizer uses the multicolumn histogram for this predicate only.

For example, a histogram on (x, y, z) can be used to estimate the selectivity for predicate x>100 as well as x=10 AND y=20.

The Optimizer can also use the data derived by date extrapolation to enhance its cardinality estimates for date-related predicates (see Using Extrapolation to Replace Stale Statistics).

Combining Selectivities

The Optimizer can detect independent columns and calculate their combined selectivity as the product of their individual selectivities. The following categories of independent columns are defined:
  • Every value of one column maps to all values of the other column.

    In this case, the number of combined values is the product of the number of values of individual column. Take column nation and column market-segment of the customer table as an example.

    Each nation participates in the business of all market segments, while business in each market segment is provided in all countries.

  • The value of one column is not constrained by the value of the other column, but given a value of the first column of 2 independent columns, the possible value of the first column is evenly distributed across all values of the second column.

    Column market-segment and column account-balance of the customer table are an example of such independent columns. On the one hand, the account balance is not bounded by a specific market segment even though one account balance, say $100.23, does not exist in all market segments.

There are three cases where the Optimizer cannot detect column independence.
  • When the total size of all columns exceeds the histogram data row size limit of 16 bytes. In this case, multicolumn histogram does not have all the information necessary to make an accurate determination of column independence, so the test for independence could fail.
  • When independence detection is activated only when the confidence of the individual selectivities is high. That is, for predicates whose selectivity is not high confidence because no statistics have been collected on the columns, or because an expression is complex, the Optimizer does not attempt to detect column independence.
  • When independence detection is activated only when the selectivity estimation of a predicate on an individual column is based on the base table histogram.

Finding the Number of Unique Values in a Set of Join or Grouping Columns

Join cardinality estimates, rows per value estimates, skew detection, aggregate estimates, Partial Group By estimates, and several other computations all require estimating the number of unique values for a given set of join or grouping columns.

The Optimizer makes an exhaustive search of all possible combinations of statistics to determine the best set of non-overlapping statistics. Once it has been decided, that set is then used to find the Number of Unique Values, High Modal Frequency, and High AMP Frequency values at all stages of the optimization process.

The major goals of this algorithm are as follows:
  • Find the set of non-overlapping combinations of statistics that has the least number of unique values.

    If no complete coverage can be found, find the set of non-overlapping combinations of statistics that covers the largest number of columns and has the least number of unique values.

  • Find the set of non-overlapping combinations of statistics that provides the highest High Modal Frequency and High AMP Frequency values by covering all the columns.

The set of combinations that provides the smallest Number of Unique Values might not be the same set that provides the best estimates for the High Modal Frequency and High AMP Frequency values because the High Modal Frequency and High AMP Frequency might not be available for some of the entries that are derived from sources other than base table interval histograms.

The number of unique values determined from partial statistics, where statistics do not cover all the hash or join columns, is considered for Rows Per Value and join cardinality estimates because it is a conservative approach, but the Optimizer does not consider this estimate for skew adjustment and Partial GROUP BY because it makes very aggressive cost estimates and, as a result, can cause performance problems when the costing estimates are grossly in error because of the overly aggressive method by which they were calculated. In other words, the Optimizer avoids Partial GROUP BY plans and does not attempt to make skew adjustments when the given hash or join columns are not fully covered by statistics.

The unique values discovery algorithm provides the following information (see EXPLAIN Confidence Levels for the definitions of the various Optimizer confidence levels for making cardinality estimates):
  • MinVals and its confidence level

    This estimate provides the absolute minimum number of values for the given collection of columns. The values are taken from a single derived statistics entry that covers the largest number of columns. If there are multiple entries that cover the same number of columns, the Optimizer selects the entry with the highest number of values.

    The confidence levels for various entries are described in the following table.

    IF a usable derived statistics entry is … The confidence level is …
    found High confidence
    not found No confidence
  • BestVals and its confidence level
    This provides the best number of values estimate that can be derived from the set of derived statistics entries that meets both of the following criteria.
    • Covers the greatest number of columns
    • Produces the least number of values

    Derived statistics entries that are either the same set, or a subset, of the given column collection are used to produce these values.

    The values can even be taken from a set of derived statistics entries that covers only a portion of the columns in some cases. This can happen when, for example, there are insufficient statistics to cover all the columns.

    The confidence levels for various entries are described in the following table:

    FOR this derived statistics entry situation … The confidence level is …
    a single entry covers all the columns High confidence
    either of the following.
    • multiple entries must be combined to cover all the columns
    • only a subset of the columns is covered
    Low confidence
    no usable derived statistics entries are found No confidence
  • MaxVals and its confidence level

    This estimate provides the maximum number of possible values for the given collection of columns. The derived statistics entries that are either a subset, a superset, an EquiSet, or intersecting entries are all considered for producing these values.

    If all columns are not covered by these entries, then default estimates based on the domain type are used for the non-covered columns, as described in the following table.

    Derived Statistics Entry Situation Confidence Level
    a single entry covers all the columns High confidence
    multiple entries must be combined to cover all the columns Low confidence
    default estimates are used to compute the values No confidence

If MinVals is found with a High or Low level of confidence, then the Optimizer can always determine a BestVals statistic. However, it is also possible to determine MaxVals with High or Low confidence, but not be able to determine a BestVals or MinVals.

The Optimizer uses BestVals for its join cardinality and RowsPerValue estimates if it has Low or High confidence.

Partial GROUP BY, aggregations, and skew detection always use MaxVals.

The values are combined in 2 levels using a stochastic model, as follows:

  1. The values from the same source are combined.

    The number of combined values is limited to the total rows of the source.

  2. The combined values from different sources are themselves combined to get the final values.

    These are limited based the total rows for the current set.

The system consumes the EquiSet entries and then derives additional combinations if any multicolumn statistics intersect with the EquiSets. For example, if the usable entries for the given hash columns (x1, x2, y2) are EquiSet [x1, x2] and (x2, y2)], then the Optimizer augments the multicolumn statistics entry (x2, y2) with other EquiSet columns, producing the new entry (x1, x2, y2) dynamically.

The following examples describe the different possibilities.

For the first example, consider the following derived statistics entries:

Column Set Number of Unique Values
(a1, b1) 10
(b1, c1) 15
(a1, b1, c1) 20

If the hash columns are (a1, b1, c1), then the Optimizer derives the following values and confidence levels:

Statistic Value Confidence Level
MinVals 20 High confidence
BestVals 20 High confidence
MaxVals 20 High confidence

For the second example, consider the following derived statistics entries:

 Column Set Number of Unique Values
(a1, b1) 10
(c1) 5

If the hash columns are (a1, b1, c1, d1), then the Optimizer derives the following values and confidence levels:

Statistic Value Confidence Level
MinVals 10 High confidence
BestVals 50

Calculated from the product of the numbers of unique values for the column sets: 10 x 5 = 50.

Low confidence
MaxVals Combination of 50 and the demographic estimate for d1. No confidence

For the last example, consider the following derived statistics entry:

Column Set Number of Unique Values
(a1, b1, c1, d1) 100

If the hash columns are (a1, b1, c1), then the Optimizer derives the following values and confidence levels:

Statistic Value Confidence Level
MinVals TotalRows No confidence
BestVals TotalRows No confidence
MaxVals 100 Low confidence