The Optimizer acquires derived statistics in several different ways. The next several topics introduce the more common acquisition methods of deriving statistics.
Using SingleTable Sparse Join Indexes to Derive Column Demographics
If there are singletable sparse join indexes that either fully or partially cover the singletable predicates of a relation, the Optimizer can use them to derive the demographics of the residual columns. This mechanism provides an automated way of deriving column correlations using singletable join indexes.
For example, suppose you have defined the following join index:
CREATE JOIN INDEX ji AS SELECT c1, d1 FROM t1 WHERE x1 > 10 AND y1 =10;
You then submit the following query:
SELECT t1.c1, COUNT (*) FROM t1, t2 WHERE t1.d1 = t2.d2 AND t1.x1 > 10 AND t1.y1 = 10 GROUP BY 1;
Using the statistics on ji, the system can derive the demographics of the columns t1.c1 and t1.d1 after applying the singletable predicates t1.x1 > 10 and t1.y1 = 10. The newly derived demographics can then be used in join planning for the join column t1.d1 and in aggregate estimates to determine the number of groups for the grouping column t1.c1.
Note that partially covering join indexes can also used to derive the demographics in all applicable cases. For example, suppose you define the following join index.
CREATE JOIN INDEX ji AS SELECT c1, d1 FROM t1 WHERE x1 > 10;
You then submit the following query:
SELECT t1.c1, COUNT (*) FROM t1, t2 WHERE t1.d1 = t2.d2 AND t1.x1 > 10 AND t1.y1 = 10 GROUP BY 1;
The demographic information from the partially covering join index is used for columns c1 and d1. If the demographic information is available from both the base table and the join index, then the join index information is given higher precedence because it also captures the correlation information with respect to some singletable predicates.
Using Aggregate Join Indexes to Derive Column Demographics
The Optimizer can derive the unique values of grouping columns from aggregate join indexes. The derived values can then be used in join planning. This is another strategy to derive the column correlation information among the grouping column set automatically, and also after applying the singletable predicates.
For example, suppose you have defined the following aggregate join index:
CREATE JOIN INDEX ji AS SELECT a1, SUM (b1) FROM t1 WHERE x1 > 10 AND y1 = 10 GROUP BY c1, d1;
You then submit the following query:
SELECT * FROM t1, t2 WHERE t1.c1 = t2.c2 AND t1.d1 = t2.d2 AND x1 > 10 AND y1 = 10;
The cardinality of the aggregate join index is the number of unique values of the columns (t1.c1, t2.d1). If the unique values for these columns are already available from base table statistics, the system updates them with the information from the aggregate join index; otherwise, it creates new derived statistics entries. The unique values can then be used in join planning for estimating join cardinality, rows per value, skew detection, and so on.
Note that partially covered aggregate join indexes are also handled in the same way as nonaggregate singletable sparse join indexes, and that the statistics from aggregate join indexes are used automatically if that index replaces a derived table or view.
Using CHECK Constraints and Referential Integrity Constraints to Derive Column Demographics
CHECK and referential integrity constraints can also provide useful information to the Optimizer in the absence of statistics. For example, consider a column called patient_gender, which can have only 2 values: M and F. Generally, users enforce the sanity of this column with a database CHECK constraint such as patient_gender IN ("M","F").
If there are no statistics on this column and a query specifies the predicate patient_gender = 'M', the system would normally assume that 10% of the rows qualify by default because of the equality condition, but because this column can have only 2 values, a more reasonable default is 50%.
The Optimizer also considers CHECK constraints such as open or closed ranges, and IN lists when statistics are not available. Open ranges help in some scenarios when a query predicate can close the range. For example, if a CHECK constraint specifies the open range x1 > 10, and a query specifies the predicate x1 < 20, their combination produces a closed range that can then be used to derive the column demographics.
Similarly, if there are no statistics on the child table in a Primary KeyForeign Key relationship, and statistics exist for the parent table PK column, many of its demographics, such as number of unique values, can be derived and used for the FK column in the child table. This information can be used for singletable estimates and in join planning for join cardinality and rows per value estimates. These kinds of estimates can also be useful for large child tables in those situations where collecting statistics is prohibitively expensive.
This mechanism can be effectively used in conjunction with nonenforced referential integrity constraints (also known as soft referential integrity constraints) to provide child table demographic information to the Optimizer without collecting statistics.
Note that this mechanism assumes that the child table has all the values of the parent table. If there is only a subset of values, or if the value set in this join column is significantly skewed, then you should collect statistics on the child column to avoid skewed redistributions and underestimates of cardinalities.
The derived statistics framework collects this kind of information from CHECK and referential constraints in the prejoin planning stage of query optimization and propagates them from there to the join planning stage.
Using SingleTable Predicates to Derive Column Demographics
Singletable estimation logic determines the probable number of qualified unique values and qualified high modal frequency, along with the number of qualified rows, for every predicate. The Optimizer can use this information for cases where a column specified in the singletable predicate is also specified in subsequent operations such as joins and aggregations.
 Using the base table statistics
 Using the statistics on a singletable join index
By using the base table statistics, Vantage can derive the column demographics using the singletable predicate on the column specified in the predicate. However, the derived information might not be adjusted later based on other singletable predicates. This is in line with the assumption of column independence in the absence of column correlation information.
For example, suppose you have the following query:
SELECT * FROM t1, t2 WHERE t1.d1 = t2.d2 AND t1.d1 BETWEEN '19990101' AND '20000101' AND t1.c1 > 10;
Using base table statistics, the qualified number of unique values and qualified high modal frequency are derived and saved in the derived statistics entry for column t1.d1.
On the other hand, by using the available singletable join indexes, the Optimizer can get the demographics of the columns after applying all singletable predicates.
For example, suppose you have defined the following join index to support the previous query:
CREATE JOIN INDEX ji AS SELECT t1.d1 FROM t1 WHERE t1.d1 BETWEEN '19990101' AND '20000101' AND t1.c1 > 10;
If an interval histogram is available for the join index column t1.d1, then it is given higher precedence and captured in the derived statistics because it reflects the demographics of the column t1.d1 after both singletable predicates have been applied. See Using Join Index Statistics to Estimate SingleTable Expression Cardinalities for information about how the Optimizer can use singletable join index statistics to estimate the cardinalities of complex expressions that reference base table columns.
The derived information is then reused in join planning stages such as join cardinality estimation, rows per value estimation, and skew detection.
In general, the Optimizer considers any predicate that can be used for cardinality estimates for a query plan with the exception of complicated predicates. The following table provides some examples of predicates the Optimizer does consider and the statistics it uses to make cardinality estimates for those predicates.
Predicate  Definition 

IS NULL  Selfdefined. 
Selfdefined.  
IS NOT NULL 

TotalRows  Number of Nulls  
Single equality condition  Selfdefined. 


Index with equality conditions on all columns  Selfdefined. 
Selfdefined.  
NE  TotalValues  1 
HighModeFreq of the distribution excluding the HighModeFreq for this value.  

Values count derived from qualified intervals. 
HighModeFreq from the qualified intervals.  
BETWEEN  LowValues + HighValues  TotalValues 
HighModeFreq from the qualified range intervals.  

The number of elements in the ORed list. 
The maximum frequency of all the ORed elements.  
NOT IN term  TotalValues  Number of elements in NOT IN list 
The HighMode after excluding all IN list elements. 
Examples of a term type that is not considered are ORed terms that reference multiple columns such as x1=10 OR y1>20.
Using SingleTable Predicates to Derive Column Demographics in the Absence of Statistics
If singletable predicates such as EQ, IN lists, ORed predicates or closed ranges, are specified in a query, Vantage can derive some information from them, such as the number of unique values, and the derived information can then be used later in join planning if column statistics are not available.
An example of where such statistics would be derived is the following query:
SELECT * FROM t1, t2 WHERE t1.d1 = t2.d2 AND t1.d1 in (10, 20);
Derived statistics captures and propagates the number of unique values information to the Join Planner based on certain singletable predicates even when no statistics have been collected on this column.
Using SingleTable Predicates and Multicolumn (PARTITION) Statistics to Derive Column Demographics
The demographics of the join columns are required after applying the singletable predicates to a join request. The Optimizer can use multicolumn statistics to derive the demographics of the join columns using some singletable predicates. This is another automated way of deriving column correlations using multicolumn statistics.
For example, suppose you submit the following query:
SELECT * FROM t1, t2 WHERE t1.d1 = t2.d2 AND t1.x1 > 10;
The Join Planner needs to know the demographics of the join column t1.d1 after applying the singletable predicate t1.x1>10. Derived statistics derives the demographics of t1.d1 if there are multicolumn statistics (x1, d1) on t1. For PPI tables, if there is row partition elimination and the multicolumn PARTITION statistics are available, then the demographics of the join columns are derived based on the qualified partitions.
Note that the order of the columns in multicolumn statistics is important for these estimates. The columns are reordered internally based on their ascending field IDs irrespective of the order you specify for collecting multicolumn statistics. So if column d1 has smaller field ID than column x1, the multicolumn statistics are ordered internally as (d1, x1). It is not possible to derive the demographics of d1 for a given predicate on x1. Note that for multicolumn PARTITION statistics the PARTITION column is always the first column because it always has an internal field ID of 0.
Some of the singletable predicates the Optimizer considers for deriving the column correlations are equality conditions (x1=10), IN lists (x1 IN (10, 20, 30)), simple ORed lists on the same columns (x1=10 OR x1=20), and range predicates (x1 BETWEEN 10 AND 30, x1>10), and so on.
Also see Using Join Index Statistics to Estimate SingleTable Expression Cardinalities for information about how the Optimizer can use singletable join index statistics to estimate the cardinalities of complex expressions that reference base table columns.
Using SingleColumn and Multicolumn Statistics to Discover Hierarchical Relationships in Denormalized Schemas
The derived statistics framework discovers hierarchical relationships for denormalized tables using statistics. The relationships are saved as value mappings in the derived statistics. The only mappings used to adjust demographics are those with 1 > n relationships.
A combination of single and multicolumn statistics is required to detect these relationships.
A change to a single column cascades through the entire hierarchical chain. For example, if the value mapping for x > y is 1 > 5 and the value mapping for y> z is 1 > 10, then if one value is removed from x, 5 values are removed from y, and 10 values are removed from z.
More concretely, if the relationship between region > nation is discovered to be 1 > 5, then if one region is selected, the Optimizer detects that only 5 nations qualify.
In other words, if one region is disqualified by a singletable predicate, the Optimizer removes 5 nations from the nation column.
The following example illustrates this logic. Consider the denormalized dimension table customer_nation_region, which is defined as follows:
CREATE SET TABLE cust_nation_region ( c_custkey INTEGER, c_name VARCHAR(25) CHARACTER SET LATIN CASESPECIFIC, c_address VARCHAR(40) CHARACTER SET LATIN CASESPECIFIC, c_nationkey INTEGER, c_phone CHAR(15) CHARACTER SET LATIN CASESPECIFIC, c_acctbal DECIMAL(15,2), c_mktsegment CHAR(10) CHARACTER SET LATIN CASESPECIFIC, c_comment VARCHAR(117) CHARACTER SET LATIN CASESPECIFIC, c_maritalstatus CHAR(1) CHARACTER SET LATIN NOT CASESPECIFIC, n_name CHAR(25) CHARACTER SET LATIN CASESPECIFIC, n_comment VARCHAR(152) CHARACTER SET LATIN CASESPECIFIC, r_regionkey INTEGER, r_name CHAR(25) CHARACTER SET LATIN CASESPECIFIC, r_comment VARCHAR(152) CHARACTER SET LATIN CASESPECIFIC) PRIMARY INDEX (c_custkey);
 Singlecolumn on r_regionkey
 Singlecolumn on r_regionkey
 Singlecolumn on r_regionkey
 Multicolumn on (r_regionkey, n_nationkey)
 Multicolumn on (n_nationkey, c_custkey)
 region > nation is 1 > 5
 nation > customer is 1 > 24,000
EXPLAIN SELECT n_nationkey FROM cust_nation_region WHERE r_regionkey = 1 GROUP BY 1;The following is a portion of the EXPLAIN output:
... 2) Next, we lock TPCD_OCES3.cust_nation_region for read.3) We do an allAMPs SUM step to aggregate from TPCD_OCES3.cust_nation_region by way of an allrows scan with a condition of ("TPCD_OCES3.cust_nation_region.R_REGIONKEY = 1"), grouping by field1 ( TPCD_OCES3.cust_nation_region.N_NATIONKEY). Aggregate Intermediate Results are computed globally, then placed in Spool 3. The size of Spool 3 is estimated with low confidence to be 5 rows (525 bytes). The estimated time for this step is 0.62 seconds. > The total estimated time is 0.63 seconds.
From the known hierarchical relationship between region and nation, which is 1 > 5, the predicted number of nationkey rows is 5, which is verified by the EXPLAIN text highlighted in boldface type. By generalizing, if you were to write a request that selected 2 regionkey values, there would be 10 nationkey rows in the result set, and so on.
Using Subset and Superset Statistics to Derive Column Demographics
If there are multiple join predicates, and the join columns have some degree of correlation, multicolumn statistics are necessary to more accurately estimate join cardinality, rows per value, skew detection, and related statistics. Without having multicolumn statistics, and assuming column independence, the Optimizer multiplies the individual numbers of unique values to get an estimate of the combined number of unique values. However, if multicolumn statistics exist that are a superset of the join columns, they might provide more useful information about the column correlations than the column independence assumption would provide.
For example, consider the following query:
SELECT * FROM t1, t2 WHERE t1.d1 = t2.d2 AND t1.x1 = t2.x2;
The system requires multicolumn statistics on (t1.d1, t1.x1) and (t2.d2, t2.x2) to be able to make accurate cardinality and costing estimates. If there are individual statistics on t1.d1, with 100 unique values, and t1.x1, with 50 unique values, and assuming column independence, the unique values are multiplied to calculate the combined maximum number of unique values, 100 * 50 = 5,000, and having an upward bound of the total number of rows.
The derived statistics subsystem identifies such opportunities in the prejoin planning stage and derives dummy multicolumn statistics based on super sets. For example, if there is a multicolumn statistics collection that is a superset, such as (x1, y1, d1), with only 2,000 unique values, the Optimizer can cap the estimated unique value cardinality of 5,000 to a more accurate 2,000 by deriving a derived statistics entry (x1, y1) with 2,000 unique values, and then propagate the statistic to Join Planning.
Similarly, if there are individual entries and multicolumn statistics entries such as (x1), (d1) and (x1, d1) and singletable predicates such as x1 IN (1,2) AND d1 < 20, the Optimizer can update the demographics of the singlecolumn entries while making its singletable estimates. Based on the updated information from the singlecolumn entries, the multicolumn entry of derived statistics is updated and propagated.
Using Join Predicates to Derive Column Demographics
The Optimizer assumes join uniformity when it estimates join cardinalities and costs. In other words, it assumes that every value from the left relation in the join finds a match in the right relation if the number of values in the left relation is less than the number of right values in the right relation, and vice versa.
Based on this assumption, and given the join predicate t1.d1 = t2.d2, if t1.d1 has 100 unique values and t2.d2 has 50 unique values, then after the join, t1.d1 should have only 50 unique values. The Optimizer also considers multicolumn statistics for making adjustments. For example, given the following predicate and (x1, y1) with 100 unique values and (x2, y2) with 50 unique values, after the join, the number of (x1, y1) unique values is adjusted to 50.
t1.x1=t2.x2 AND t1.y1=t2.y2
The derived statistics subsystem performs this kind of analysis and adjusts the demographics of the join columns such as the number of unique values and the high modal frequency after each binary join and then propagates the adjusted demographic information to the next stages of the join planning process.
If column correlation information is available, the Optimizer uses it to adjust the other column demographics based on the adjustments to the join column. Otherwise, the join column is assumed to be independent of the other columns.
Assume that t1.d1 has 100 unique values, t2.d2 has 50 unique values, and t3.d3 has 200 unique values.
Suppose you submit the following query:SELECT t3.d3, COUNT (*) FROM t1, t2, t3 WHERE t1.d1 = t2.d2 AND t1.d1 = t3.d3 GROUP BY 1;
Assuming the join order is (t1 x t2) x t3, after the first join, the derived statistics subsystem adjusts the number of unique values of t1.d1 to the minimum number of unique values (t1.d1, t2.d2) = 50.
After the second join, the derived statistics subsystem adjusts the number of unique values of t3.d3 to 50, which is then used in the final aggregate estimate because t3.d3 is specified as a grouping column.
Using Join Predicate Redundancy to Derive Column Demographics After Each Binary Join
Some join predicates can become redundant after the system has made one or more binary joins. A predicate is redundant when it adds nothing to the overall selectivity for a query because its effect is equivalent to one or more other predicates that are also specified in that query.
The redundancy can result from either transitive closure or from userdefined query conditions. If such redundant predicates are not properly identified and handled during cardinality estimates and costing, they can have a negative effect on the optimization of the query, possibly leading to nonoptimal query plans.
For example, suppose you submit the following query:
SELECT * FROM t1, t2, t3 WHERE t1.d1 = t2.d2 AND t1.d1 = t3.d3;
Transitive closure derives a new predicate t2.d2 = t3.d3 for this request. Assuming the join order (t1 x t2 > j1) x t3, the join predicates for the second join are j1.d1 = t3.d3 and j1.d2 = t3.d3.
For more information about transitive closure, see Predicate Simplification.
While estimating the number of unique values of (d1,d2) for join j1, the Optimizer needs to be aware that these 2 columns had already been equated in the previous join, and it should not multiply their individual unique values to get the combined number of unique values. Instead, it should use MIN(j1.d1, j1.d2) as the combined number of unique values.
The derived statistics infrastructure builds the appropriate entries by combining all the connected joined columns into EquiSets after every binary join. This way, the subsequent joins seamlessly handle the redundancy.
In join planning, an EquiSet is a set of columns that was equated in predicates from a previous join operation in the same query. Propagating EquiSets to subsequent join operations for reuse in a query is a fundamental component of the derived statistics framework.
For example, in the preceding example, an EquiSet derived statistics entry (j1.d1, j1.d2) exists after join j1, with the minimum number of values of d1 and d2. When the subsequent join requests the unique values for the combination (j1.d1, j1.d2) the Optimizer automatically uses the existing EquiSet entry.
Handling Multiple Sources of Information to Derive Column Demographics
As demonstrated in the preceding topics, information such as the number of unique values and the high modal frequency can be taken or derived from multiple sources such as singletable join indexes, multicolumn statistics, dynamic AMP samples, and interval histograms, among others.
 The information source captures the highest correlation.
 The information source covers the greatest number of singletable predicates.
 The information source is the most current (or, phrased another way, the least stale) source of information available.
For example, it is possible to have a join index producing noncorrelated demographic information covering multiple singletable predicates, while at the same time there might be a multicolumn statistic covering one singletable predicate that produces highly correlated information for a join column.
 The first two criteria, highest correlation and greatest
number of singletable predicates, can be quantified by using the number of
unique values.
This translates to using the information source that provides the fewest unique values.
 The third criterion, least stale statistics, is also considered and explained in Stale Statistics.
Propagating Column Demographics To All Temporary and Committed Joins
Column demographics are propagated to all temporary and committed joins using the derived statistics framework. As a result, demographics can be dynamically adjusted after each join, and the Join Planner does not need to reference base table interval histograms to retrieve the statistics.
Propagating Column Demographics For Materialized Instances Within a Query Block
If you specify complex predicates or outer joins in a query, they might become materialize into spool relations while the Optimizer is performing join planning within a query block.
Suppose you submit the following outer join query with a complex ON clause predicate:
SELECT * FROM t1 INNER JOIN t2 ON (x1=x2) LEFT OUTER JOIN t3 ON (x2 NOT IN (SELECT x4 FROM t4) OR y3=10);
The ON clause predicate x2 NOT IN (SELECT x4 FROM t4) OR y3=10 in this query makes this a complex outer join (if a subquery is specified in an ON clause, the predicate is classified as either semicomplex or complex).
One of the requirements for processing a complex outer join is to have only one left relation, so the left relations (t1, t2) are materialized into a single spool relation before the system processes the complex join. After materializing the left relations, derived statistics propagates the projected column demographic information for the materialized instance to the subsequent stages of join planning.
Propagating Column Demographics For Aggregate and OrderedAnalytic Function Estimates
To be able to do aggregate and orderedanalytic cardinality estimates after a series of joins, the demographic information of the grouping columns needs to be propagated through the join operations until final aggregations are done. The derived statistics subsystem adjusts the demographic information based on singletable and join predicates and then propagates it forward until final aggregation completes. For example, suppose you submit the following query:
SELECT t1.x1, t2.x2, COUNT (*) FROM t1, t2 WHERE t1.x1 =10 AND t1.y2 = t2.x2 GROUP BY 1, 2;
In the prejoin planning stage, derived statistics detects the singletable equality term on column x1 (t1.x1=10) and adjusts the number of unique values to 1. Also while doing join planning, the number of unique values of t2.x2 is adjusted to the minimum number of values of (t1.y2, t2.x2). This improves the aggregate cardinality estimates, which can benefit the overall plan if this query block is a spooled derived table or a view.
Propagating Column Demographics Across Query Blocks
Suppose you submit the following query:
SELECT * FROM t1, (SELECT x2, x3, COUNT(*) FROM t2, t3 WHERE x2=10 AND t2.x3=t3.x3 GROUP BY 1, 2) AS dt WHERE t1.x1=dt.x3;
The derived statistics framework carries the column demographic information of the derived table dt for columns x2 and x3 by making adjustments based on the derived table predicates. The adjusted demographics are then propagated to the join planning for the outer block.
Using Histograms to Estimate Cardinalities
 UPPER
 LOWER
 NULLIFZERO
 ZEROIFNULL
 SUBSTR
 MOD
 CONCAT on columns of same table
 Implicit or explicit data type conversions on a column
 =
 ≥
 ≤
 >
 <
 <>
 IS NULL
 IS NOT NULL
 LIKE
 NOT LIKE
 BETWEEN
 IN
 Constants whose value can be calculated by the Parser.
 System USING request modifier data or builtin function data such as a CURRENT_DATE value.
 Simple expressions involving another column from the same
table.
For example, the selectivity of the singletable predicate t1.x1 = t1.y1 can be estimated more reasonably by considering the overlapping values of those 2 columns than by using a default selectivity formula.
Selectivity estimates for LIKE predicates is limited to "abc%" patterns and very conservative estimation formulas.
 The predicates are specified on the first n fields of the multicolumn histogram.
This rule exists because there is an ordering dependency of the fields of a multicolumn histogram.
 The predicates must specify an equality condition except for the first field of the multicolumn histogram.
 If the predicate on the first column of a multicolumn is a nonequality condition, then the Optimizer uses the multicolumn histogram for this predicate only.
For example, a histogram on (x, y, z) can be used to estimate the selectivity for predicate x>100 as well as x=10 AND y=20.
The Optimizer can also use the data derived by date extrapolation to enhance its cardinality estimates for daterelated predicates (see Using Extrapolation to Replace Stale Statistics).
Combining Selectivities
 Every value of one column maps to all values of the other column.
The number of combined values is the product of the number of values of individual column. Take column nation and column marketsegment of the customer table as an example.
Each nation participates in the business of all market segments, while business in each market segment is provided in all countries.
 The value of one column is not constrained by the value of
the other column, but given a value of the first column of 2 independent
columns, the possible value of the first column is evenly distributed across
all values of the second column.
Column marketsegment and column accountbalance of the customer table are an example of such independent columns. On the one hand, the account balance is not bounded by a specific market segment even though one account balance, say $100.23, does not exist in all market segments.
 When the total size of all columns exceeds the histogram data row size limit of 16 bytes. Multicolumn histogram does not have all the information necessary to make an accurate determination of column independence, so the test for independence could fail.
 When independence detection is activated only when the confidence of the individual selectivities is high. That is, for predicates whose selectivity is not high confidence because no statistics have been collected on the columns, or because an expression is complex, the Optimizer does not attempt to detect column independence.
 When independence detection is activated only when the selectivity estimation of a predicate on an individual column is based on the base table histogram.
Finding the Number of Unique Values in a Set of Join or Grouping Columns
Join cardinality estimates, rows per value estimates, skew detection, aggregate estimates, Partial Group By estimates, and several other computations all require estimating the number of unique values for a given set of join or grouping columns.
The Optimizer makes an exhaustive search of all possible combinations of statistics to determine the best set of nonoverlapping statistics. Once it has been decided, that set is then used to find the Number of Unique Values, High Modal Frequency, and High AMP Frequency values at all stages of the optimization process.
 Find the set of nonoverlapping combinations of statistics
that has the least number of unique values.
If no complete coverage can be found, find the set of nonoverlapping combinations of statistics that covers the largest number of columns and has the least number of unique values.
 Find the set of nonoverlapping combinations of statistics that provides the highest High Modal Frequency and High AMP Frequency values by covering all the columns.
The set of combinations that provides the smallest Number of Unique Values might not be the same set that provides the best estimates for the High Modal Frequency and High AMP Frequency values because the High Modal Frequency and High AMP Frequency might not be available for some of the entries that are derived from sources other than base table interval histograms.
The number of unique values determined from partial statistics, where statistics do not cover all the hash or join columns, is considered for Rows Per Value and join cardinality estimates because it is a conservative approach, but the Optimizer does not consider this estimate for skew adjustment and Partial GROUP BY because it makes very aggressive cost estimates and, as a result, can cause performance problems when the costing estimates are grossly in error because of the overly aggressive method by which they were calculated. In other words, the Optimizer avoids Partial GROUP BY plans and does not attempt to make skew adjustments when the given hash or join columns are not fully covered by statistics.
 MinVals and its confidence level
This estimate provides the absolute minimum number of values for the given collection of columns. The values are taken from a single derived statistics entry that covers the largest number of columns. If there are multiple entries that cover the same number of columns, the Optimizer selects the entry with the highest number of values.
The confidence levels for various entries are described in the following table.
IF a usable derived statistics entry is … The confidence level is … found High confidence not found No confidence  BestVals and its confidence level This provides the best number of values estimate that can be derived from the set of derived statistics entries that meets both of the following criteria.
 Covers the greatest number of columns
 Produces the least number of values
Derived statistics entries that are either the same set, or a subset, of the given column collection are used to produce these values.
The values can even be taken from a set of derived statistics entries that covers only a portion of the columns in some cases. This can happen when, for example, there are insufficient statistics to cover all the columns.
The confidence levels for various entries are described in the following table:
FOR this derived statistics entry situation … The confidence level is … a single entry covers all the columns High confidence either of the following.  multiple entries must be combined to cover all the columns
 only a subset of the columns is covered
Low confidence no usable derived statistics entries are found No confidence  MaxVals and its confidence level
This estimate provides the maximum number of possible values for the given collection of columns. The derived statistics entries that are either a subset, a superset, an EquiSet, or intersecting entries are all considered for producing these values.
If all columns are not covered by these entries, then default estimates based on the domain type are used for the noncovered columns, as described in the following table.
Derived Statistics Entry Situation Confidence Level a single entry covers all the columns High confidence multiple entries must be combined to cover all the columns Low confidence default estimates are used to compute the values No confidence
If MinVals is found with a High or Low level of confidence, then the Optimizer can always determine a BestVals statistic. However, it is also possible to determine MaxVals with High or Low confidence, but not be able to determine a BestVals or MinVals.
The Optimizer uses BestVals for its join cardinality and RowsPerValue estimates if it has Low or High confidence.
Partial GROUP BY, aggregations, and skew detection always use MaxVals.
The values are combined in 2 levels using a stochastic model, as follows:
 The values from the same source are combined.
The number of combined values is limited to the total rows of the source.
 The combined values from different sources are themselves
combined to get the final values.
These are limited based the total rows for the current set.
The system consumes the EquiSet entries and then derives additional combinations if any multicolumn statistics intersect with the EquiSets. For example, if the usable entries for the given hash columns (x1, x2, y2) are EquiSet [x1, x2] and (x2, y2)], then the Optimizer augments the multicolumn statistics entry (x2, y2) with other EquiSet columns, producing the new entry (x1, x2, y2) dynamically.
The following examples describe the different possibilities.
For the first example, consider the following derived statistics entries:
Column Set  Number of Unique Values 

(a1, b1)  10 
(b1, c1)  15 
(a1, b1, c1)  20 
If the hash columns are (a1, b1, c1), then the Optimizer derives the following values and confidence levels:
Statistic  Value  Confidence Level 

MinVals  20  High confidence 
BestVals  20  High confidence 
MaxVals  20  High confidence 
For the second example, consider the following derived statistics entries:
Column Set  Number of Unique Values 

(a1, b1)  10 
(c1)  5 
If the hash columns are (a1, b1, c1, d1), then the Optimizer derives the following values and confidence levels:
Statistic  Value  Confidence Level 

MinVals  10  High confidence 
BestVals  50 Calculated from the product of the numbers of unique values for the column sets: 10 x 5 = 50. 
Low confidence 
MaxVals  Combination of 50 and the demographic estimate for d1.  No confidence 
For the last example, consider the following derived statistics entry:
Column Set  Number of Unique Values 

(a1, b1, c1, d1)  100 
If the hash columns are (a1, b1, c1), then the Optimizer derives the following values and confidence levels:
Statistic  Value  Confidence Level 

MinVals  TotalRows  No confidence 
BestVals  TotalRows  No confidence 
MaxVals  100  Low confidence 