Uses of Dynamic AMP Sampling by Teradata Database
The Optimizer generally relies on statistics collected on primary indexes and row partitions along with UDI counts to make cardinality estimates rather than using dynamic AMP samples because the data provided by UDI counts tends to be accurate than that obtained from dynamic AMP samples. However, there continue to be cases where dynamic AMP sampling has to be relied upon to make cardinality estimates.
When there are no statistics available to quantify the demographics of a column set or index, the Optimizer can select a single AMP to sample dynamically for statistics using an algorithm based on the table ID.
By inference, these numbers are then assumed to represent the global statistics for the column or index.
The cardinality estimates collected by a dynamic AMP sample are stored in the NumRows column of the file system data block descriptor (DBD) for the table as well as in the OneAMPSampleEst and AllAMPSampleEst fields when collected from the primary index or from PARTITION statistics for a table.
Teradata Database collects a new dynamic AMP sample each time a given DBD is fetched, but if interval histogram statistics exist for the columns or indexes of interest, they override the statistics collected by this dynamic AMP sample by default.
Note that the statistics collected by a dynamic AMP sample apply to indexed columns only. If you do not collect statistics on non-indexed columns, then the Optimizer uses various situation-specific heuristics to provide arbitrary cardinality estimates.
Do not think of dynamic AMP samples as a substitute for collecting full-table statistics. For example, while it is true that a dynamic AMP sample generally produces reasonable estimates of the cardinality and number of unique values for a column or index set, it does not generally produce a good estimate of the selectivity of a given predicate in a query.
Assumptions Underlying the Method
The fundamental, closely related, assumptions underlying dynamic AMP sampling are the following:
- The sampled table has enough rows that a snapshot of any one AMP in the system accurately characterizes the demographics for the entire table across all AMPs.
An important implication of this assumption is that dynamic AMP samples can be poor estimators of the true population statistics of a table whenever the cardinality of the table is less than the number of AMPs on the system. Because of this, you should always collect statistics on small tables.
- The rows of the sampled table are distributed evenly, without skew.
If the data is skewed, a dynamic AMP sample can provide significantly erroneous demographics. System UDI counts provide the Optimizer with far better cardinality estimates than dynamic AMP sampling can for tables with skewed distributions.
- The rows of the sampled table are not compressed.
If the columns of an index or column set are compressed using multi-value compression, algorithmic compression, block-level compression, or row-compressed hash and join indexes, the demographics returned by a dynamic AMP sample are not reliable because the Optimizer tends to underestimate the cost of reading a compressed table.
Even though dynamic AMP sampling provides reasonable cardinality estimates for many scenarios, it is prone to significant errors for tables compressed using algorithmic compression, block-level compression, column-partitioned tables, and compressed hash and join indexes, and are ultimately unusable.
For cardinality estimates on these sorts of tables, system UDI counts can provide the Optimizer with accurate change data based on the last time their statistics were updated.
- The rows are not from a column-partitioned table.
As noted in the previous bullet, the statistics returned by a dynamic AMP sample of the rows of a column-partitioned table, like those returned from compressed column sets, are not reliable, and system UDI counts provide consistently more accurate cardinality estimates.
When these assumptions are valid, the cardinalities estimated using dynamic AMP sampling are generally fairly accurate. Sampled statistics do not work well with the partitioning columns of row-partitioned tables, however, and should not be used to collect statistics for them.
Estimating Cardinalities From a Single-AMP Dynamic Sample
The process used to gather statistics on an indexed column using a dynamic AMP sample is as follows:
- Select an AMP to be used to collect a cardinality estimate by hashing on the value of the table ID to generate an AMP number.
- Read the master index to locate the cylinders containing data for the desired table, column, or index.
- Count the number of cylinders that contain the desired data.
- Randomly select one of those cylinders and read its cylinder index to locate the data blocks containing data for the desired table, column, or index.
- Count the number of data blocks in the cylinder that contain the desired data.
- Randomly select one of those data blocks and count the number of rows it contains.
- Compute an estimate of the cardinality of the table using the following equation:
Table cardinality, the average cardinality per value, and, for indexes, the number of distinct values, the average cardinality for each index, and the average size of each index on each AMP are the only statistics estimated by a dynamic AMP sample.
In the case of a NUSI, the cardinality estimate is for the number of index rows that correspond to the number of distinct values in the index.
As a result, the potential for error introduced by dynamic AMP sampling from a small table (too few rows per AMP to provide an accurate measure) or a skewed value set is much higher than the estimates gained by the full-table statistics gathered using the Optimizer form of the COLLECT STATISTICS statement. (For a list of those statistics, see Statistics and Demographics Collected and Computed.) The term skewed here means having outlier values that bias the value distribution in the statistical sense. It does not refer to an unbalanced distribution of table rows among the AMPs of a system.
Because the Optimizer understands that the statistics and demographic information it collects using a single AMP sample is less reliable, it assumes Low confidence (as reported by an EXPLAIN of a query where dynamic AMP sampling would be used) in the outcome and is less aggressive in its pursuit of optimal query plans, particularly join strategies.
Actually, the EXPLAIN text would report No confidence for this example because there are no statistics on the condition. When the request is actually performed, Teradata Database would perform a dynamic AMP sample, and the resulting cardinality estimate would then be expressed with Low confidence. See About Optimizer Confidence Levels for more details about Optimizer confidence levels.
This means that in some cases, the resulting query plan is more costly than a plan developed from the more extensive, accurate statistics maintained in the interval histograms by collecting full statistics as necessary.
Estimating Cardinalities From a Multiple-AMP Dynamic Sample
The only procedural difference between dynamic single-AMP and dynamic multiple-AMP samples involves the selection of which AMPs to sample in addition to the initial AMP identified by hashing the table ID.
Once the first AMP has been identified by hashing on the table ID value, the next n AMPs are selected in increasing order of their AMP IDs. Teradata Database selects sequential AMP IDs to obtain the most efficient load distribution. If the first AMP selected is near the end of the AMP ID series, the system wraps the selection to the beginning of the series.
For example, consider a 10 AMP system where the default number of AMPs to sample is 5. The first AMP selected by the table ID is 6. The AMPs selected to complete the sample of five would be those with AMP IDs of 7, 8, 9, and 0.
Similarly, given the same system but a different table ID, suppose the AMP ID selected was 9. The AMPs selected to complete the sample of five would be those with AMP IDs of 0, 1, 2, and 3.
Note that the system uses a Last Done Channel technique to reduce messaging overhead for dynamic AMP samples taken from all AMPs. This method is used when all-AMP sampling is enabled, which is the default. The following graphic illustrates the Last Done Channel all-AMP sampling method:
It is important to realize that even multiple-AMP sampling is not a replacement for collecting complete statistics.
It is equally important to understand that sampling from an AMP subset does not guarantee that the subset is representative of the data across all AMPs. It is even possible, though not likely, that a subset sample can produce statistics that are less representative of the population than statistics produced by sampling a single AMP.
Dynamic AMP Sampling of USIs
Dynamic AMP sampling assumes that the number of distinct values in a USI equals its cardinality, so it does not read the index subtable for USI equality conditions. The number of distinct values in the USI is assumed to identical to the table cardinality taken from the dynamic AMP sample on the primary index.
Because equality conditions on a unique index return only one row by definition, the Optimizer always chooses the direct USI path without costing it or using statistics. However, if a USI will frequently be specified in non-equality predicates, such as range constraints, then you should collect statistics on it.
Dynamic AMP Sampling of NUSIs
Dynamic AMP sampling for NUSIs is very efficient. The system reads the cylinder index that supports the index subtable rows on the sampled AMP and determines the number of rows on that cylinder. Except for situations where the NUSI is very nonunique, there is one subtable row for each distinct value in the NUSI. Following through on that knowledge, the sampling process assumes that each subtable row it finds translates to one index value.
The sampling process assumes the following assertions about the data are true.
- The same values occur on all AMPs
- The number of distinct values found on the sampled AMP represents the total number of distinct values for the NUSI
Because of these assumptions, dynamic AMP samples are less accurate for fairly singular NUSIs than they are for fairly nonunique NUSIs. If more than one AMP is sampled, the system calculates an average number of distinct values across all sampled AMPs.
NUSI estimates are always made after the total cardinality of the table has been estimated. The sampling process divides the estimated total cardinality by the NUSI cardinality to estimate the approximate number of rows per value in the index. If a request passes a single value in an equality condition to match to the NUSI, and skew, if present, is not extensive, then query plans are generally quite good.
The following table summarizes some cases illustrating dynamic-AMP sampled NUSI estimates where there is no significant skew in the data. Note that in the cases where there are very few rows per value in the index (c_phone and c_acctbal), the system makes less accurate cardinality estimates.
|Table Name||NUSI Column||Query Text||Actual Result Cardinality||Estimated Result Cardinality from Dynamic AMP Sample||Percent Difference||Estimated Cardinality from Collecting Full Statistics||Percent Difference||Percent Difference Between the Two Estimates|
SELECT * FROM parttbl WHERE p_type = ‘economy brushed copper’;
SELECT * FROM partsupp WHERE ps_suppkey = 555;
SELECT * FROM customer WHERE c_acctbal = 7944.22
SELECT * FROM customer WHERE c_phone = ‘25-548-367-9974’;
Sampling Efficiency for Non-Indexed Predicate Columns
While dynamic AMP sampling provides reasonable cardinality estimates for UPIs, USIs, and distinct NUSI values, it fails to provide useful cardinality estimates for unindexed predicate columns in a query. When the WHERE clause of a request specifies a predicate of zip_code=90230 and there are no statistics on the column zip_code, Teradata Database uses various heuristics to apply default selectivities to estimate default cardinalities for those columns. For this example, the heuristic is to estimate the cardinality to be 10% of the table rows.
You should collect statistics on frequently selected columns, particularly if their values are skewed, rather than relying on the default cardinality estimates.
The table on the next page shows three queries with simple selection criteria. The actual response cardinalities from the queries are presented, along with the cardinality estimates derived from the heuristic defaults, the cardinality estimates derived from the collection of full statistics, and the percentage difference between the 2 estimates. There is considerable divergence between the true cardinalities and the heuristic estimates in all cases, while the estimates made by collecting full statistics are very close to reality in all cases.
If you compare the actual cardinalities to the default estimates, you can see that with a single selection column in an equality condition, the Optimizer assumes that about 10% of the rows in the table will be returned. With 2 selection criteria, as seen in the third request in the table, the Optimizer assumes that about 7.5% of the rows will be returned. In every case, the default cardinality estimates significantly overestimate the number of rows returned, potentially leading to poor join planning if the example selection criteria were part of a more complex query where additional tables were involved in join operations with them.
If NUSIs had been defined on those columns, the Optimizer would use a dynamic AMP sample estimate of their cardinalities even if the NUSI column had not been used to develop the query plan.
For selection criteria on unindexed columns, the identical poor cardinality estimates are made whether dynamic AMP sampling is defined for one, few, or all AMPs in the system. The cardinality estimates for unindexed columns do not improve when all AMPs are sampled because Teradata Database always samples 10% of the table rows irrespective of the number of AMPs on which the sampling is done.
The significant conclusion to be drawn from these results is that it is important to collect statistics on non-indexed columns that are frequently specified in selection predicates. Dynamic AMP samples never provide adequate cardinality estimates for unindexed columns, even if those columns are not skewed.
|Query Text||Actual Table Cardinality||Actual Response Set Cardinality||Heuristic Estimate of Response Set Cardinality||Percentage Difference||Full Statistics Estimate of Response Set Cardinality||Percentage Difference|
SELECT * FROM lineitem WHERE l_commitdate = ‘1998-01-06’;
SELECT * FROM partsupp WHERE ps_availqty = 7874;
SELECT * FROM parttbl WHERE p_size = 22 AND p_brand = ‘Brand#23’;