Dynamic AMP Sampling | Optimizer Process | VantageCloud Lake - Dynamic AMP Sampling - Teradata Vantage

Teradata® VantageCloud Lake

Deployment
VantageCloud
Edition
Lake
Product
Teradata Vantage
Published
January 2023
ft:locale
en-US
ft:lastEdition
2024-12-11
dita:mapPath
phg1621910019905.ditamap
dita:ditavalPath
pny1626732985837.ditaval
dita:id
phg1621910019905

Uses of Dynamic AMP Sampling by Vantage

The Optimizer relies on statistics collected on primary indexes and row partitions with UDI counts to make cardinality estimates rather than using dynamic AMP samples, because the data provided by UDI counts is typically more accurate than that from dynamic AMP samples. However, there are cases that rely on dynamic AMP sampling to make cardinality estimates.

When there are no statistics available to quantify the demographics of a column set or index, the Optimizer can select a single AMP to sample dynamically for statistics using an algorithm based on the table ID.

By inference, these numbers are then assumed to represent the global statistics for the column or index.

The cardinality estimates collected by a dynamic AMP sample are stored in the NumRows column of the file system data block descriptor (DBD) for the table and in the OneAMPSampleEst and AllAMPSampleEst fields when collected from the primary index or from PARTITION statistics for a table.

Vantage collects a new dynamic AMP sample each time a given DBD is fetched, but any interval histogram statistics for the columns or indexes of interest override the statistics collected by this dynamic AMP sample by default.

The statistics collected by a dynamic AMP sample apply to indexed columns only. If you do not collect statistics on non-indexed columns, then the Optimizer uses situation-specific heuristics to provide arbitrary cardinality estimates.

Do not think of dynamic AMP samples as a substitute for collecting full-table statistics. A dynamic AMP sample produces reasonable estimates of the cardinality and number of unique values for a column or index set, but does not produce a good estimate of the selectivity of a given predicate in a query.

Assumptions Underlying the Method

The fundamental, closely related, assumptions underlying dynamic AMP sampling are the following:
  • The sampled table has enough rows that a snapshot of any one AMP in the system accurately characterizes the demographics for the entire table across all AMPs.

    An important implication of this assumption is that dynamic AMP samples can be poor estimators of the true population statistics of a table whenever the cardinality of the table is less than the number of AMPs on the system. Therefore, collect statistics on small tables.

  • The rows of the sampled table are distributed evenly, without skew.

    If the data is skewed, a dynamic AMP sample can provide significantly erroneous demographics. System UDI counts provide the Optimizer with far better cardinality estimates than dynamic AMP sampling can for tables with skewed distributions.

  • The rows of the sampled table are not compressed.

    If the columns of an index or column set are compressed using multivalue compression, algorithmic compression, block-level compression, or row-compressed join indexes, the demographics returned by a dynamic AMP sample are not reliable because the Optimizer tends to underestimate the cost of reading a compressed table.

    Dynamic AMP sampling provides reasonable cardinality estimates, but is prone to significant errors for tables compressed using algorithmic compression, block-level compression, column-partitioned tables, and compressed join indexes, and are ultimately unusable.

    For cardinality estimates on these sorts of tables, system UDI counts can provide the Optimizer with accurate change data based on the last time their statistics were updated.

  • The rows are not from a column-partitioned table.

    As noted in the previous bullet, the statistics returned by a dynamic AMP sample of the rows of a column-partitioned table, like those returned from compressed column sets, are not reliable, and system UDI counts provide consistently more accurate cardinality estimates.

When these assumptions are valid, the cardinalities estimated using dynamic AMP sampling may be accurate. Sampled statistics do not work well with the partitioning columns of row-partitioned tables, however, and must not be used to collect statistics for them.

Estimating Cardinalities from a Single-AMP Dynamic Sample

The process used to gather statistics on an indexed column using a dynamic AMP sample is as follows:

  1. Select an AMP to be used to collect a cardinality estimate by hashing on the value of the table ID to generate an AMP number.
  2. Read the master index to locate the cylinders containing data for the desired table, column, or index.
  3. Count the number of cylinders that contain the desired data.
  4. Randomly select one of those cylinders and read its cylinder index to locate the data blocks containing data for the desired table, column, or index.
  5. Count the number of data blocks in the cylinder that contain the desired data.
  6. Randomly select one of those data blocks and count its rows.
  7. Compute an estimate of the cardinality of the table using the following equation:

Table cardinality, the average cardinality per value, and, for indexes, the number of distinct values, the average cardinality for each index, and the average size of each index on each AMP are the only statistics estimated by a dynamic AMP sample.


Estimated cardinality of table equation

In the case of a NUSI, the cardinality estimate is for the number of index rows that correspond to the number of distinct values in the index.

Therefore, the potential for error introduced by dynamic AMP sampling from a small table (too few rows per AMP to provide an accurate measure) or a skewed value set is much higher than the estimates gained by the full-table statistics gathered using the Optimizer form of the COLLECT STATISTICS statement. (For a list of those statistics, see Statistics and Demographics Collected and Computed.) Here, the term skewed means having outlier values that bias the value distribution in the statistical sense.

The Optimizer understands that the statistics and demographic information collected using a single AMP sample is less reliable, and therefore assumes Low confidence (as reported by an EXPLAIN of a query where dynamic AMP sampling is used) in the outcome and is less aggressive in its pursuit of optimal query plans, particularly join strategies.

The EXPLAIN text reports No confidence for this example because there are no statistics on the condition. When the request is performed, Vantage performs a dynamic AMP sample, and the resulting cardinality estimate is expressed with Low confidence. See About Optimizer Confidence Levels for more details about Optimizer confidence levels.

Therefore, the resulting query plan may be more costly than a plan developed from the more extensive, accurate statistics maintained in the interval histograms by collecting full statistics as necessary.

Estimating Cardinalities from a Multiple-AMP Dynamic Sample

The only procedural difference between dynamic single-AMP and dynamic multiple-AMP samples involves the selection of which AMPs to sample in addition to the initial AMP identified by hashing the table ID.

When the first AMP has been identified by hashing on the table ID value, the next n AMPs are selected in increasing order of their AMP IDs. Vantage selects sequential AMP IDs to get the most efficient load distribution. If the first AMP selected is near the end of the AMP ID series, the system wraps the selection to the beginning of the series.

Important:
  • Multiple-AMP sampling is not a replacement for collecting complete statistics.
  • Sampling from an AMP subset does not guarantee that the subset is representative of the data across all AMPs.

    Although likely, a subset sample can produce statistics that are less representative of the population than statistics produced by sampling a single AMP.

Dynamic AMP Sampling of USIs

Dynamic AMP sampling assumes that the number of distinct values in a USI equals its cardinality, and therefore does not read the index subtable for USI equality conditions. The number of distinct values in the USI is assumed to identical to the table cardinality taken from the dynamic AMP sample on the primary index.

Because equality conditions on a unique index return only one row by definition, the Optimizer chooses the direct USI path without costing the path or using statistics. However, if a USI is frequently specified in nonequality predicates, such as range constraints, you must collect statistics on it.

Dynamic AMP Sampling of NUSIs

Dynamic AMP sampling for NUSIs is efficient. The system reads the cylinder index that supports the index subtable rows on the sampled AMP and determines the number of rows on that cylinder. Except for situations where the NUSI is nonunique, there is one subtable row for each distinct value in the NUSI. Therefore, the sampling process assumes that each subtable row translates to one index value.

The sampling process assumes the following assertions about the data are true.
  • The same values occur on all AMPs
  • The number of distinct values found on the sampled AMP represents the total number of distinct values for the NUSI

Therefore, dynamic AMP samples are less accurate for singular NUSIs than for nonunique NUSIs. If more than one AMP is sampled, the system calculates an average number of distinct values across all sampled AMPs.

NUSI estimates are made after the total cardinality of the table has been estimated. The sampling process divides the estimated total cardinality by the NUSI cardinality to estimate the approximate number of rows per value in the index. If a request passes a single value in an equality condition to match to the NUSI, and skew, if present, is not extensive, then query plans are good.

The following table summarizes cases illustrating dynamic-AMP sampled NUSI estimates with no significant skew in the data. Where there are few rows per value in the index (c_phone and c_acctbal), the system makes less accurate cardinality estimates.

Table NUSI Col Query Text Result Cardinality Est. Result Cardinality (AMP Sample) % Difference Est. Result Cardinality (Full Stats) % Difference % Difference Between Estimates
parttbl p_type
SELECT *
FROM parttbl
WHERE p_type = ‘economy brushed copper’;
67,103 67,123 0.03 66,667 0.65 0.68
partsupp ps_suppkey
SELECT *
FROM partsupp
WHERE ps_suppkey = 555;
80 82 2.47 81 1.24 0.01
customer c_acctbal
SELECT *
FROM customer
WHERE c_acctbal = 7944.22
10 24 82.35 7 35.29 109.68
customer c_phone
SELECT *
FROM customer
WHERE c_phone = ‘25-548-367-9974’;
1 21 81.82 2 165.22 158.33

Sampling Efficiency for Non-Indexed Predicate Columns

Dynamic AMP sampling provides reasonable cardinality estimates for UPIs, USIs, and distinct NUSI values, but fails to provide useful cardinality estimates for unindexed predicate columns in a query. When the WHERE clause of a request specifies a predicate of zip_code=90230 and there are no statistics on the column zip_code, Vantage uses heuristics to apply default selectivities to estimate default cardinalities for those columns. For this example, the heuristic is to estimate the cardinality to be 10% of the table rows.

Collect statistics on frequently selected columns, particularly if their values are skewed, rather than relying on the default cardinality estimates.

The following table shows three queries with simple selection criteria. The response cardinalities from the queries are presented, with the cardinality estimates derived from the heuristic defaults, the cardinality estimates derived from the collection of full statistics, and the percentage difference between the estimates. There is divergence between the true cardinalities and the heuristic estimates, while the estimates made by collecting full statistics are close to reality.

Comparing the actual cardinalities to the default estimates shows that with a single selection column in an equality condition, the Optimizer assumes that about 10% of the rows in the table are returned. With 2 selection criteria, as seen in the third request in the table, the Optimizer assumes that about 7.5% of the rows are returned. In every case, the default cardinality estimates significantly overestimate the number of rows returned, potentially leading to poor join planning if the example selection criteria were part of a more complex query where additional tables were involved in join operations with them.

If NUSIs had been defined on those columns, the Optimizer uses a dynamic AMP sample estimate of their cardinalities even if the NUSI column had not been used to develop the query plan.

For selection criteria on unindexed columns, the identical poor cardinality estimates are made whether dynamic AMP sampling is defined for one, few, or all AMPs in the system. The cardinality estimates for unindexed columns do not improve when all AMPs are sampled because Vantage samples 10% of the table rows irrespective of the number of AMPs on which the sampling is done.

Conclusion: Collect statistics on nonindexed columns that are frequently specified in selection predicates. Dynamic AMP samples never provide adequate cardinality estimates for unindexed columns, even if those columns are not skewed.

Query Text Table Cardinality Response Set Cardinality Est. Response Set Cardinality

(Heuristic)

% Difference Est. Response Set Cardinality

(Full Stats)

% Difference
SELECT *
FROM lineitem
WHERE l_commitdate = ‘1998-01-06’;
300,005,811 123,959 30,017,594 198.36 124,591 0.25
SELECT *
FROM partsupp
WHERE ps_availqty = 7874;
40,000,000 3,936 3,991,526 199.61 4,000 1.62
SELECT *
FROM parttbl
WHERE p_size = 22
AND   p_brand = ‘Brand#23’;
10,000,000 8,061 750,398 195.75 7,997 0.80