How the AMP Software Collects Statistics

SQL Request and Transaction Processing

Product: Advanced SQL Engine (Teradata Database)
Release Number: 17.10
Published: July 2021
Language: English (United States)
Last Update: 2021-07-28
When you collect statistics and demographics, the AMP software creates statistics rows in DBC.StatsTbl that store various statistics and demographics summarizing the table columns specified by the COLLECT STATISTICS statement. These rows come in the following basic types:
  • Regular statistics rows, one or more per AMP
  • Null rows, zero or one per AMP
  • All-null rows, zero or one per AMP
  • Average rows per value, one per AMP

The information taken from these rows is used to populate the statistical interval histograms used by the Optimizer to make its initial cardinality estimates (see Interval Histograms).

The following statistics are global per histogram, so are reported only once:
  • Number of null rows
  • Number of all-null rows
  • Overall average of the average number of rows per value per AMP

You can report these statistics using a HELP STATISTICS request (see Teradata Vantageā„¢ - SQL Data Definition Language Syntax and Examples, B035-1144). HELP STATISTICS reports only summary statistics. To report detailed statistics, use a SHOW STATISTICS request.
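
For example, the following requests are a minimal sketch, assuming statistics have already been collected on column x1 of table t1:

HELP STATISTICS t1;                      /* summary statistics only */
SHOW STATISTICS VALUES COLUMN x1 ON t1;  /* detailed statistics, including histogram values */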

Null and All-Null Statistics and Demographics

Null and all-null statistics provide the following cardinality information:

Statistic     Definition
NumNulls      Number of rows in the column set for which one or more of the columns are null.
NumAllNulls   Number of rows in a composite column set for which all of the columns are null.

Vantage collects the NumAllNulls statistic for composite column sets only when all of the columns in the composite set are null. In other words, the only time it is possible to collect a NumAllNulls value is when you collect multicolumn statistics or collect statistics on a composite index, and none of the columns in that set have a value.
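
For example, a multicolumn collection that can produce a NumAllNulls value might be requested as follows. This is a sketch that assumes the table t_ex and its composite USI on columns b, c, and d from the example below; the statistics name Stats_bcd is illustrative:

COLLECT STATISTICS
COLUMN (b, c, d) AS Stats_bcd
ON t_ex;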

Consider the following example of what is meant by null and all-null rows. Suppose you create a 3-column USI for table t_ex on columns b, c, and d. When you collect statistics on the USI, one row is found in which all 3 columns making up the USI are null. This would be reported as an all-nulls instance in the statistics. This example is degenerate because there could never be more than one row having a USI with all null columns in a table, but it is useful as an example, even though it does not generalize.

The columns for this row would look something like the following, where nulls are represented by the QUESTION MARK character.

t_ex
   a (PI)    b (USI)    c (USI)    d (USI)    e           f
   355       ?          ?          ?          25000.00    375000.00

This contrasts with the case where statistics have been collected on a composite column set, and one or more, but not necessarily all, of the columns on which those statistics have been collected for a given row are null.

Assume the same table and index definitions as before. When you collect statistics on the USI, any row found where one or two, but not all, of the three columns making up the USI are null would be reported as an occurrence of a null column in the statistics.

All of the following rows in t_ex would be reported as null fields, because all have at least one null in one of the USI columns, while only the first row would be reported for the all-null fields statistic.

t_ex
   a (PI)    b (USI)    c (USI)    d (USI)    e           f
   355       ?          ?          ?          25000.00    375000.00   <-- all-null fields
   685       ?          155        ?          45000.00    495000.00   <-- null fields
   900       325        ?          ?          32000.00    400000.00
   275       ?          ?          760        55000.00    575000.00
   597       ?          254        893        10000.00    150000.00
   322       679        ?          204        75000.00    650000.00
   781       891        357        ?          32000.00    400000.00

All-nulls describes the case where statistics have been collected on a composite column set, and all the columns on which those statistics have been collected for a given row are null. With this information, the Optimizer can more accurately estimate the true number of unique values in a set, thus enabling more accurate join costing.

For example, suppose you have a table t1 with columns x1 and y1 and the following values.

t1
   x1    y1
   10    10
   20    ?     <-- null fields
   ?     30
   ?     ?     <-- all-null fields
   ?     ?

Again, the partially null composite fields and the all-null composite fields are marked with arrows.

If you could collect only the NumNulls statistic, then when you collected statistics on composite columns, any row with one or more null columns would simply be counted as a null row. This is not an issue for single-table cardinality estimation because a comparison against a null always evaluates as FALSE and is treated as an unmatched row.

For example, if you could collect only the NumNulls statistic, the histogram on (x1, y1) would indicate that the number of nulls is 4, the number of unique values is 2, and the total number of rows is 5. If columns x1 and y1 are used as join columns, the Optimizer would evaluate the join costing by incorrectly assuming there are only 2 unique values in the data when there are actually 4. This can cause problems in scenarios such as redistribution costing and skew detection.

To circumvent this problem, Vantage computes the true number of distinct partially null values and saves it in the histogram as NumPNulls.

The following equation calculates the true number of distinct values when partially null rows are present:

NumUniqueValues = NumValues + NumPNullValues + 1

Equation Element   Description
NumUniqueValues    the true number of distinct values, including partially null rows.
NumValues          the number of distinct values among the rows with no nulls.
NumPNullValues     the number of true distinct values among the partially null rows.
1                  an adjustment for the estimation error when the cardinality of all nulls > 0.
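
As a worked application of the equation as reconstructed above, using the t1 example: NumValues = 1 (the single fully non-null value (10, 10)), NumPNullValues = 2 (the distinct partially null values (20, ?) and (?, 30)), and the all-null cardinality is 2, which is greater than 0, so the adjustment of 1 applies. This gives NumUniqueValues = 1 + 2 + 1 = 4, matching the true number of unique values cited earlier.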

To make this computation, Vantage injects a marker and uses the following aggregation query to collect statistics on (x1, y1). In this request, the fieldID for null_marker is set to 2 and the fieldID for cnt is set to 3:

SELECT CASE
           WHEN x1 IS NULL
           AND  y1 IS NULL
           THEN 1
           WHEN x1 IS NULL
           OR   y1 IS NULL
           THEN 2
           ELSE 0
       END AS null_marker, x1, y1, COUNT(*) AS cnt
FROM t_coll_stats
GROUP BY 1, 2, 3;
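
Assuming t_coll_stats holds the five-row (x1, y1) sample shown earlier, the query would produce groups such as the following (a sketch; row order is illustrative):

null_marker   x1    y1    cnt
0             10    10    1
2             20    ?     1
2             ?     30    1
1             ?     ?     2

The single group with null_marker = 1 carries the all-null cardinality, and the groups with null_marker = 2 supply the distinct partially null values.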

Sampled Statistics

To collect sampled statistics, Vantage generates a retrieve request to do the sampling and then passes the sampled rows to the subsequent aggregation step. The retrieve request can use TOP n operations to retrieve the rows if the table is hash distributed and either hash ordered or a NoPI table. For PPI tables that are sorted on their RowKey value, Vantage can generate a true sample step.
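
For example, a sampled collection might be requested with the USING SAMPLE option, as in the following sketch (the table and column names are carried over from the earlier examples, and the percentage is illustrative):

COLLECT STATISTICS
USING SAMPLE 10 PERCENT
COLUMN x1
ON t_coll_stats;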

PARTITION Statistics

The aggregation step builds a detailed row with the partition number and the cardinality for queries such as the following:

SELECT PARTITION, COUNT(*)
FROM t_coll_stats
GROUP BY 1;
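
Such an aggregation results from collecting statistics on the system-derived PARTITION column, as in this sketch, which assumes t_coll_stats is a row-partitioned table:

COLLECT STATISTICS
COLUMN PARTITION
ON t_coll_stats;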

UDF Statistics

You can collect statistics on deterministic UDFs, but not on nondeterministic UDFs.

Consider the following deterministic UDF:

CREATE FUNCTION months_between(date1 TIMESTAMP, date2 TIMESTAMP)
RETURNS FLOAT
LANGUAGE C
NO SQL
DETERMINISTIC
SPECIFIC months_between_tt
EXTERNAL NAME 'CS!months_between_tt!$PGMPATH$/months_between_tt.c'
PARAMETER STYLE SQL;

The following statement collects statistics on UDF months_between():

COLLECT STATISTICS
COLUMN months_between(BEGIN(policy_duration),
                      END(policy_duration)) AS 
                      Stats_MthsBetweenBegAndEnd
ON policy_types;

The crucial issue for UDF statistics is not collecting them, but dropping them. If you drop a deterministic UDF after you have collected statistics on it, those statistics become unusable. Because UDFs do not belong to any table or column, it is very expensive to drop all the statistics collected on them. Instead, Vantage uses the following lazy approach to dropping statistics that have been collected on deterministic UDFs:
  • When Vantage loads statistics header information from the DBC.StatsTbl for a table to handle a user query or another statistics-related statement, the system detects nonvalid expressions at the time the query is resolved.

    For statistics collected on an unresolvable expression, the system updates the validStats field to FALSE for the corresponding statistics in DBC.StatsTbl.

  • When retrieving histograms, Vantage only retrieves those for valid statistics. This minimizes the overhead of parsing nonvalid statistics during query processing time.
  • The SHOW STATISTICS and HELP STATISTICS statements report only valid statistics.
  • A DROP STATISTICS request on a table forces Vantage to drop all statistics, both valid and nonvalid.
  • A SHOW STATISTICS COLUMN request returns a warning message if it is made on nonvalid statistics.
  • A DROP STATISTICS COLUMN request returns a warning message if it is made on nonvalid statistics.
  • Recollecting statistics on a UDF revalidates the statistics if the UDF exists at the time the statistics are recollected.
  • You can query the DBC.StatsV system view to determine nonvalid statistics by checking the validStats column (see the sketch after this list).

    You cannot drop individual nonvalid statistics.

    To drop nonvalid statistics, you must first drop all statistics on the table and then recollect valid statistics.

  • If you drop a UDF and then recreate it to return a different type, Vantage ignores the existing statistics for that function during optimization.
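
A minimal sketch of this workflow, assuming that DBC.StatsV exposes DatabaseName, TableName, and StatsName columns alongside the validStats column described above:

SELECT DatabaseName, TableName, StatsName
FROM DBC.StatsV
WHERE ValidStats = 'F';

/* Individual nonvalid statistics cannot be dropped; drop all
   statistics on the table, then recollect the valid ones. */
DROP STATISTICS ON t1;
COLLECT STATISTICS COLUMN x1 ON t1;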

Average AMP Rows Per Value Statistic

The average AMP rows per value statistic is an exact, system-wide average, taken across all AMPs, of the number of rows per value for a NUSI column set on which statistics have been collected. This statistic is computed only for NUSI columns and is used for nested join costing.
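
For example, the following sketch collects statistics on a NUSI, which also captures the average AMP rows per value statistic. The index name idx_x1 and the use of table t1 are illustrative:

CREATE INDEX idx_x1 (x1) ON t1;        /* a NUSI on column x1 */
COLLECT STATISTICS INDEX (x1) ON t1;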

Example EXPLAIN Text for First Time Statistics Collection

The following EXPLAIN text reports the steps taken to build a histogram and collect statistics for the first time on column x1 from table t1 with embedded comments to clarify exactly what happens at critical steps in the process.

EXPLAIN COLLECT STATISTICS COLUMN x1 ON t1;
Explanation
-----------------------------------------------------------------------
  1) First, we lock QSTATS.t1 for access.
  2) Next, we do an all-AMPs SUM step to aggregate from QSTATS.t1 by 
     way of an all-rows scan with no residual conditions,
     grouping by field1 (QSTATS.t1.x1). Aggregate Intermediate Results
     are computed locally, then placed in Spool 9. The size of Spool 9
     is estimated with low confidence to be 2 rows (44 bytes). The
     estimated time for this step is 0.03 seconds.
  3) We do an all-AMPs RETRIEVE step from Spool 9 (Last Use) by way of
     an all-rows scan into Spool 5 (all_amps), which is built locally
     on the AMPs. The size of Spool 5 is estimated with low confidence
     to be 2 rows (52 bytes). The estimated time for this step is 0.04
     seconds.
  4) Then we save the UPDATED STATISTICS from Spool 5 (Last Use) into
     Spool 3, which is built locally on the AMP derived from
     DBC.StatsTbl by way of the primary index. /*Hash of t1*/
  5) We lock DBC.StatsTbl for write on a RowHash.
   /*Raise a retryable error if the generated stats id is already 
   used*/
   /*by some other collect stats on the same table.*/
   /*The error step can also be done by USI on (TableId, StatsId)*/
  6) We do a single-AMP ABORT test from DBC.StatsTbl by way of the
     primary index "DBC.StatsTbl.TableId = <t1 Id>" with a residual
     condition of ("DBC.StatsTbl.StatsId = <generated stats id>").
   /*Insert the histogram row with generated stats id*/
  7) We do a single-AMP MERGE into DBC.StatsTbl from Spool 3 (Last 
     Use). The size is estimated with low confidence to be 1 row. The
     estimated time for this step is 0.31 seconds.
   /*Update or insert the master record with the updated table level 
    demographics*/
  8) We do a single-AMP UPDATE from Spool 3 by way of the primary
     index "DBC.StatsTbl.TableId = <t1 Id>" with a residual condition 
     of ("DBC.StatsTbl.StatsId = 0"). The size is
     estimated with low confidence to be 1 row. The estimated time for
     this step is 0.02 seconds. If the row cannot be found, then we do
     an INSERT into DBC.StatsTbl.
  9) We spoil the parser's dictionary cache for the table.
  10) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> No rows are returned to the user as the result of statement 1.

Note the following additional aspects of this EXPLAIN text:
  • In step 7, Vantage inserts the histogram row with the generated StatsId.
  • In step 8, Vantage updates the master record with new table-level demographics with a fixed StatsId of 0.
  • Master records always have a fixed StatsId of 0.

Example EXPLAIN Text for Recollecting Statistics

The following EXPLAIN text reports the steps taken to recollect statistics on column x1 from table t1 with embedded comments to clarify exactly what happens at critical steps in the process:

EXPLAIN COLLECT STATISTICS COLUMN x1 ON t1;
Explanation
-----------------------------------------------------------------------
  1) First, we lock QSTATS.t1 for access.
  2) Next, we do an all-AMPs SUM step to aggregate from QSTATS.t1 by
     way of an all-rows scan with no residual conditions,
     grouping by field1 (QSTATS.t1.x1). Aggregate Intermediate Results
     are computed locally, then placed in Spool 9. The size of Spool 9
     is estimated with low confidence to be 2 rows (44 bytes). The
     estimated time for this step is 0.03 seconds.
  3) We do an all-AMPs RETRIEVE step from Spool 9 (Last Use) by way of
     an all-rows scan into Spool 5 (all_amps), which is built locally
     on the AMPs. The size of Spool 5 is estimated with low confidence
     to be 2 rows (52 bytes). The estimated time for this step is 0.04
     seconds.
  4) Then we save the UPDATED STATISTICS from Spool 5 (Last Use) into
     Spool 3, which is built locally on the AMP derived from
     DBC.StatsTbl by way of the primary index.
  5) We lock DBC.StatsTbl for write on a RowHash. /*Hash of (t1)*/
  6) We do a single-AMP UPDATE from Spool 3 by way of the primary
     index "DBC.StatsTbl.TableId = <t1 Id>" with a residual condition 
     of ("DBC.StatsTbl.StatsId = <existing stats id>"). The size is
     estimated with low confidence to be 1 row. The estimated time for
     this step is 0.02 seconds. If the row cannot be found, then we do
     an INSERT into DBC.StatsTbl.
  7) We do a single-AMP UPDATE from Spool 3 by way of the primary
     index "DBC.StatsTbl.TableId = <t1 Id>" with a residual condition 
     of ("DBC.StatsTbl.StatsId = 0"). The size is
     estimated with low confidence to be 1 row. The estimated time for
     this step is 0.02 seconds. If the row cannot be found, then we do
     an INSERT into DBC.StatsTbl.
  8) We spoil the parser's dictionary cache for the table.
  9) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
-> No rows are returned to the user as the result of statement 1.

Note the following additional aspects of this EXPLAIN text:
  • In step 6, Vantage updates the histogram row with the existing StatsId.
  • In step 7, Vantage updates the master record with new table-level demographics with a fixed StatsId of 0.
  • Master records always have a fixed StatsId of 0.