How the AMP Software Collects Statistics

When you collect statistics and demographics, the AMP software creates statistics rows in DBC.StatsTbl that store the statistics and demographics summarizing the table columns specified by the COLLECT STATISTICS statement. These rows come in the following basic types:
  • Regular statistics rows, one or more per AMP
  • Null rows, zero or one per AMP
  • All-null rows, zero or one per AMP
  • Average rows per value, one per AMP
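
For reference, the following minimal request triggers this collection, using the table t1 and column x1 that appear in later examples:

COLLECT STATISTICS COLUMN x1 ON t1;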

The information taken from these rows is used to populate the statistical interval histograms used by the Optimizer to make its initial cardinality estimates (see Interval Histograms).

The following statistics are global per histogram, so they are reported only once:
  • Number of null rows
  • Number of all-null rows
  • Overall average of the average number of rows per value per AMP

You can report these statistics using HELP STATISTICS (Optimizer Form). HELP STATISTICS reports only summary statistics. To report detailed statistics, use SHOW STATISTICS.
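
For example, assuming statistics have already been collected on column x1 of table t1, the following requests report the summary and detailed forms, respectively:

HELP STATISTICS t1;

SHOW STATISTICS VALUES COLUMN x1 ON t1;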

Null and All-Null Statistics and Demographics

Null and all-null statistics provide the following cardinality information:

Statistic     Definition
NumNulls      Number of rows in the column set for which one or more of the columns are null.
NumAllNulls   Number of rows in a composite column set for which all of the columns are null.

Vantage collects the NumAllNulls statistic for composite column sets only when all columns in the set are null. That is, you can collect a NumAllNulls value only when you collect multicolumn statistics or collect statistics on a composite index, and none of the columns in that set have a value.
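
For example, a multicolumn collection such as the following, on the composite USI columns b, c, and d of the t_ex table described next, is what makes a NumAllNulls value available:

COLLECT STATISTICS COLUMN (b, c, d) ON t_ex;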

Consider the following example of what is meant by null and all-null rows. Suppose you create a 3-column USI for table t_ex on columns b, c, and d. When you collect statistics on the USI, one row is found in which all 3 columns making up the USI are null. This is reported as an all-nulls instance in the statistics. The scenario is purely illustrative: because a USI enforces uniqueness, a table can never contain more than one row whose USI columns are all null.

The columns for this row look something like the following, where nulls are represented by the QUESTION MARK character.

t_ex
  a      b      c      d      e           f
  (PI)   (------- USI -------)
  355    ?      ?      ?      25000.00    375000.00

This contrasts with the case where statistics have been collected on a composite column set, and one or more, but not necessarily all, of the columns on which those statistics have been collected for a given row are null.

Assume the same table and index definitions as before. When you collect statistics on the USI, any row in which one or two, but not all, of the three columns making up the USI are null is reported as an occurrence of a null column in the statistics.

All of the following rows in t_ex are reported as null fields, because each has at least one null in a USI column; only the first row is also reported for the all-null fields statistic.

t_ex
  a      b      c      d      e           f
  (PI)   (------- USI -------)
  355    ?      ?      ?      25000.00    375000.00   <-- all-null fields
  685    ?      155    ?      45000.00    495000.00   <-- null fields
  900    325    ?      ?      32000.00    400000.00
  275    ?      ?      760    55000.00    575000.00
  597    ?      254    893    10000.00    150000.00
  322    679    ?      204    75000.00    650000.00
  781    891    357    ?      32000.00    400000.00

All-nulls describes the case where statistics have been collected on a composite column set, and all the columns on which those statistics have been collected for a given row are null. With this information, the Optimizer can more accurately estimate the true number of unique values in a set, thus enabling more accurate join costing.

For example, suppose you have table t1 with columns x1 and y1, populated as follows.

t1
  x1    y1
  10    10
  20    ?     <-- null fields
  ?     30
  ?     ?     <-- all-null fields
  ?     ?

Here, the second and third rows are partially null, and the last two rows are all-null.

If only the NumNulls statistic were collected, then any row in a composite column set with one or more null columns would be counted as a null row. This is not an issue for single-table cardinality estimation, because a comparison against a null evaluates as FALSE and the row is treated as an unmatched row.

For example, with only the NumNulls statistic, the histogram on (x1, y1) indicates that the number of nulls is 4, the number of unique values is 2, and the total number of rows is 5. If columns x1 and y1 are used as join columns, the Optimizer costs the join by incorrectly assuming there are only 2 unique values in the data when there are actually 4. This can cause problems in areas such as redistribution costing and skew detection.

To circumvent this problem, Vantage computes the true number of distinct partial nulls and saves that number in the histogram as NumPNulls.

The following equation calculates the true number of distinct values when partially null rows are present:

NumUniqueValues = NumValues + NumPNullValues + 1

Equation Element   Description
NumUniqueValues    The true number of distinct values, accounting for partially null rows.
NumValues          The number of distinct values among the rows with no null columns.
NumPNullValues     The number of true distinct values among the partially null rows.
1                  An adjustment for the estimation error when the cardinality of all-null rows is greater than 0.
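
Applied to the t1 example above, and assuming this reading of the equation, the rows with no nulls contribute NumValues = 1 distinct value, (10, 10); the partially null rows contribute NumPNullValues = 2 distinct values, (20, ?) and (?, 30); and because all-null rows are present, the adjustment of 1 applies. This gives NumUniqueValues = 1 + 2 + 1 = 4, the true number of distinct values in the data.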

To make this computation, Vantage injects a marker and uses the following aggregation query to collect statistics on (x1,y1). For the following request, the fieldID for null_marker is set to 2 and the fieldID for cnt is set to 3:

SELECT CASE
           WHEN x1 IS NULL
            AND y1 IS NULL
           THEN 1
           WHEN x1 IS NULL
            OR  y1 IS NULL
           THEN 2
           ELSE 0
       END AS null_marker, x1, y1, COUNT(*) AS cnt
FROM t_coll_stats
GROUP BY 1, 2, 3;

Sampled Statistics

To collect sampled statistics, Vantage generates a retrieve request to do the sampling and then passes the sampled rows to the subsequent aggregation step. The retrieve request can use TOP n operations to retrieve the rows if the table is hash distributed and either hash ordered or a NoPI table. For PPI tables that are sorted on their RowKey value, Vantage can generate a true sample step.
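
You can also request sampled collection explicitly. A minimal sketch, reusing table t1 from the other examples:

COLLECT STATISTICS
USING SAMPLE 10 PERCENT
COLUMN x1
ON t1;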

PARTITION Statistics

The aggregation step builds a detailed row with the partition number and the cardinality for queries such as the following:

SELECT PARTITION, COUNT(*)
FROM t_coll_stats
GROUP BY 1;
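
The statistics themselves are requested on the system-derived PARTITION column. A minimal sketch:

COLLECT STATISTICS COLUMN PARTITION ON t_coll_stats;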

UDF Statistics

You can collect statistics on deterministic UDFs, but not on nondeterministic UDFs.

Consider the following deterministic UDF:

CREATE FUNCTION months_between(date1 TIMESTAMP, date2 TIMESTAMP)
RETURNS FLOAT
LANGUAGE C
NO SQL
DETERMINISTIC
SPECIFIC months_between_tt
EXTERNAL NAME 'CS!months_between_tt!$PGMPATH$/months_between_tt.c'
PARAMETER STYLE SQL;

The following statement collects statistics on UDF months_between():

COLLECT STATISTICS
COLUMN months_between(BEGIN(policy_duration),
                      END(policy_duration)) AS Stats_MthsBetweenBegAndEnd
ON policy_types;

The crucial issue for UDF statistics is not collection, but dropping. If you drop a deterministic UDF after you have collected its statistics, those statistics become unusable. Because UDFs do not belong to a table or column, dropping their collected statistics is expensive. Therefore, Vantage uses the following approach to dropping statistics that have been collected on deterministic UDFs:
  • When Vantage loads statistics header information from the DBC.StatsTbl for a table to handle a user query or another statistics-related statement, the system detects nonvalid expressions at the time the query is resolved.

    For statistics collected on an unresolvable expression, the system updates the validStats field to FALSE for the corresponding statistics in DBC.StatsTbl.

  • When retrieving histograms, Vantage only retrieves those for valid statistics. This minimizes the overhead of parsing nonvalid statistics during query processing time.
  • The SHOW STATISTICS and HELP STATISTICS statements report only valid statistics.
  • A DROP STATISTICS request on a table forces Vantage to drop all statistics, both valid and nonvalid.
  • A SHOW STATISTICS COLUMN request made on nonvalid statistics returns a warning message.
  • A DROP STATISTICS COLUMN request made on nonvalid statistics returns a warning message.
  • Recollecting statistics on a UDF revalidates the statistics if the UDF exists at the time the statistics are recollected.
  • You can query the DBC.StatsV system view to determine nonvalid statistics by checking the validStats column, as shown in the sketch after this list.

    You cannot drop individual nonvalid statistics.

    To drop nonvalid statistics, you must first drop all statistics on the table and then recollect valid statistics.

  • If you drop a UDF and recreate it to return a different type, Vantage ignores the existing statistics for that function during optimization.
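
The following is a minimal sketch of the validity check described above. Only the validStats column is documented here; the other column names and the 'F' encoding are assumptions about DBC.StatsV that may vary by release:

SELECT DatabaseName, TableName, ColumnName, ValidStats  -- column names other than ValidStats are assumed
FROM DBC.StatsV
WHERE ValidStats = 'F';                                 -- assumes nonvalid statistics are flagged 'F'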

Average AMP Rows Per Value Statistic

The average AMP rows per value statistic is an exact, system-wide average across all AMPs of the number of rows per value for a NUSI column set on which statistics have been collected. This statistic is computed only for NUSI columns and is used for nested join costing.
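
For example, this statistic is produced when you collect statistics on a NUSI. A minimal sketch, reusing table t1 and assuming a NUSI on its column y1:

CREATE INDEX (y1) ON t1;                  -- defines a NUSI on y1

COLLECT STATISTICS INDEX (y1) ON t1;      -- collects statistics, including average AMP rows per value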

Example EXPLAIN Text for First Time Statistics Collection

The following EXPLAIN text reports the steps taken to build a histogram and collect statistics for the first time on column x1 from table t1 with embedded comments to clarify exactly what happens at critical steps in the process.

EXPLAIN COLLECT STATISTICS COLUMN x1 ON t1;
Explanation
-----------------------------------------------------------------------
  1) First, we lock QSTATS.t1 for access.
  2) Next, we do an all-AMPs SUM step to aggregate from QSTATS.t1 by
     way of an all-rows scan with no residual conditions,
     grouping by field1 ( QSTATS.t1.x1). Aggregate Intermediate Results
     are computed locally, then placed in Spool 9. The size of Spool 9
     is estimated with low confidence to be 2 rows (44 bytes). The
     estimated time for this step is 0.03 seconds.
  3) We do an all-AMPs RETRIEVE step from Spool 9 (Last Use) by way of
     an all-rows scan into Spool 5 (all_amps), which is built locally
     on the AMPs. The size of Spool 5 is estimated with low confidence
     to be 2 rows (52 bytes). The estimated time for this step is 0.04
     seconds.
  4) Then we save the UPDATED STATISTICS from Spool 5 (Last Use) into
     Spool 3, which is built locally on the AMP derived from
     DBC.StatsTbl by way of the primary index. /*Hash of t1*/
  5) We lock DBC.StatsTbl for write on a RowHash.
   /*Raise a retryable error if the generated stats id is already 
   used*/
   /*by another collect stats on the same table.*/
   /*The error step can also be done by USI on (TableId, StatsId)*/
  6) We do a single-AMP ABORT test from DBC.StatsTbl by way of the
     primary index "DBC.StatsTbl.TableId = <t1 Id>" with a residual
     condition of ("DBC.StatsTbl.StatsId = <generated stats id>").
   /*Insert the histogram row with generated stats id*/
  7) We do a single-AMP MERGE into DBC.StatsTbl from Spool 3 (Last 
     Use). The size is estimated with low confidence to be 1 row. The
     estimated time for this step is 0.31 seconds.
   /*Update or insert the master record with the updated table level 
    demographics*/
  8) We do a single-AMP UPDATE from Spool 3 by way of the primary
     index "DBC.StatsTbl.TableId = <t1 Id>" with a residual condition
     of ("DBC.StatsTbl.StatsId = 0"). The size is
     estimated with low confidence to be 1 row. The estimated time for
     this step is 0.02 seconds. If the row cannot be found, then we do
     an INSERT into DBC.StatsTbl.
  9) We spoil the parser's dictionary cache for the table.
  10) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> No rows are returned to the user as the result of statement 1.
Additional aspects of this EXPLAIN text:
  • In step 7, Vantage inserts the histogram row with the generated StatsId.
  • In step 8, Vantage updates the master record with the new table-level demographics. Master records have a fixed StatsId of 0.

Example EXPLAIN Text for Recollecting Statistics

The following EXPLAIN text reports the steps taken to recollect statistics on column x1 from table t1 with embedded comments to clarify exactly what happens at critical steps in the process:

EXPLAIN COLLECT STATISTICS COLUMN x1 ON t1;
Explanation
-----------------------------------------------------------------------
  1) First, we lock QSTATS.t1 for access.
  2) Next, we do an all-AMPs SUM step to aggregate from QSTATS.t1 by
     way of an all-rows scan with no residual conditions,
     grouping by field1 ( QSTATS.t1.x1). Aggregate Intermediate Results
     are computed locally, then placed in Spool 9. The size of Spool 9
     is estimated with low confidence to be 2 rows (44 bytes). The
     estimated time for this step is 0.03 seconds.
  3) We do an all-AMPs RETRIEVE step from Spool 9 (Last Use) by way of
     an all-rows scan into Spool 5 (all_amps), which is built locally
     on the AMPs. The size of Spool 5 is estimated with low confidence
     to be 2 rows (52 bytes). The estimated time for this step is 0.04
     seconds.
  4) Then we save the UPDATED STATISTICS from Spool 5 (Last Use) into
     Spool 3, which is built locally on the AMP derived from
     DBC.StatsTbl by way of the primary index.
  5) We lock DBC.StatsTbl for write on a RowHash. /*Hash of (t1)*/
  6) We do a single-AMP UPDATE from Spool 3 by way of the primary
     index "DBC.StatsTbl.TableId = <t1 Id>" with a residual condition
     of ("DBC.StatsTbl.StatsId = <existing stats id>"). The size is
     estimated with low confidence to be 1 row. The estimated time for
     this step is 0.02 seconds. If the row cannot be found, then we do
     an INSERT into DBC.StatsTbl.
  7) We do a single-AMP UPDATE from Spool 3 by way of the primary
     index "DBC.StatsTbl.TableId = <t1 Id>" with a residual condition
     of ("DBC.StatsTbl.StatsId = 0"). The size is
     estimated with low confidence to be 1 row. The estimated time for
     this step is 0.02 seconds. If the row cannot be found, then we do
     an INSERT into DBC.StatsTbl.
  8) We spoil the parser's dictionary cache for the table.
  9) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
-> No rows are returned to the user as the result of statement 1.
Additional aspects of this EXPLAIN text:
  • In step 6, Vantage updates the histogram row with the existing StatsId.
  • In step 7, Vantage updates the master record with the new table-level demographics. Master records have a fixed StatsId of 0.