Collecting Statistics on the PARTITION Column and the Row Partitioning Column Set - Teradata Vantage

Teradata® VantageCloud Lake

Deployment
VantageCloud
Edition
Lake
Product
Teradata Vantage
Published
January 2023
ft:locale
en-US
ft:lastEdition
2024-12-11
dita:mapPath
phg1621910019905.ditamap
dita:ditavalPath
pny1626732985837.ditaval
dita:id
phg1621910019905

On partitioned tables (row or column), collect PARTITION statistics and statistics on the row partitioning column set. Refresh statistics whenever row partition demographics change significantly. The Ten Percent Rule, the usual guideline of refreshing statistics after a 10 percent change in table demographics, does not apply for row partitioning columns. Instead, recollect statistics whenever the demographics for a row partition change by 10 percent of more. This guideline applies to both the row partitioning column set and to the system-derived PARTITION column for a partitioned table. You cannot collect statistics on the system-derived PARTITION#L n columns of a partitioned table.

This is a guideline, not a rule, and you may need to adjust the point at which you refresh partition statistics upward or downward to maintain high quality query plans.

For row partitioned tables, the Optimizer uses PARTITION statistics to estimate the number of populated partitions, selectivity based on partition elimination, I/O reduction ratio, and so forth. For column partitioned tables, the Optimize uses PARTITION statistics to estimate the column compression ratio. This cannot be estimated from evaluating standard data statistics and demographics.

Collect single-column PARTITION statistics to enable the Optimizer to take advantage of the more accurate partitioning costing made possible by these statistics. Collecting these statistics is quick because collecting such statistics, even at a 100 percent level, requires only scanning the cylinder indexes for the table and at most n+1 data blocks, where n is the number of partitions for the table. Therefore, the system does not scan all data blocks for the table, unlike when collecting statistics on 100 percent of the data for other columns.

The system does not use this fast collection method if a PARTITION column is part of a multicolumn statistics collection. Statistics on the system-derived PARTITION column provide the Optimizer with the best estimate of the number of populated partitions and the cardinality of each partition when statistics were last collected.

PARTITION statistics provide the Optimizer with information on whether partitions are empty or populated. The Optimizer can use this information to make better cost estimates of candidate query plans when there is a significant number of empty partitions. The Optimizer must generate a plan that accesses all defined partitions, because the partitions may not be empty when the plan runs.

If PARTITION statistics have not been collected, the Optimizer estimates the cardinality of a partition as follows.



Therefore, empty partitions can cause the Optimizer to underestimate the partition cardinality. The Optimizer may reduce the number of partitions (especially if the maximum number can be 65,535) to avoid underestimating. However, the best policy to make sure of accurate partition cardinality estimates is to define only partitions that are in use and to collect statistics on the system-derived PARTITION column.

Collecting multicolumn statistics for a base table that include the system-derived column PARTITION column can reduce estimating plan costs for a query when all columns in the multicolumn statistics (including the system-derived PARTITION column) have single-table equality predicates for the query.

If a partition changes from being empty to being populated or from being populated to empty, you must recollect statistics on the system-derived PARTITION column.

You must also refresh statistics when significant changes occur to the table. The guideline of 10 percent change (from insert, update, delete, and merge operations) in rows applies at the partition level for partitioned tables instead of the table level. You may need to adjust this guideline as needed for your application.

The only exception to this guideline is a partitioned table row-partitioned by a single-column expression. You need not collect separate PARTITION statistics and the statistics you collect on the row partitioning column because those column statistics are automatically inherited as single-column PARTITION statistics.

For example, assume the following table definition.

CREATE TABLE table_1 (
  col_1 INTEGER,
  col_2 INTEGER)
PRIMARY INDEX(col_1) PARTITION BY col_2;

If you collect individual column statistics on col_2, which is the partitioning column, then those statistics are also inherited as PARTITION statistics.

Teradata recommends against using PARTITION for a column in a table definition, especially for a partitioned table. If you do, you cannot collect statistics on the system-derived PARTITION column, because the database resolves explicitly named columns as regular columns.

The system uses multicolumn PARTITION statistics to do single-table estimates when all the columns, including the PARTITION column, have single-table equality conditions.

Collect SUMMARY statistics on nonpartitioned tables, because these statistics can be collected quickly and are useful to the Optimizer for making cardinality estimates.

You can specify a USING SAMPLE clause to collect single-column PARTITION statistics, but the specification is ignored. Instead, the system automatically resets the internal sampling percentage to 100. Collecting accurate single-column PARTITION statistics is important, and collecting PARTITION statistics is fast. Therefore, sampling is optional to improve the performance of collecting those statistics. The sampling flag in a detailed statistics report is reported as 0 to document this behavior. See SHOW STATISTICS. The USING SAMPLE clause is honored for multicolumn PARTITION statistics. The database begins the collection process using a 2% sampling percentage and increases the percentage dynamically if the values are skewed more than a system-defined threshold value.