Collecting Statistics on the PARTITION Column and the Row Partitioning Column Set - Advanced SQL Engine

Collecting Statistics on the PARTITION Column and the Row Partitioning Column Set - Advanced SQL Engine - Teradata Database

SQL Data Definition Language Detailed Topics

Product

Advanced SQL Engine

Teradata Database

Release Number

17.10

Published

July 2021

Language

English (United States)

Last Update

2021-07-27

dita:mapPath

imq1591724555718.ditamap

dita:ditavalPath

imq1591724555718.ditaval

dita:id

B035-1184

lifecycle

Product Category

Teradata Vantage™

On partitioned tables (row or column), always collect PARTITION statistics and statistics on the row partitioning column set. Refresh statistics whenever row partition demographics change significantly. The Ten Percent Rule, the usual guideline of refreshing statistics after a 10 percent change in table demographics, does not apply for row partitioning columns. Instead, you should recollect statistics whenever the demographics for a row partition change by 10 percent of more. This guideline applies to both the row partitioning column set and to the system-derived PARTITION column for a partitioned table. You cannot collect statistics on the system-derived PARTITION#L n columns of a partitioned table.

This is a guideline, not a rule, and you might need to adjust the point at which you refresh partition statistics upward or downward to maintain high quality query plans.

For row partitioned tables, the Optimizer uses PARTITION statistics to estimate the number of populated partitions, selectivity based on partition elimination, I/O reduction ratio, and so forth. For column partitioned tables, the Optimize uses PARTITION statistics to estimate the column compression ratio. This cannot be estimated from evaluating standard data statistics and demographics.

You should collect single-column PARTITION statistics to enable the Optimizer to take advantage of the more accurate partitioning costing made possible when PARTITION statistics exist. For more information about the system-derived PARTITION column, see Teradata Vantage™ - Database Design, B035-1094. Collecting these statistics is fairly quick because collecting such statistics, even at a 100 percent level, only requires scanning of the cylinder indexes for the table plus at most n+1 data blocks, where n is the number of partitions for the table. Therefore, the system does not scan all of the data blocks for the table as it does for other columns when collecting statistics on 100 percent of the data.

The system does not use this fast collection method if a PARTITION column is part of a multicolumn statistics collection. Statistics on the system-derived PARTITION column provide the Optimizer with the best estimate of the number of populated partitions and the cardinality of each partition when statistics were last collected.

PARTITION statistics provide the Optimizer with information on whether partitions are empty or populated. The Optimizer can then use this information to make better cost estimates of candidate query plans when there is a significant number of empty partitions. Note that the Optimizer must generate a plan that accesses all defined partitions because they might not be empty at the time the plan executes.

If PARTITION statistics have not been collected, the Optimizer estimates the cardinality of a partition as follows.

Therefore, empty partitions can cause the Optimizer to underestimate the partition cardinality. The Optimizer reduces the number of partitions (especially if the maximum number could be 65,535) in some cases to avoid underestimating. However, the best policy to ensure accurate partition cardinality estimates is to define only partitions that are actually in use and to collect statistics on the system-derived PARTITION column.

Collecting multicolumn statistics for a base table that include the system-derived column PARTITION column can provide some benefit for estimating plan costs for a query when all the columns, including the system-derived PARTITION column, in the multicolumn statistics have single-table equality predicates for the query.

If a partition changes from being empty to being populated or from being populated to empty, you must recollect statistics on the system-derived PARTITION column.

You must also refresh statistics when significant changes occur to the table. The guideline of 10 percent change (from insert, update, delete, and merge operations) in rows applies at the partition level for partitioned tables instead of the table level. You might need to adjust this guideline as needed for your application.

The only exception to this guideline is a partitioned table row-partitioned by a single-column expression. In this case it is not necessary to collect separate PARTITION statistics in addition to the statistics you collect on the row partitioning column because those column statistics are automatically inherited as single-column PARTITION statistics.

For example, assume the following table definition.

CREATE TABLE table_1 (
  col_1 INTEGER,
  col_2 INTEGER)
PRIMARY INDEX(col_1) PARTITION BY col_2;

If you collect individual column statistics on col_2, which is the partitioning column, then those statistics are also inherited as PARTITION statistics.

Because PARTITION is not a reserved keyword, you can use it as the name for any column in a table definition, but you should not. This practice is strongly discouraged in all cases, and particularly for partitioned tables, because if you explicitly name a column partition, the database resolves it as a regular column. As a result, you cannot collect statistics on the system-derived PARTITION column of any table that also has a user-defined column named partition.

The system uses multicolumn PARTITION statistics to do single-table estimates when all the columns, including the PARTITION column, have single-table equality conditions.

You should also collect SUMMARY statistics on nonpartitioned tables, because these statistics can be collected quickly and they are very useful to the Optimizer for making cardinality estimates.

You can specify a USING SAMPLE clause to collect single-column PARTITION statistics, but the specification is ignored. Instead, the system automatically resets the internal sampling percentage to 100. It is important to collect accurate statistics for this case, and collecting PARTITION statistics is a very fast operation. As a result, sampling is not required to improve the performance of collecting those statistics. The sampling flag in a detailed statistics report is always reported as 0 to document this behavior. See SHOW STATISTICS. Note that the USING SAMPLE clause is honored for multicolumn PARTITION statistics. The database begins the collection process using a 2% sampling percentage and then increases the percentage dynamically if it determines that the values are skewed more than a system-defined threshold value.