Collecting Statistics on Hash Index Columns - Advanced SQL Engine

Collecting Statistics on Hash Index Columns - Advanced SQL Engine - Teradata Database

Database Design

Product

Advanced SQL Engine

Teradata Database

Release Number

17.10

Published

July 2021

Language

English (United States)

Last Update

2021-07-27

dita:mapPath

kko1591750222108.ditamap

dita:ditavalPath

kko1591750222108.ditaval

dita:id

B035-1094

lifecycle

Product Category

Teradata Vantage™

When there are no statistics for a hash index, the Optimizer uses the statistics of the corresponding base table rows, just as it does for a single-table join index.

The guidelines for selecting hash index columns on which to collect statistics are similar to those for base tables and join indexes. The primary factor to consider in all cases is whether the statistics provide better access plans. If they do not, consider dropping them. If the statistics you collect produce worse access plans, you should always report the incident to Teradata Services personnel.

When you consider collecting statistics for a hash index, it might help to think of the index as a special kind of base table that stores a derived result. For example, any access plan that uses a hash index must access it with a direct probe, a full table scan, or a range scan.

With this in mind, consider the following when deciding which columns to collect statistics for.

IF an execution plan might involve …	THEN collect statistics on the …
search condition keys	column set that constitutes the search condition predicate.
joining the hash index with another table	join columns to provide the Optimizer with the information it needs to best estimate the cardinalities of the join.

IF a hash index is defined …	THEN you should collect statistics on the …
with a BY clause or ORDER BY clause (or both)	primary index and order keys specified by those clauses.
without a BY clause	primary index column set of the base table on which the index is defined.
without an ORDER BY clause and the order key column set from the base table is not included in the column_name_1 list	order key of the base table on which the index is defined. This action provides the Optimizer with several essential baseline statistics, including the cardinality of the hash index.

If a hash index column appears frequently in WHERE clause predicates, you should consider collecting statistics on it as well, particularly if that column is the sort key for a value-ordered hash index.
Consider collecting statistics on the base table columns that are also part of the hash index rather than collecting statistics on the hash index itself.

General Guidelines

You should collect statistics on appropriate columns of a hash index frequently just as you would for any base table or join index.

For most applications, you should collect the statistics on base table columns rather than on hash index columns.

When you create a hash index with a BY clause, you can collect and drop statistics on those columns using an INDEX clause to specify the column set. For example, suppose a hash index has the following definition.

     CREATE HASH INDEX OrdHIdx8 (o_custkey, o_orderdate) ON orders
     BY (o_custkey, o_orderdate)
     ORDER BY (o_orderdate);

Then you can collect statistics on the partitioning columns as shown in the following example.

     COLLECT STATISTICS ON OrdHIdx8 INDEX (o_custkey, o_orderdate);

Note that you can only use columns specified in a BY clause definition if you use the keyword INDEX in your COLLECT STATISTICS statement.

A poorer choice would be to collect statistics using the column clause syntax. To do this, you must perform two separate statements.

     COLLECT STATISTICS ON OrdHIdx8 COLUMN (o_custkey);

     COLLECT STATISTICS ON OrdHIdx8 COLUMN (o_orderdate);

It is always better to collect the statistics for a multicolumn index on the index itself rather than individually on its component columns because its selectivity is much better when you collect statistics on the index columns as a set.

For example, a query with a predicate like WHERE x=1 AND y=2 is better optimized if statistics are collected on INDEX (x,y) than if they are collected individually on column x and column y.

The same restrictions hold for the DROP STATISTICS statement.

Collecting Statistics on Base Table Columns Instead of Hash Index Columns

The Optimizer substitutes base table statistics for hash index statistics when no demographics have been collected for its hash indexes. Because of the way hash index columns are built, it is generally best not to collect statistics directly on the index columns and instead to collect them on the corresponding columns of the base table. This optimizes both system performance and disk storage by eliminating the need to collect the same data redundantly.

Collect statistics on a hash index or non-sparse join index directly in the following scenarios:

If you decide not to collect the statistics for the relevant base table columns for some reason, then you should collect them directly on the corresponding hash index columns.
If the hash index is row or column partitioned, collect single-column and multicolumn PARTITION statistics directly on the hash index.
Collect SUMMARY statistics directly on the hash index as the summary attributes such as database block size, row size, etc. differ from the underlying base table.