Depending on the complexity of the joins or the cardinalities of base tables that must be aggregated, join indexes optimize the performance of certain types of workloads.
Selecting the Primary Index for Join Indexes
As with any other Teradata Database table, you must define a primary index for any join index you create unless it is column-partitioned (in which case it is optional). While the primary index for any join index must be defined as a NUPI, the column on which it is defined need not be nonunique; in fact, the usual preferences for defining a unique rather than a nonunique column set as the primary index for a base table also apply to join indexes.
Because join indexes are not part of the logical model for a database, and because they are de facto denormalized database objects, there might not be an overriding reason to define one column set over another as the primary index apart from how much more evenly one set might distribute join index table rows than another.
It is usually important to use the join key as the primary index for single-table join indexes, but otherwise there is no compelling theoretical reason to select one unique (or, if not unique, highly singular) column set over another as the primary index. If you create a join index to support a row-partitioned base table, you should consider creating it using row partitioning. This is valid only if the join index is not also row-compressed. You can also create a row-partitioned join index for a base table that does not have row partitioning. It is always possible that such a join index might provide an alternative organization of the data that facilitates access based on partitions. For example, suppose you have a nonpartitioned base table designed to handle efficient joins on its primary index. You might also want to create a row-partitioned join index on the table that optimizes fast row-partition-based access to the data. The rules for creating a row-partitioned join index are generally the same as those for creating a row-partitioned base table.
If a join index is designed to support range queries, you should consider specifying a row-partitioned primary index for it. Note that you cannot specify a row-partitioned primary index for a row-compressed join index.
Join index primary indexes are defined analogously to base table primary indexes: with the CREATE JOIN INDEX statement.
Selecting Secondary Indexes for Join Indexes
Join indexes are often more high-performing if they have one or more secondary indexes defined on them. The Optimizer adds a join index (even a partially covering join index) to its query plan whenever it can, and just because you define it to have a particular most likely use and access path, there is no reason to believe that the same join index might not also be useful for other, unplanned, queries. The Optimizer will join a base table that is unrelated to a join index with that join index if the query plan can be made more cost effective by doing so.
A secondary index on a join index cannot be defined as UNIQUE, even though the column set on which it is defined is unique. This rule is enforced because of the way indexes on join index tables are handled internally.
See “CREATE JOIN INDEX” in SQL Data Definition Language for further information about creating join indexes and using secondary indexes with them.
Statistics and Other Demographic Data for Join Indexes
Collect statistics on the indexes of a join index to provide the Optimizer with the information it needs to generate an optimal plan.
The COLLECT STATISTICS (Optimizer Form) statement collects demographics, computes statistics from them, and writes the resulting data into individual entries for each individual base table and join index table on the system.
As far as the Optimizer is concerned, a multitable join index and the base tables it supports are entirely separate entities. You must collect statistics on multitable join index columns separately from the statistics you collect on their underlying base table columns because the column statistics for multitable join indexes and their underlying base tables are not interchangeable.
On the other hand, it is generally preferable to collect statistics on the underlying base table of a single-table join index and not directly on the join index columns. See Guidelines for Collecting Statistics on Single-Table Join Indexes and Collecting Statistics on Base Table Columns Instead of Single-Table Join Index Columns below for further details on collecting statistics for single-table join indexes. Note that the derived statistics framework supports bidirectional inheritance of statistics between a non-sparse single-table join index and its underlying base table. See SQL Request and Transaction Processing for details.
Guidelines for Collecting Statistics on Multitable Join Indexes
The guidelines for collecting statistics on the relevant columns are the same as those for any regular join query that is frequently executed or whose performance is critical. The only difference with join indexes is that the join result is persistently stored and maintained by the AMP software without user intervention.
Note the following guidelines for collecting statistics for join indexes.
- To improve the performance of creating a join index and maintaining it during updates, collect statistics on its base tables immediately prior to creating the join index.
- Collect statistics on all the indexes defined on your join indexes to provide the Optimizer with the information it needs to generate an optimal plan.
- Collect statistics on additional join index columns that frequently appear in WHERE clause search conditions, especially when the column is the sort key for a value-ordered join index because the Optimizer can then use that information to more accurately compare the cost of using a NUSI-based access path in conjunction with range or equality conditions specified on the sort key column.
- In general, there is no benefit in collecting statistics on a join index for joining columns specified in the join index definition itself. Statistics related to these columns should be collected on the underlying base tables rather than on the join index.
The only time you gain an advantage by collecting statistics on a join column of the join index definition is when that column is used as a join column to other base tables in queries where the join index is expected to be used in the Optimizer query plan.
Guidelines for Collecting Statistics on Single-Table Join Indexes
For most applications, you should collect the statistics on base table columns rather than on single-table join index columns. See Collecting Statistics on Base Table Columns Instead of Single-Table Join Index Columns for further information.
You can collect and drop statistics on the primary index columns of a single-table join index using an INDEX clause to specify the column set. For example, suppose a join index has the following definition.
CREATE JOIN INDEX OrdJIdx8 AS SELECT o_custkey, o_orderdate FROM Orders PRIMARY INDEX (o_custkey, o_orderdate) ORDER BY (o_orderdate);
Then you can collect statistics on the index columns as a single object as shown in the following example.
COLLECT STATISTICS ON OrdJIdx8 INDEX (o_custkey, o_orderdate);
Note that you can only use columns specified in a PRIMARY INDEX clause definition if you use the keyword INDEX in your COLLECT STATISTICS statement.
A poorer choice would be to collect statistics using the column clause syntax. To do this, you must perform two separate statements.
COLLECT STATISTICS ON OrdJIdx8 COLUMN (o_custkey); COLLECT STATISTICS ON OrdJIdx8 COLUMN (o_orderdate);
It is always better to collect the statistics for a multicolumn index on the index itself rather than individually on its component columns because its selectivity is much better when you collect statistics on the index columns as a set.
For example, a query with a predicate like WHERE x=1 AND y=2 is better optimized if statistics are collected on INDEX (x,y) than if they are collected individually on column x and column y.
The same restrictions hold for the DROP STATISTICS statement.
Collecting Statistics on Base Table Columns Instead of Single-Table Join Index Columns
The Optimizer substitutes base table statistics for single-table join index statistics when no demographics have been collected for its single-table indexes. Because of the way single-table join index columns are built, it is generally best not to collect statistics directly on the index columns and instead to collect them on the corresponding columns of the base table. This optimizes both system performance and disk storage by eliminating the need to collect the same data redundantly.
You might need to collect statistics on single-table join index columns instead of their underlying base table columns if you decide not to collect the statistics for the relevant base table columns for some reason. In this case, you should collect statistics directly on the corresponding single-table join index columns.
Note that the derived statistics framework can use bidirectional inheritance of statistics between base tables and their underlying non-sparse single-table join indexes, so the importance of collecting statistics on base table columns rather than non-sparse single-table join index columns is no longer as important as it once was (see SQL Request and Transaction Processing for details).
Guidelines for Collecting Statistics On Single-Table Join Index Columns
The guidelines for selecting single-table join index columns on which to collect statistics are similar to those for base tables. The primary factor to consider in all cases is whether the statistics provide better access plans. If they do not, consider dropping them. If the statistics you collect produce worse access plans, then you should always report the incident to Teradata support personnel.
When you are considering collecting statistics for a single-table join index, it might help to think of the index as a special kind of base table that stores a derived result. For example, any access plan that uses a single-table join index must access it with a direct probe, a full table scan, or a range scan. With this in mind, consider the following factors when deciding which columns to collect statistics for.
- Always consider collecting statistics on the primary index. This is particularly critical for accurate cardinality estimates.
|IF an execution plan might involve …||THEN collect statistics on the …|
|search condition keys||column set that constitutes the search condition predicate.|
|joining the single-table index with another table||join columns to provide the Optimizer with the information it needs to best estimate the cardinalities of the join.|
|IF a single-table join index is defined …||THEN you should collect statistics on the …|
|with an ORDER BY clause||order key specified by that clause.|
|without an ORDER BY clause and the order key column set from the base table is not included in the column_name_1 list||order key of the base table on which the index is defined.
This action provides the Optimizer with several essential baseline statistics.
- If a single-table join index column appears frequently in WHERE clause predicates, you should consider collecting statistics on it as well, particularly if that column is the sort key for a value-ordered single-table join index.