16.10 - Using Join Indexes - Teradata Database

Teradata Database Design

prodname
Teradata Database
vrm_release
16.10
created_date
June 2017
category
User Guide
featnum
B035-1094-161K

You can create a join index to perform any of the following operations:

  • Join multiple tables, optionally with aggregation, in a prejoin table.
  • Replicate all or a vertical subset of a single base table and distribute its rows by a primary index on a foreign key column to facilitate joins of very large tables by hashing them to the same AMP.
  • Aggregate one or more columns of a single table or the join results of multiple tables in a summary table.
  • Support querying only those rows that satisfy the conditions specified by its WHERE clause. This is known as a sparse join index.
  • If the index has a unique primary index, and a request specifies an equality condition on the columns that define the primary index for the index, then the index can be used for the access path in two-AMP join plans similarly to how USIs are used.

The guidelines for creating a join index are the same as those for defining any regular join query that is frequently executed or whose performance is critical. The only difference is that for a join index the join result is stored as a subtable and automatically maintained by Teradata Database.

Performance and Join Indexes

Requests that can use join indexes can run many times faster than queries that do not use them. Performance improves whenever the Optimizer can rewrite a request to use a join index instead of the base tables specified by the query.

A join index is most useful when its columns can cover most or all of the requirements in a request. For example, the Optimizer might consider using a covering index instead of performing a merge join.

Covering indexes improve the speed of join queries. The extent of improvement can be dramatic, especially for requests involving complex, large-table, and multiple-table joins. The extent of the improvement depends on how often an index can be used to rewrite a query.

In-place join indexes, where the columns of the covering index and the columns of the table to which it is to be joined both reside on the same AMP, outperform indexes that require row redistribution. An in-place, covering, aggregate join index that replaces 2 or more large tables in requests with complex joins, aggregations, and redistributions can enable a request to run hundreds of times faster than it would otherwise.

Partial Covering Multitable Join Indexes

Teradata Database optimizes queries to use a join index on a set of joined tables even if the index does not completely cover the columns referenced in the table if the following things are true.

  • The index includes either the Row ID or the columns of a unique index on the table containing a non-covered column referenced by the query.
  • The cost of such a plan is less than other competing query plans.

A partial covering multitable join index provides some of the query improvement benefits that covering join indexes offer without replicating all of the table columns required to cover requests in the join index, but an additional overhead from having to access base table rows to retrieve column values occurs when a non-covered column is specified in a request.

Covering Bind Terms

A bind term is a condition that connects an outer query and a subquery. If the connecting condition of a subquery is IN and the column it is connecting to in the subquery is unique, you can define a join index on the bind term columns. This provides one more type of index for the Optimizer to consider using in place of multiple base tables.

Using Single-Table Join Indexes

Single-table join indexes are useful in tactical applications because they can support alternative access paths to data. This is a good approach to consider when a tactical query carries a value in an equality condition for a column, such as a customer phone number, that is in the table but is not its primary index. This might be a customer key, for example. A single-table join index can be constructed using the available non-indexed column, the customer phone number, as its primary index, thereby enabling single-AMP access to the data and avoiding more costly all-AMP non-primary index access to the base table.

Single-table join indexes are also valuable when your applications often join the same large tables, but their join columns are such that some row redistribution is required. A single-table join index can be defined to contain the data required from one of the tables, but using a primary index based on the FK of the table, preferably the primary index of the table to which it is to be joined. A single-table join index can also be used as a virtual vertical partitioning of a base table, creating an index subtable that contains frequently accessed columns from a table with many columns that generally are not accessed.

Use of such an index greatly facilitates join processing of large tables, because the single-table index and the table with the matching primary index both hash to the same AMP.

The Optimizer evaluates whether a single-table join index can replace or partially cover its base table even when the base table is referenced in a subquery unless the index is compressed and the join is complex, such as an outer join or correlated subquery join.

Using Outer Joins to Define Join Indexes

If there is a need for a non-aggregate multitable join index between several large tables, considering using an outer-join to define your join index. This approach offers the following benefits.

  • The Optimizer will consider a join index defined using an outer-join for queries that reference only the outer tables of the defining outer join.
  • The join index preserves the unmatched rows in the outer join within the join index structure.

Defining Join Indexes with Inequality Conditions

You can define join indexes using inequality conditions.

To define inequality conditions between 2 columns of the same type, either from the same table or from two different tables, you must AND them with the other join conditions.

This enables the Optimizer to resole a request using a join index more frequently than would otherwise be possible, avoiding the need to access base data tables to retrieve the required data.

Defining Join Indexes on UDT Columns and Expressions

You can define join indexes on the following UDT columns and expressions in the select list and in single-table conditions in the WHERE and ON clauses.

  • UDT columns, including using a method that is associated with a UDT column
  • The following types of expressions.
    • Non-aggregate expressions
    • Non-OLAP expressions
    • Non-UDF expressions

This general support for UDT columns and expressions enables you to specify Period data type columns and BEGIN, END and P_INTERSECT expressions on a Period data type in the select list, WHERE clause, and ON clause of a join index definition.

Refreshing Join Indexes

The ALTER TABLE TO CURRENT statement enables you to refresh the content of a join index without having to drop it and recreate it.

The efficiency of the ALTER TABLE TO CURRENT alternative compared with dropping and recreating a join index depends on how often an ALTER TABLE TO CURRENT request is executed and the type of current date condition defined in the join index.

If the join index is refreshed infrequently and the current date condition requires a large volume of old rows to be removed and a large volume of new rows to be inserted, it might be more efficient to drop and recreate the join index.

Using Aggregate Join Indexes

Aggregate join indexes offer an extremely efficient, cost-effective method of resolving requests that frequently specify the same aggregation operations on the same column or columns. When aggregate join indexes are available, the system does not have to repeat aggregation calculations for every request.

You can define an aggregate join index on two or more tables or on a single table. A single-table aggregate join index includes:

  • A subset of the columns in a base table
  • Additional columns for the aggregate summaries of the base-table columns

You can create an aggregate join index using the GROUP BY clause and the following built-in aggregate functions.

  • SUM
  • COUNT
  • MAX
  • MIN

The following restrictions apply to defining an aggregate join index.

  • Only the COUNT, MAX, MIN, and SUM aggregate functions are valid in any combination.

    COUNT DISTINCT and SUM DISTINCT are not valid.

  • To avoid overflow, always type the COUNT, MAX, MIN, and SUM columns in an aggregate join index definition as FLOAT.

    Teradata Database enforces this restriction as follows.

IF you … THEN Teradata Database …
do not define an explicit data type for a COUNT, MAX, MIN, or SUM column assigns the FLOAT data type to it automatically.
define a COUNT, MAX, MIN, or SUM column as anything other than FLOAT returns an error and does not create the aggregate join index.

Many aggregate functions are based on the SUM, MAX, MIN, and COUNT functions, so even though you cannot specify many individual aggregate functions in an aggregate join index, you can combine these 4 functions in a number of ways to create an aggregate join index to resolve requests that use more complicated aggregate functions.

A simple example is using the COUNT and SUM functions to compute an average.



Join Indexes and the Optimizer

For each base table in a query, the Optimizer performs certain processing phases to decide how a database operation that uses a join index is to be processed.

In this phase… The optimizer...
Qualification evaluates up to 10 join indexes to choose the one with the lowest cost.

Qualification for the best plan includes one or more of the following benefits:

  • Smallest size to process
  • Most appropriate distribution
  • Ability to take advantage of covered fields within the join index
Analysis of results determines if this plan will result in unique results, analyzing only those tables in the query that are used in the join index.

Subsequent action depends on analysis of the results.

IF the results are... THEN the Optimizer...
unique skips the sort-delete steps used to remove duplicates.
nonunique determines whether eliminating all duplicates can still produce a valid plan, recognizing any case where the following things are true.
  • No column_name parenthetical clause exists
  • All logical rows will be accessed

System Processing of Join Indexes

The Optimizer does the following when it rewrites requests using a join index.

  • Selects cost-based query rewrites using the best available aggregate join index when several possible aggregate join indexes are available.
  • Provides a larger number of opportunities to perform cost-based rewrites of requests using aggregate join indexes for queries with subqueries, spooled derived tables, outer joins, COUNT(DISTINCT), and extended grouping sets.

    When you create multiple aggregate join indexes, the creation of the current aggregate join index makes use of an existing aggregate join index that is most efficient for the calculation of the aggregate join index being created so that the CREATE JOIN INDEX request has better performance.

    With the existence of multiple join indexes, including aggregate join indexes and non-aggregate join indexes, aggregate queries perform better with the cost-based rewrite and more chances to use an aggregate join index.

  • Uses join indexes with Partial GROUP BY optimizations during join planning, making it possible to produce better join plans.

Join Index Optimizations

The Optimizer uses join indexes in several ways, including the following.

  • Selects cost-based query rewrites using the best available aggregate join index when several possible aggregate join indexes are available.
  • Provides a larger number of opportunities to perform cost-based rewrites of requests using aggregate join indexes for queries with subqueries, spooled derived tables, outer joins, COUNT(DISTINCT) operations, and extended grouping sets.

    When a user creates multiple aggregate join indexes, the creation of the current aggregate join index makes use of an existing aggregate join index that is the most efficient for the calculation of this aggregate join index so that the CREATE JOIN INDEX request will have better performance.

    With the existence of multiple join indexes, including aggregate join indexes and non-aggregate join indexes, aggregate requests perform better with the cost-based rewrite and more chances to use an aggregate join index.

  • Uses join indexes with Partial GROUP BY optimizations during join planning, making it possible to produce better join plans.

Protecting a Join Index with Fallback

You can define fallback protection for a simple or aggregate join index.

With fallback, you can access a join index and the base table it references if an AMP fails, with little impact on performance.

Without fallback, an AMP failure has significant impact on both availability and performance as follows.

  • You cannot update the base table referenced by a join index even if that base table is defined with fallback.
  • The Optimizer cannot access a join index on a down AMP to create query plans. Performance can be degraded significantly when this occurs.

The cost of having fallback for a join index when executing a DML request that modifies a base table referenced by the join index is a slight increase in processing to maintain the fallback copy of the join index.

Collecting Statistics for Join Indexes

Hash indexes and single-table join indexes that are not defined as sparse join indexes inherit all statistics from their base table, including dynamic AMP samples and collected statistics.

Only sparse join indexes and multitable join indexes require statistics collection. It is particularly important that statistics be collected on the sparse-defining column in the WHERE clause of a sparse join index or the Optimizer might not select the sparse join index for use.

Consider collecting statistics to improve performance during the following operations.

  • Creation of a join index
  • Update maintenance of a join index

You need to submit separate COLLECT STATISTICS requests for the columns in the join index and the source columns in the base tables. This does not have a very high cost because Teradata Database can collect statistics while queries are accessing the underlying base tables of a join index.

Costing Considerations for Join Indexes

Join indexes, like secondary indexes, incur both space and maintenance costs. For example, INSERT, UPDATE, and DELETE operations must be performed twice: once for the base table and once for the join index.

Space Costs for Join Indexes

The following formula estimates the space overhead required for a join index.

Join Index Size = U × (F + O + (R × A))

where:

Parameter Description
F Length of the fixed column join_index_column_1
R Length of a single repeating column join_index_column_2
A Average number of repeated fields for a given value in join_index_column_1
U Number of unique values in the specified join_index_column_1
O Row overhead (assume 14 bytes)

Updates to the base tables can cause a physical join index row to split into multiple rows. The newly formed rows each have the same fixed field value but contain a different list of repeated field values. This applies specifically when the compressed join index format is being used.

The system, however, does not automatically recombine logically related split rows. To re-compact such rows, you must drop and recreate the join index.

Maintenance Costs for Join Indexes

The use of a join index entails the following.

  • Initial time consumed to calculate and create the index
  • Whenever a value in a join index column of the base table is updated, the join index must also be updated, including any required aggregation or pre-join effort.

However, if join indexes are suited to your applications, the improvements in request performance can far outweigh the costs.

Join indexes are maintained by generating additional AMP steps in the base table update execution plan. Those join indexes defined with outer joins usually require additional steps to maintain any unmatched rows.

Expect a single-table join index INSERT operation to have similar maintenance overhead as would an insert operation with an equivalent NUSI. UPDATE or DELETE operations, however, might incur greater overhead with a single-table join index, unless a value for the primary index of the join index is available at the time of the update.

Overhead for an in-place aggregate join index can be perhaps 3 times more expensive than maintaining the same table without that index. For an aggregate join index that redistributes rows, the maintenance overhead can be several times as expensive.

Maintenance overhead for multitable join indexes without aggregates can be small or very large, depending on the pre-join effort involved in constructing or changing a join index row. This could be up to 20 times or more expensive than maintaining the table without the index. The overhead is greater at higher hits per block, where hits means the number of rows in a block are touched.

Since Teradata Database writes a block only once regardless of the number of rows modified, as the number of hits per block increases:

  • The CPU path per transaction decreases (faster for the case with no join index than for the case with a join index)
  • Maintenance overhead for aggregate join indexes decreases significantly

If a DELETE or UPDATE request specifies a search condition on the primary index or secondary index of a join index, the join index may be directly searched for the qualifying rows and modified accordingly.

This direct-update approach is employed when the request adheres to these requirements:

  • A primary index or secondary index access path to the join index
  • If a join_index_column_2 is defined, little or no modification to the join_index_column_1 columns
  • No modifications to the join condition columns in the join index definition
  • No modifications to the primary index columns of the join index

It is not necessary to drop the join index before a backup. It is important, however, to drop join indexes before the underlying tables and databases are restored, should a restore ever be required. Otherwise an error is reported and the restore will not be done.

Join Indexes Versus NUSIs

A join index offers the same benefits as a standard secondary index in that it, like the standard secondary index, has the following properties.

  • Optional
  • User defined
  • Maintained by the system
  • Transparent to end users
  • Immediately available to the Optimizer
  • If a covering index, considered by the Optimizer for a merge join

However, a join index offers the following performance benefits over a NUSI.

IF a join index is… THEN performance improves by…
defined using joins on one or more columns from two or more base tables eliminating the need to perform the join step every time a joining query is processed.
used for direct access in place of some or all of its base tables, if the Optimizer determines that it covers most or all of the query. eliminating the I/Os and resource usage required to access the base tables.
limited to only certain data types of your choice, such as Date allowing direct access to the join index rows within the specified value-order range.
a single-table join index with a FK primary index reducing I/Os and message traffic because row redistribution is not required, since the following are hashed to the same AMP:
  • A single-table join index having a primary index based on the base table foreign key.
  • The table with the column set making up the foreign key.
defined with an outer join
  • Giving the same performance benefits as a single-table join index, for queries that reference only outer tables.
  • Preserving unmatched rows.
created using aggregates eliminating both the aggregate calculations and the join step for every query requiring the join and aggregate.

For more information on the syntax, applications, restrictions, and benefits of join indexes, see SQL Data Definition Language.