Principal Criteria for Selecting a Primary Index - Advanced SQL Engine - Teradata Database

Database Design

Product: Advanced SQL Engine, Teradata Database
Release Number: 17.05, 17.00
Published: June 2020
Language: English (United States)
Last Update: 2021-01-22
Document ID: B035-1094
Product Category: Teradata Vantage™

When assigning columns to be the primary index for a table, there are three essential factors to keep in mind: uniform distribution of rows, optimal access to the data, and the volatility of indexed column values.

You will sometimes encounter situations where the selection criteria conflict. For example, you might need to specify a NUPI instead of a UPI, or specify an alternate key as the UPI instead of the primary key.

There are additional criteria to evaluate when selecting the primary index for a queue table. See Selecting a Primary Index for a Queue Table for a description of those criteria.

Keep in mind that these criteria apply only to selecting a column set for the primary index. They do not apply to making a decision whether the primary index should be row-partitioned or not.

Be aware that with the exception of column-partitioned tables, Teradata Database assigns a default primary index to a table if you do not specify an explicit PRIMARY INDEX or NO PRIMARY INDEX in the CREATE TABLE request you use to create the definition for the table (see Primary Index Defaults).

Uniform Data Distribution

With respect to uniform data distribution, you should always consider the following factors.
  • The more distinct the primary index values, the better.
  • Rows having the same primary index value are distributed to the same AMP.
  • Parallel processing is more efficient when table rows are distributed evenly across the AMPs.
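
The effect of distinct values on row distribution can be illustrated with a small simulation. This is a minimal sketch only: the AMP count, the amp_for and rows_per_amp helpers, and the use of MD5 as a stand-in hash are all assumptions made for illustration, and none of them reflects the actual hashing algorithm used by Teradata Database.

```python
import hashlib
from collections import Counter

NUM_AMPS = 8  # hypothetical AMP count, chosen only for illustration


def amp_for(value, num_amps=NUM_AMPS):
    """Map a primary index value to an AMP number with a stand-in hash.

    MD5 is used only as a deterministic placeholder; it is not the
    hashing algorithm that Teradata Database uses.
    """
    digest = hashlib.md5(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_amps


def rows_per_amp(index_values, num_amps=NUM_AMPS):
    """Count how many rows land on each AMP for a list of index values."""
    counts = Counter(amp_for(v, num_amps) for v in index_values)
    return [counts.get(amp, 0) for amp in range(num_amps)]


# 10,000 rows indexed on a nearly unique column versus a 3-value column.
unique_like = rows_per_amp(range(10_000))
low_cardinality = rows_per_amp([i % 3 for i in range(10_000)])

print("distinct values :", unique_like)
print("3 values only   :", low_cardinality)
```

In this toy model, the column with only three distinct values can populate at most three AMPs, because rows with the same index value always map to the same AMP.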

Optimal Data Access

With respect to optimal data access using a primary index, you should consider the following factors.
  • The primary index should be chosen on the most frequently used access path.

    For example, if rows are generally accessed by a range query, you should consider defining a partitioned primary index on the table or join index that creates a useful set of partitions.

    If the table is frequently joined with a specific set of tables, then you should consider defining the primary index on the column set that is typically used as the join condition.

  • Primary index operations must provide the full primary index value.
  • Primary index retrievals on a single value are always single-AMP operations.
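
The requirement to supply the full primary index value can be sketched as follows. The two-column index on (store_id, order_id), the AMP count, and the MD5-based amp_for helper are hypothetical and serve only to show why a partial value cannot be hashed to a single AMP; this is not how Teradata Database computes row hashes.

```python
import hashlib

NUM_AMPS = 8  # hypothetical AMP count


def amp_for(index_value, num_amps=NUM_AMPS):
    """Stand-in row hash: maps a full index value (a tuple) to one AMP."""
    digest = hashlib.md5(repr(index_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_amps


def amps_to_probe(store_id=None, order_id=None, num_amps=NUM_AMPS):
    """Return the AMPs a lookup must touch for a (store_id, order_id) index.

    With both columns supplied, the hash identifies exactly one AMP.
    With a column missing, the hash cannot be computed, so every AMP
    must be searched.
    """
    if store_id is not None and order_id is not None:
        return [amp_for((store_id, order_id), num_amps)]
    return list(range(num_amps))


print(len(amps_to_probe(store_id=7, order_id=1001)))  # full value supplied
print(len(amps_to_probe(store_id=7)))                 # partial value only
```

Under these assumptions, the first call touches one AMP and the second touches all of them, which is one reason to keep the primary index column set as small as the access patterns allow.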

Index Column Volatility

The primary index column set should be rarely, and preferably never, updated.

Criteria for Selecting a Primary Index

The following guidelines and performance considerations apply to selecting a unique or a nonunique column set as the primary index for a table.
  • Choose columns for the primary index based on the selection set most frequently used to retrieve rows from the table even when that set is not unique (if and only if the values of the selection set are fairly equally distributed across the AMPs).
  • Choose columns for the primary index that do not have XML, BLOB, CLOB, BLOB-based UDT, CLOB-based UDT, XML-based UDT, Period, JSON, ARRAY, VARRAY, or VARIANT_TYPE data types.

    Distinct and structured UDT columns are valid components of a primary index, but UDT columns based on internal Teradata UDT types, such as the Period, Geospatial, ARRAY, and VARRAY types, are not.

  • Choose columns for the primary index that distribute table rows evenly across the AMPs. The more distinct the values for a column, the more even their distribution.
  • Choose as few columns as possible for the primary index to optimize its generality.

    All the columns in a composite primary index must be specified in a WHERE clause predicate before the Optimizer can select it for use as the retrieval mechanism.

  • If it is difficult to define a unique primary index for a table that must have one, you can generate arbitrary unique values for a single column if you define it as an identity column with the characteristics GENERATED ALWAYS and NO CYCLE.
  • Base the column selection on an equality search (if the primary index is a PPI, then the search is done within each non-eliminated populated partition). For equality constraints only, the system hashes directly to the row set that satisfies the condition.
  • Tables with a single-column primary key tend to assign the primary index to the primary key. This is referred to as a Unique Primary Index (UPI).
  • Tables with a multicolumn primary key tend to assign the primary index to one of the foreign key components of the primary key. This is referred to as a Nonunique Primary Index (NUPI).

  • Primary and other alternate key column sets often can provide useful uniqueness constraints as well as a powerful access and join method when the logical design for a table is physically realized. If the primary or other alternate keys for a table are not selected to be its primary index, you should consider assigning a uniqueness constraint, such as PRIMARY KEY, UNIQUE, or a USI, on those keys if the constraint would facilitate table access and joins.

    This recommendation is contingent on a number of complicated factors that must be considered before implementing unique constraints. See Using Unique Secondary Indexes to Enforce Row Uniqueness for a list of those factors.

  • A UPI access involves at most one row, while a NUPI access can involve multiple rows.
  • A UPI access does not require a spool, while a NUPI access often creates a spool.
  • Duplicate NUPI values are always stored on the same AMP and in the same data block if possible.
  • NUPI retrieval only requires one I/O operation (or two I/Os if the cylinder index is not memory-resident) when the rows are stored in the same data block.
  • An implicit value range, seen when using a BETWEEN predicate in a WHERE clause, results in a full-table scan, irrespective of any indexes defined for the table. The exceptions are the following:
    • PPI tables and join indexes, where row partition elimination can be exploited.
    • Hash and join index tables with a value-ordered NUPI, where value ordering can be exploited.
  • An explicit value range, seen when using an IN predicate in a WHERE clause, results in individual row hashing.

Considerations for Choosing a Primary Index

Selecting the optimum primary index for a table or uncompressed join index is often a complex task because some applications might favor one type of primary index, while other applications might perform better using a different primary index. Tables can have only one primary index, however, so you must select one that best suits the majority of the applications that a table serves. Of course, if the benefits justify the overhead costs, you can define multiple join indexes with different primary indexes.

You can always add indexes, such as secondary, hash, and join indexes, to facilitate particular applications. Be aware that these indexes all incur various overhead costs, including:
  • Disk space required to store their subtables.
  • Degraded system performance whenever base table rows are updated, because the index values for any indexed columns affected by that update must also be updated.

You should always consider these tradeoffs when planning your indexes, then be sure to test them to ensure that the assumptions that led to your choices are correct. For example, if you design a primary index with even row distribution as your principal criterion, analyze the actual distribution of table rows to ensure that they are evenly distributed.
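
One way to summarize an observed distribution is to compare the busiest AMP with the average. The sketch below assumes you have already obtained per-AMP row counts by some means; the skew_report function, the sample counts, and the particular skew formula are illustrative choices, not a Teradata-defined metric.

```python
def skew_report(rows_per_amp):
    """Summarize how evenly rows are spread across AMPs.

    rows_per_amp: list of row counts, one entry per AMP.
    Returns (average, maximum, skew_pct), where skew_pct is how far the
    busiest AMP exceeds the average, as a percentage of the average.
    """
    if not rows_per_amp:
        raise ValueError("need at least one AMP row count")
    average = sum(rows_per_amp) / len(rows_per_amp)
    maximum = max(rows_per_amp)
    skew_pct = 0.0 if average == 0 else (maximum - average) / average * 100.0
    return average, maximum, skew_pct


# Hypothetical per-AMP row counts for two candidate primary indexes.
even_counts = [1000, 1010, 990, 1000]
skewed_counts = [3400, 200, 200, 200]

print(skew_report(even_counts))    # small skew percentage
print(skew_report(skewed_counts))  # large skew percentage
```

A large skew percentage suggests that one AMP would do much more work than the others, which undermines the parallelism that an even distribution is meant to provide.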

For many applications, particularly those that use range queries heavily, a partitioned primary index can provide a better solution to these issues than a nonpartitioned primary index because it provides efficient access both via the primary index columns and via a constraint on the partitioning columns. As always, you should confirm that the partitioning actually improves query performance by carefully examining EXPLAIN reports and collecting the appropriate statistics.

You should always collect statistics on the PARTITION column and the partitioning columns.

The recommended practice for recollecting statistics is to set appropriate thresholds for recollection using the THRESHOLD options of the COLLECT STATISTICS statement. See "COLLECT STATISTICS" in Teradata Vantage™ - SQL Data Definition Language Syntax and Examples, B035-1144 for details on how to do this.

You should also weigh the costs of the index against the benefits it provides. This is particularly important if you have also defined a USI on the table because additional maintenance is required to enforce uniqueness, thus potentially neutralizing or even reducing the overall performance advantage of the index.

Creating a partitioned table does not guarantee that the Optimizer selects a plan that uses row-partition elimination. Partitioning might not be used for any of the following common reasons:
  • It is not applicable to the actual queries in the workload.
  • The Optimizer cost analysis for a query determines that another plan is less expensive.
  • The query does not conform to one or more of the restrictions that apply to row-partition elimination.

In some cases, a query plan with partitioning might not perform as well as one without partitioning.

Various partitioning strategies can be followed.
  • For some applications, defining the partition expressions such that each row partition has approximately the same number of rows might be an effective strategy.

    This task is far easier for single-level PPIs than for multilevel PPIs, though it can still be thought of as a goal to be approximated as best as possible.

  • For other applications, having a varying number of rows per partition might be desirable. For example, more frequently accessed data (such as for the current year) might be divided into finer partitions (such as weeks) but other data (such as previous years) may have coarser partitions (such as months or multiples of months).

    Note that partitioning in this manner can make altering the partitions more difficult.

  • Alternatively, defining each range with equal width, even if the number of rows per range varies, might be important.
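
The mixed-granularity strategy described above can be sketched by generating the date ranges that such a layout would contain. The years, the monthly_ranges and weekly_ranges helpers, and the choice of simple 7-day weeks starting on January 1 are assumptions for illustration; the sketch only computes range boundaries and does not generate Teradata partitioning expressions.

```python
from datetime import date, timedelta


def monthly_ranges(year):
    """Return 12 (start, end) date pairs, one per calendar month."""
    ranges = []
    for month in range(1, 13):
        start = date(year, month, 1)
        next_start = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
        ranges.append((start, next_start - timedelta(days=1)))
    return ranges


def weekly_ranges(year):
    """Return consecutive 7-day (start, end) pairs covering the year.

    The last range is cut short so that it ends on December 31.
    """
    ranges = []
    start = date(year, 1, 1)
    last_day = date(year, 12, 31)
    while start <= last_day:
        end = min(start + timedelta(days=6), last_day)
        ranges.append((start, end))
        start = end + timedelta(days=1)
    return ranges


# Hypothetical layout: coarse partitions for 2019, fine partitions for 2020.
partitions = monthly_ranges(2019) + weekly_ranges(2020)
print(len(partitions), "partitions in total")
```

Listing the ranges this way makes it easier to see how many partitions a layout produces and where the granularity changes, which is also where later alterations tend to become more difficult.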

The most important factors for row partitioning are accessibility and maximization of row partition elimination. In all cases, defining a primary index (or having no primary index) that distributes the rows of the table fairly evenly across the AMPs is critical for efficient parallel processing.

Partitioning Guidelines

The following guidelines provide a high-level set of criteria for making an initial evaluation of whether row partitioning would provide more benefits to a query workload than a nonpartitioned table.
  • Large tables and join indexes are usually better candidates for row partitioning than smaller tables and join indexes because there is not much benefit to partitioning a table or join index small enough that a full-table scan on the nonpartitioned table or join index takes only a few seconds.

    The exception to this is a small table that is row-partitioned identically to a larger table with which it is frequently joined and with which it shares its primary index.

  • When possible, you should row-partition on sets of columns that are frequently used as query conditions. For example, if half the queries against a table specify a date range that qualifies less than 25% of the rows, then that date column is a good candidate to be the partitioning column for the table.

    If there is no column that is frequently used as a query condition, then there is probably little or no advantage to row-partitioning the table.

  • All factors being equal, it is better to partition on a column set that is part of the primary index column set than to partition on a column that is not.

    The exception to this is if the primary index is rarely, if ever, used for row access or join operations.

  • Keep the number of row partitions relatively small. The key word in this guideline is relatively. The guideline also applies to multilevel partitioning, though it is more difficult to achieve there because the total number of combined partitions is the product of the number of partitions defined at each level, so the total can grow very quickly even when there are few partitions defined for each level.

    The exception to this guideline is if the primary index is rarely, if ever, used for row access or direct merge joins.

    For example, if all the queries against the table access at least one month of activity, there is little or no benefit to partitioning by week or day instead of by month. An exception to this is if bulk data loading times are greatly reduced by a finer partition granularity.

    See Scenario 4 for an example of evaluating these sorts of tradeoffs.

  • Keep the number of partitions small even if you plan to expand predetermined table operations in the future. You can always increase the number of partitions later when they are needed.
    If you collect and maintain fresh statistics on the PARTITION columns of tables, this consideration is much less important.

    You have greater flexibility with this guideline for single-level partitioned tables than for multilevel partitioned tables. Decreasing the number of partitions for a multilevel partitioning can be rather complicated because the number of combined partitions defined for such a table increases multiplicatively with each partition and with each level defined.

  • The same criteria for selecting the primary index column set for a nonpartitioned table also apply to partitioned tables.

    Either choose a primary index column set that provides good row distribution, avoids skew, and is commonly used to access individual rows, or do not use a primary index.

    Optimal row distribution and frequent access are sometimes conflicting considerations, so you must evaluate their relative merits and come to some compromise if that is the case.
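
The guideline about keeping the number of row partitions small is easier to judge for multilevel partitioning when the combined partition count is worked out explicitly. The following sketch assumes a hypothetical three-level design; the combined_partitions function and the level sizes are illustrative only.

```python
from math import prod


def combined_partitions(partitions_per_level):
    """Total combined partitions for a multilevel partitioning.

    partitions_per_level: number of partitions defined at each level.
    """
    return prod(partitions_per_level)


# Hypothetical design: 36 months x 50 regions x 10 channels.
print(combined_partitions([36]))          # single level
print(combined_partitions([36, 50, 10]))  # three levels
```

Even with modest counts at each level, the three-level design in this example yields 18,000 combined partitions, compared with 36 for the single-level design.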

Evaluating the Relative Merits of Partitioning Versus Not Partitioning

The following criteria provide a high-level means for evaluating the relative merits of partitioning or not partitioning for a table.

Potential advantages of row partitioning.
  • The greatest potential gain in row-partitioning a primary index is the ability to read a subset of table or join index rows instead of scanning them all.

    For example, a query that examines two months of sales data from a table that maintains two years of sales history can read about 1/12 of the table instead of having to scan it all.

    The advantages of row partition elimination can be even greater for multilevel partitioned tables (see the examples in the topic “Static Partition Elimination” in Teradata Vantage™ - SQL Request and Transaction Processing, B035-1142 for some remarkable scan optimizations).

    This provides the opportunity for a large performance boost to a wide range of queries. Importantly, the individuals who code those queries do not have to know the partitioning structure of the table and, as a result, there is no need to recode existing SQL applications.

  • Appropriate row partitioning can also facilitate faster batch data loads.

    For example, if a table is partitioned by transaction date, the loading of transactions for the current day can be dramatically enhanced, as can the deletion of rows from the table that are no longer necessary.

  • Row partitioning can make one or more existing secondary, hash, or join indexes redundant, which permits them to be dropped from the database.
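
The sales history example above can be checked with a short calculation. This sketch assumes one row partition per calendar month and roughly equal row counts in each partition; the fraction_scanned and month_index helpers and the specific dates are hypothetical.

```python
from fractions import Fraction


def month_index(year, month):
    """Number a calendar month so that consecutive months differ by 1."""
    return year * 12 + (month - 1)


def fraction_scanned(table_first, table_last, query_first, query_last):
    """Fraction of monthly partitions a date-range query must read.

    Each argument is a (year, month) pair, and both ranges are inclusive.
    Assumes one partition per month and roughly equal rows per partition.
    """
    first = max(month_index(*table_first), month_index(*query_first))
    last = min(month_index(*table_last), month_index(*query_last))
    total = month_index(*table_last) - month_index(*table_first) + 1
    needed = max(0, last - first + 1)
    return Fraction(needed, total)


# Two years of history (2019-01 through 2020-12); the query covers
# 2020-05 through 2020-06.
print(fraction_scanned((2019, 1), (2020, 12), (2020, 5), (2020, 6)))
```

For these inputs the result is 1/12, matching the two-months-out-of-two-years figure in the example.
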
Potential disadvantages of row partitioning.
  • Row partitioning can make single row (primary index) accesses to the table slower if a partitioning column is not a member of the primary index column set.

    This disadvantage can be offset somewhat by using one of the following strategies:

    • Choose a partitioning column that is a member of the primary index column set.
    • Define a unique secondary index that can be used to make single row accesses to the table.
    • Constrain the values of the partitioning column set to enable the Optimizer to eliminate some row partitions when the query search conditions permit.
  • Row partitioning can make direct merge joins of tables slower unless both tables are partitioned identically.

    This disadvantage can be offset when query search conditions allow some row partitions to be excluded from the join operation.

    The Teradata Database query optimizer has several special product join and merge join methods available to it just for joining row-partitioned tables. See Teradata Vantage™ - SQL Request and Transaction Processing, B035-1142 for descriptions and examples of these join methods.
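
The cost of a primary index access when the partitioning column is not part of the primary index can be pictured as a count of row partitions that must be searched. The sketch below is a simplified model, not a description of Optimizer internals; the partitions_to_probe function and the partition numbers are hypothetical.

```python
def partitions_to_probe(populated_partitions, partition_filter=None):
    """Row partitions searched for one primary index value.

    populated_partitions: iterable of partition numbers that contain rows.
    partition_filter: optional set of partition numbers that the query
        conditions allow; None means the query does not constrain the
        partitioning columns.
    """
    populated = set(populated_partitions)
    if partition_filter is None:
        return sorted(populated)
    return sorted(populated & set(partition_filter))


populated = range(1, 25)  # 24 populated monthly partitions (hypothetical)

print(len(partitions_to_probe(populated)))        # no partitioning constraint
print(len(partitions_to_probe(populated, {13})))  # constrained to one partition
```

In this model, an access without a constraint on the partitioning columns searches all 24 populated partitions, while a query that also constrains the partitioning column searches only one.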

As with other physical database design choices, you must always evaluate the respective tradeoffs of the decisions that are available to you by prototyping and testing their relative merits.