A primary index (PI) or primary AMP index (PA) is an index that, in addition to facilitating access to data, also determines how table rows are distributed among the AMPs of the system. A properly chosen PI or PA can facilitate access to data for common queries, and help ensure that rows are distributed evenly, so each AMP manages a similar amount of data. A PI or PA is defined to use from one to 64 columns of a table for indexing. For each table row, the values of these index columns are combined and hashed. The resulting numeric hash value determines which AMP stores and manages the data in that row. For a PI but not for a PA, the hash value is also used in ordering rows on an AMP. A PA is only allowed for a column-partitioned table.
Data Access Considerations for Choosing a PI or PA
When you choose the columns for a PI or PA, it is important to consider the nature of queries likely to be made against the table. In almost all cases when a query specifies the values for each of the indexed columns, query performance will be faster than the same query if the columns are not indexed.
A PI or PA allows the database to identify the single AMP that stores the data of interest for the specified values of the index, without requiring a costly database search on all the AMPs that store the rows of the table (called a full-table scan or FTS). Moreover, for a PI (but not a PA), the rows of interest can be more quickly found on that AMP since the hash value is used in the ordering of the rows on an AMP.
Data Distribution Considerations for Choosing a PI or PA
A map associates a unique set of hash values with each of the different AMPs on which the table data is stored. For a PI or PA, the map also determines which AMP receives a particular table row based on a hash value calculated from the values of the index columns in the row.. For more information on maps, see Maps.
When the database receives a request, such as a query or data insertion, each AMP is responsible for processing its portion of table data. The AMPs work in parallel to facilitate processing. If one or more AMPs have significantly more or less data to process than others, the database is said to have a skewed data distribution, and performance suffers. Requests cannot be completed until the AMPs with the most data to process have finished their work, reducing the efficiency and benefits of the parallel processing. By choosing an appropriate PI or PA, you can help ensure that table rows are distributed evenly among the AMPs defined by the map used by the table, and take full advantage of Vantage parallel processing.
A unique primary index (UPI) requires each row in a table to have a unique value for the combined index columns. This type of index often corresponds to the primary key constraint or to a unique constraint on the table. Because of the uniqueness of the index value for each row, this type of index generally provides even distribution of the table rows.
UPIs can occasionally result in a skewed data distribution if the number of rows in the table is small relative to the number of AMPs in the map used by the table. In this case, you can create a map that contains fewer AMPs and assign it to the table.
Nonunique PIs and PAs
A nonunique primary index (NUPI) or a PA (all PAs are nonunique) neither requires nor enforces uniqueness on the index values. Choosing an inappropriate set of index columns for a NUPI or PA may result in a skewed data distribution so careful consideration of the choice of index columns and the table’s map is important.
To enforce uniqueness of the index columns of a NUPI or PA, a separate unique secondary index (USI) can be defined. See Secondary Indexes for a discussion of USIs.
Joins and Colocation
Tables that use the same map and that have the same columns defining their PIs or PAs can have rows with the same index values distributed to the same AMPs. This fact can be used to help optimize system performance by ensuring tables that are frequently joined have corresponding rows for the join distributed to the same AMPs. Performing a join operation on rows that are located on the same AMPs is faster than joining rows that are stored on different AMPs. When rows to be joined are located on the same AMP, they are said to be colocated.
For tables sharing a sparse map, the tables must also be defined to have the same colocation name. For more information on table colocation, see Sparse Maps and Table Colocation.
Tables with No Primary Index
Teradata recommends that you explicitly specify a PI for most tables without column partitioning and a PA for most column-partitioned tables. However, there are certain types of tables and situations for which a PI or PA is unnecessary and undesirable, due to index processing overhead or an appropriate set of index columns is not known. In these cases, tables can be created as no-primary-index (NoPI) tables.
An example of such tables would be staging tables, to which bulk data is batch loaded. Such tables are generally temporary, and, after loading, the data is typically transferred to a PI or PA table using INSERT ... SELECT statements.
To help ensure an even distribution of NoPI table rows among AMPs, the rows (or blocks of rows) are randomly distributed during a load. This helps to assure a relatively even row distribution among the AMPs. For an INSERT ... SELECT statement, rows are locally transferred by default which might cause skew; optionally, a HASH BY clause can be specified to help ensure an even distribution of rows.