15.00 - Hashing Mechanisms - Teradata Database

Teradata Database Design

prodname
Teradata Database
vrm_release
15.00
category
User Guide
featnum
B035-1094-015K

Hashing Mechanisms

In the case of hashing, an index key data value is transformed by a mathematical function called a hash function to produce an abstract value not related to the original data value in an obvious way. Hashed data is assigned to hash buckets, which are memory‑resident routing structures that correspond in a 1:1 manner to the relationship a particular hash code or range of hash codes has with an AMP location.

Hashing is similar to indexing in that it associates an index key with a relative row address.

Hashing differs from indexing in the following ways.

  • There is no obvious correspondence between a hash code and the location of the row it refers to, even though the hash code identifies the location of the row.
  • Different data values can and often do produce identical hash codes. This results in what is called a hash collision, because any rows having the identical rowhash value would be stored in the same disk location if there were no means for preventing that from happening.
  • Ordinarily, the rows for primary‑indexed tables in Teradata Database are distributed, or horizontally row partitioned, among the AMPs based on a hash code generated from the primary index, or primary hash key, for the table (see “Row Allocation for Primary‑Indexed Tables” on page 235). The only exceptions to this are rows from global temporary trace tables, NoPI tables, column‑partitioned tables, and column‑partitioned join indexes (see “Row Allocation for Teradata Parallel Data Pump” on page 237 and “Row Allocation for FastLoad Operations Into Nonpartitioned NoPI Tables” on page 238).

    Because Teradata Database hash functions are formally mature and mathematically sound, rows with unique primary indexes are always distributed in a uniformly random fashion across the AMPs, even when there is a natural clustering of key values. This behavior both avoids nodal hot spots and minimizes the number of key comparisons required to perform join processing when two rows have the same rowhash value.

    All primary indexes are hashed for distribution, and while the stored value for the primary index is retained as data, it is the hashed transformation of that value that is used for distribution and primary index retrieval of the row. The primary indexes of hash and join indexes can, in some circumstances, be stored in value‑order (there are restrictions on the data type and column width of any value‑ordered primary index for a hash or join index: see “CREATE HASH INDEX” and “CREATE JOIN INDEX” in SQL Data Definition Language for details), but their rows are still distributed to the AMPs based on the hash of their primary index value.

    Nonpartitioned and column‑partitioned NoPI table rows that are inserted into Teradata Database using Teradata Parallel Data Pump ARRAY insert processing are distributed among the AMPs based on a hash code generated from their Query ID (see “Row Allocation for Teradata Parallel Data Pump” on page 237).

    Nonpartitioned NoPI table rows that are FastLoaded onto a system are distributed among the AMPs using a randomization algorithm that is different from the standard hashing algorithm (see “Row Allocation for FastLoad Operations Into Nonpartitioned NoPI Tables” on page 238).

    Unique secondary indexes are also hashed and distributed to their appropriate index subtables based on that hash value, where they are stored in rowID order. Rows stored in this way are said to be hash-ordered. Non‑primary index retrievals usually benefit from defining one or more secondary hash keys on a table.

    To summarize, although Teradata Database uses the term index for these values, they are really hash keys, not indexes.