Hashing Mechanisms - Teradata Database

Teradata Database Design

Product
Teradata Database
Release Number
15.10
Language
English (United States)
Last Update
2018-10-06
Product Category
Software

Hashing Mechanisms

In the case of hashing, an index key data value is transformed by a mathematical function called a hash function to produce an abstract value not related to the original data value in an obvious way. Hashed data is assigned to hash buckets, which are memory‑resident routing structures that correspond in a 1:1 manner to the relationship a particular hash code or range of hash codes has with an AMP location.

Hashing is similar to indexing in that it associates an index key with a relative row address.

Hashing differs from indexing in the following ways.

  • There is no obvious correspondence between a hash code and the location of the row it refers to, even though the hash code identifies the location of the row.
  • Different data values can and often do produce identical hash codes. This results in what is called a hash collision, because any rows having the identical rowhash value would be stored in the same disk location if there were no means for preventing that from happening.
  • Ordinarily, the rows for primary‑indexed and primary-AMP indexed-tables in Teradata Database are distributed, or horizontally row partitioned, among the AMPs based on a hash code generated from the primary or primary AMP index or primary hash key for the table (see “Row Assignment for Primary‑Indexed Tables” on page 183). The only exceptions to this are rows from global temporary trace tables, NoPI tables, and NoPI join indexes (see “Row Assignment for NoPI Tables” on page 185 and “Row Assignment for FastLoad Operations Into Nonpartitioned NoPI Tables” on page 186).

    Because Teradata Database hash functions are formally mature and mathematically sound, rows with unique primary indexes are always distributed in a uniformly random fashion across the AMPs, even when there is a natural clustering of key values. This behavior both avoids nodal hot spots and minimizes the number of key comparisons required to perform join processing when two rows have the same rowhash value.

    All primary indexes and primary AMP indexes are hashed for distribution, and while the stored value for the index is retained as data, it is the hashed transformation of that value that is used for distribution and primary index or primary AMP index retrieval of the row. The primary indexes of hash and join indexes can, in some circumstances, be stored in value‑order (there are restrictions on the data type and column width of any value‑ordered primary index for a hash or join index: see “CREATE HASH INDEX” and “CREATE JOIN INDEX” in SQL Data Definition Language for details), but their rows are still distributed to the AMPs based on the hash of their primary index value.

    NoPI table rows that are inserted into Teradata Database using Teradata Parallel Data Pump ARRAY insert processing are distributed among the AMPs based on a hash code generated from their Query ID (see “Row Assignment for NoPI Tables” on page 185).

    Nonpartitioned NoPI table rows that are FastLoaded onto a system are distributed among the AMPs using a randomization algorithm that is different from the standard hashing algorithm (see “Row Assignment for FastLoad Operations Into Nonpartitioned NoPI Tables” on page 186).

    Note: FastLoad cannot be used to load rows into a partitioned NoPI table.

    Unique secondary indexes are also hashed and distributed to their appropriate index subtables based on that hash value, where they are stored in rowID order. Rows stored in this way are said to be hash-ordered. Non‑primary index and non-primary AMP index retrievals usually benefit from defining one or more secondary hash keys on a table.

    To summarize, although Teradata Database uses the term index for these values, they are really hash keys, not indexes.