Advantages of Hashing Over Indexing
Why use hashing instead of indexing? The following list of advantages explains why
hashing is superior to indexing for the Teradata parallel architecture.
For the majority of queries, hashing provides consistently better performance because
rows are always distributed evenly across the AMPs, providing a consistently balanced
workload across parallel units. Because Teradata Database uses B*-trees on a fixed‑length
rowID rather than on the column values themselves, there are never more than two levels
of the B*-tree, so there is no need to reorganize because the trees never become unbalanced
and unordered. This is true for standard operation. System reconfiguration is necessary whenever you add nodes or AMPs to your system.
The Teradata file system B*-tree is implemented on the master index, cylinder index,
and data block descriptor structures (see “Master Index” on page 246, “Cylinder Index” on page 247, and “Data Block” on page 248).
B+ trees on variable‑length values can become unbalanced and then require a lengthy
Primary indexes are not stored in an index subtable. Instead, the rowhash for an individual
row is stored directly as part of the row header for the row. This direct access to
row data offers the following advantages over indexing of the same data.
Enhanced performance because of fewer I/Os
Decreased storage overhead because there are no primary index subtables
Hashing is far better suited for the parallel database architecture pioneered by Teradata
than is indexing. Because each table in a Teradata database is distributed horizontally
across the AMPs on a row-by-row basis, the storage and retrieval algorithms and indexes
used by standard file systems for storing and retrieving data are not as efficient
as the distributed parallel Teradata architecture.
Because of its hashed data placement and parallel architecture, Teradata Database
requires only a one-time decision on the primary index for each table: the system
takes care of everything else.
Compare this with other relational products where a DBA must perform tasks such as
determining keys and locations and coding space allocation for each partition before
even beginning to create a table. Then the DBA must embed the partitioning assignments
into the CREATE TABLE statement and forever after monitor and maintain partition sizes
and perform periodic table reorganizations.
By carefully defining primary index columns on frequently used join constraints, you
can force rows that are frequently joined to be co-located on the same AMP. This is
critical to join performance, because rows must be on the same AMP if they are to
be joined. By co-locating join‑constrained rows, you avoid the need to redistribute
the rows from one or more tables across the BYNET to the AMP where they will be joined
with rows located on that AMP.