Advantages of Hashing Over Indexing
Why use hashing instead of indexing? The following list of advantages explains why
hashing is superior to indexing for the Teradata parallel architecture.
For the majority of queries, hashing provides consistently better performance because
rows are always distributed evenly across the AMPs, providing a consistently balanced
workload across parallel units. Because Teradata Database uses B*-trees on a fixed‑length
rowID rather than on the column values themselves, there are never more than two levels
of the B*-tree, so there is no need to reorganize because the trees never become unbalanced
and unordered. This is true for standard operation. System reconfiguration is necessary whenever you add nodes or AMPs to your system.
The Teradata file system B*-tree is implemented on the master index, cylinder index,
and data block descriptor structures (see “Master Index” on page 194, “Cylinder Index” on page 195, and “Data Block” on page 196).
B+ trees on variable‑length values can become unbalanced and then require a lengthy
reorganization.
Primary indexes or primary AMP indexes are not stored in an index subtable. Instead,
the rowid for an individual row is stored directly as part of the row header for the
row. This access to data offers the following advantages over indexing of the same
data.
Enhanced performance because of fewer I/Os
Decreased storage overhead because there are no primary index or primary AMP index
subtables
Hashing is far better suited for the parallel database architecture pioneered by Teradata
than is indexing. Because each table in a Teradata database is distributed horizontally
across the AMPs on a row-by-row basis, the storage and retrieval algorithms and indexes
used by standard file systems for storing and retrieving data are not as efficient
as the distributed parallel Teradata architecture.
Because of its hashed data placement and parallel architecture, Teradata Database
requires only a one-time decision on the primary index or primary AMP index for each
table: the system takes care of everything else.
Compare this with other relational products where a DBA must perform tasks such as
determining keys and locations and coding space allocation for each partition before
even beginning to create a table. Then the DBA must embed the partitioning assignments
into the CREATE TABLE statement and forever after monitor and maintain partition sizes
and perform periodic table reorganizations.
By carefully defining primary index or primary AMP index columns on frequently used
join constraints, you can force rows that are frequently joined to be colocated on
the same AMP. This is critical to join performance, because rows must be on the same
AMP if they are to be joined. By colocating join‑constrained rows, you avoid the need
to redistribute the rows from one or more tables across the BYNET to the AMP where
they will be joined with rows located on that AMP.