Hash Join | Join Planning/Optimization | Teradata Vantage - Hash Join - Advanced SQL Engine - Teradata Database

SQL Request and Transaction Processing

Product
Advanced SQL Engine
Teradata Database
Release Number
17.05
17.00
Published
June 2020
Language
English (United States)
Last Update
2021-01-24
dita:mapPath
ykx1561500561173.ditamap
dita:ditavalPath
lze1555437562152.ditaval
dita:id
B035-1142
lifecycle
previous
Product Category
Teradata Vantage™

About the Hash Join

Hash join is a method that performs better than merge join (see Merge Join) and equality product join (see Product Join) under certain conditions. Hash join is applicable only to equijoins. The performance enhancements gained with the hybrid hash join comes mainly from eliminating the need to sort the tables to be joined before performing the actual join operation. In some circumstances, an in-memory hash join provides better performance by making better use of the cache, bulk join condition evaluation, and the use of SIMD instructions.

Unless otherwise noted, the general information about hash joins in the hash join sections also applies to in-memory hash joins.
Vantage can use the various forms of hash join to handle the following join types:
  • Hash inner joins (classic, dynamic hash join, and dynamic hash join with dynamic row partition elimination)
  • Hash inclusion semijoins (dynamic hash join only; does not apply to in-memory hash joins)
  • Hash exclusion semijoins (dynamic hash join only; does not apply to in-memory hash joins)
  • Left, right, and full outer hash joins (classical and dynamic hash joins only)
  • Hash joins with dynamic row partition elimination (dynamic hash join only)

See Classic Hash Join and Dynamic Hash Joins for further information about these hash join types.

Depending on whether spooling the large table in the join might be avoided, the Optimizer might substitute a dynamic hash join (see Dynamic Hash Joins) or an in-memory dynamic hash join (see Dynamic In-memory Hash Joins) in place of a standard hash join. If join conditions are specified on a column set that is not the primary index, relocation of rows must be completed prior to the sort operation.

Hash joins, like other join methods, perform optimally when the statistics on the join columns are current. This is particularly important for hash join costing to assist the Optimizer in detecting skew in the data (see Effects of Data Skew on Hash Join Processing for details about the negative effect of skewed data on hash join performance).

Like other join methods, hash joins are costed, and the Optimizer compares their cost to the cost of other eligible join methods and plans to determine the least costly plan for a query.

The Optimizer also decides whether or not to create in-memory optimized spools.

In-memory optimized spools are column-partitioned spools with a maximum of 4 column partitions, with the following contents:
  • A partition for the hash values.
  • A partition for the join column with the best selectivity that is used in a equality join condition of fixed length.
  • A partition for all the fixed length columns used in the equality join condition.
  • A partition for all the variable-length columns, residual join condition columns, and projected columns.

Hash Join Terminology

Description of the hash join method requires several new terms, which are defined in the following table.

Term Definition
Build Table The smaller of the 2 join tables, so named because it is used to build the hash table.
Fanout The maximum number of hash join partitions to be created.

When rows are fanned out, they are partitioned into several hash partitions.

Hash join or In-memory hash join A join algorithm in which a hash table is built for the smaller of the two tables being joined based on the join column set. The larger table is then used to probe the hash table to perform the join.
Hash table A memory-resident table composed of several hash buckets, each of which has an ordered linked list of rows belonging to a particular hash range.
In-memory hash table A memory-resident table built from left input, designed to increase cache efficiency during join comparisons. It is also designed to enable SIMD instruction execution.
Probe table The larger of the two join tables in the hash join.

Rows from this table are used to probe rows in the hash table for a hash match before the join conditions are evaluated.

Partition A segment of a table in a hash join.

Tables are segmented into a number of hash join partitions.

Hash join partitions can be memory-resident, in which case they also constitute the hash table, or a spool.

The limit on the number of hash join partitions for a hash join operation is 50.

Note that hash join partitions have no relationship to the partitions of a row-partitioned or column-partitioned table or join index (see Teradata Vantage™ - SQL Data Definition Language Syntax and Examples, B035-1144 and Teradata Vantage™ - Database Design, B035-1094). They are entirely different, unrelated things.

Spillover During the hash table building process, Teradata Database allocates one segment at a time for storing the hash table rows.

When the maximum number of allocated segments (MaxHTSegs) is reached, spillover occurs and the hash join process must start by using the right row hash to probe the hash table looking for a match row to join.

Hash Joins and Performance

The Optimizer might use a hash join instead of a merge join for better performance in the following situations.
  • At least one join key is not indexed.
  • To provide a performance improvement for the join step.

By using a hash table, hash joins eliminate the sort used prior to a merge join.

See Controlling the Size of Hash Tables and In-memory Hash Tables for more information about optimizing the performance of hash joins.

Classic Hash Join

Classic hash join is applicable when the entire build table fits into available memory. The system reads rows from the build relation directly into a memory-resident hash table. The process then reads rows from the probe table and uses them to individually probe the hash table for a match. A result row is returned for each hash match that satisfies all the join conditions. No partitioning, and thus no partitioning-related I/O, is incurred in this case. Consequently, it is the fastest form of hash join.

Vantage uses a variant of classic hash join commonly referred to as hybrid hash join, as well as a variant of the hybrid hash join referred to as dynamic hash join (see Dynamic Hash Joins). The method deals with build tables that are too large to fit into available memory by dividing them into chunks called hash join partitions that are small enough to fit into memory (see Partitioning the Smaller Table of a Hash Join Pair for details about how the system does this).

The classic hash join can be applied to left, right, and full outer joins as well as inner joins.

The process applied by the hybrid hash join algorithm is provided by the following process:
  1. Read a row from the right table, which is an unsorted spool containing the row hash value for each row as well as its row data.
  2. Match each right row with all the left table rows having the same row hash.
  3. Join the rows.

The following graphic illustrates the hybrid hash join process:



Hash bucket entries cluster the entries with the same rowhash value together. This facilitates faster searches for a row using its row hash value because each hash bucket entry only needs to be visited once.

Classic In-memory Hash Join

All of the considerations described for classic hash joins also apply to in-memory hash joins. There are a few differences, one of which is that more memory may be used for an in-memory hash join.

The in-memory hash join algorithm is designed for bulk processing; that is, each step of the process that is described in Classic Hash Join is enhanced to operate on multiple rows. This algorithm helps to improve the memory access pattern and thus, cache efficiency.

Another difference is in the hash table data structure. The hash table structure for in-memory hash joins follows the in-memory spool partition format and stores values for each partition together.

Partitioning the Smaller Table of a Hash Join Pair

A hash table is derived from the smaller table in a hash join pair. It is a single-partition, memory-resident data structure that contains a hash array as well as the rows in the larger table that the hash array points to. The system configures the smaller table in a hash join operation as an ordered linked list of rows that belong to a particular range of hash values. Depending on is size, this table can be decomposed into several smaller partitions.

When the smaller table is too large to fit into the memory available for hash join processing, the system splits it into several smaller partitions. Each partition is small enough to fit into the available space. Partition size is controlled by the settings of several parameters in the DBS Control record (see Hash Join Control Variables).

A table can be divided into a maximum of 50 partitions. Partitioning avoids the maintenance overhead that would otherwise be required for virtual memory management whenever a hash bucket overflow condition occurs.

The system segments the smaller table using an algorithm that uses the join columns to hash rows into the number of hash join partitions required to make the join.

For example, suppose 6 hash join partitions are needed to make the join. That is, the number of qualifying rows for the smaller table is six times larger than the largest single hash join partition that can fit into the available memory. The system then hashes each row into one of the 6 hash join partitions.

The system spools and partitions the larger table in the same way. Although the partitions for the large table are also larger, they need not fit into memory. When the system makes the join, it brings a hash join partition of the smaller table, which is copied to a spool, into memory. Rows from the matching hash join partition in the other table, which is also in a spool, can then be matched to the rows in memory.

Note that a row from one hash join partition cannot match a row in the other table that is in a different hash join partition because their respective hash values are different.

Each left table hash join partition is then hash-joined in turn with its corresponding right table partition. The graphic in the section on Classic Hash Join shows partition 2 of the triple-partitioned left table being hash-joined with hash join partition 2 of the triple-partitioned right table.

The process of creating the hash join partitions is referred to as fanning out in EXPLAIN reports (see Hash Join Examples, where the phrases that describe the hash join and the partitioning of the hash join tables are highlighted in boldface).

Effects of Data Skew on Hash Join Processing

Data skew in the build table can seriously degrade the performance of hash joins. One of the premises of hash join is that the hashing algorithm is good enough to ensure that the build relation can be reduced into relatively equivalent-sized partitions. When there is a large quantity of duplicate row values in the build table, the hash partitioning algorithm might not partition it optimally. Skew in the probe table can also degrade performance if it results in the probe table being smaller than the build table.

To make allowances for possible skew in either table in a hash join operation, you can use the DBS Control utility to size build table partitions proportionately to their parent hash table (see Hash Join Control Variables).

If the specified skew allowance is insufficient to correct for data skew, and hash table bucket overflow occurs, then the system matches rows from the corresponding probe table against build table rows that are already loaded into the memory-resident hash table. After all the probe partition rows have been processed, the system clears the hash table and moves more rows from the oversized build table partition into memory. The system then re-reads rows from the probe partition and matches them against the freshly loaded build table rows. This procedure repeats until the entire build partition has been processed.

For in-memory hash joins, data skew in the build table can also degrade join performance. Skew is more problematic for in-memory hash joins due to the higher default size for an in-memory hash table. With higher hash table sizes, the Optimizer may choose direct access for the right table. But if there is hash table overflow, then a larger right table must be scanned multiple times. When using smaller hash table sizes, the planner may choose to partition the two inputs if the available memory is deemed insufficient to hold the entire left input. During in-memory hash join planning, the Optimizer adjusts the cost more conservatively for skew by inflating the row estimate.

Controlling the Size of Hash Tables and In-memory Hash Tables

You can control the size of the hash table using the HTMemAlloc and HTMemAllocBase DBS Control fields (see Hash Join Control Variables). If you specify a value of 0, the system cannot build a hash table. This effectively turns off hash join, and the Optimizer does not consider the method when it is doing its join plan evaluations.

If you need to increase the size of the in-memory hash table, contact your Teradata Global Support Center representative.

Hash Join Control Variables

You can access the hash join-related parameters HTMemAlloc and SkewAllowance using the DBS Control utility (see Teradata Vantage™ - Database Utilities , B035-1102 ). Use them to optimize the performance of your hash joins. HTMemAllocBase is an internal DBS Control field that can only be accessed by Teradata support personnel. Contact your technical support team if you suspect the value of this field needs to be changed.

HTMemAlloc does not apply to in-memory hash joins; SkewAllowance is used for in-memory hash joins.
DBS Control Field Function
HTMemAlloc Varies the hash table size by calculating a percentage of the HTMemAllocBase value. The larger the specified percentage, the larger the hash table. The default is 10. A setting of zero prevents hash joins from being used as optimizations.

This setting should be changed only under the direction of the Teradata Global Support Center.

HJ2IMHJ Determines whether the Optimizer preferentially chooses the in-memory hash join method for performing hash joins and dynamic hash joins. The default value is FALSE (not enabled).

When set to FALSE, the optimizer costs an in-memory hash join as a new join method and picks it based on cost. When set to TRUE, the optimizer uses the in-memory hash join method whenever possible, in preference to any other hash join method, regardless of cost.

SkewAllowance Provides for data skew by making the size of each partition smaller than the hash table.

Valid input values range from 20 to 80 (percent).

The default is 75 (percent), indicating that the partition size is set to 25 percent of the hash table to allow the actual partition to be four times more than the estimate made by the Optimizer before it is too large to fit into the hash table.

For more information about DBS Control fields, see Teradata Vantage™ - Database Utilities , B035-1102 .

Hash Join Examples

The Optimizer decides to hash join the Employee and Department tables on the equality condition Employee.Location = Department.Location in this query. The EXPLAIN text indicates that the hash tables in Spool 2 (step 4) and Spool 3 (step 5) are segmented (fanned out) into 22 hash join partitions (see Partitioning the Smaller Table of a Hash Join Pair).

Hash table memory allocation is set at 5% and skew allowance is set at 75% (see Effects of Data Skew on Hash Join Processing) for the partial EXPLAIN.

EXPLAIN
SELECT employee.empnum, department.deptname, employee.salary
FROM employee, department
WHERE employee.location = department.location;

The following shows a portion of the EXPLAIN output:

 ...
3) We lock PERSONNEL.Department for read, and we lock PERSONNEL.Employee for read.
4) We do an all-AMPs RETRIEVE step from PERSONNEL.Employee by way of an all-rows scan with no residual conditions into  Spool 2 fanned out into 22 hash join partitions, which is redistributed by hash code to all AMPs. The size of Spool 2 is estimated to be 3,995,664 rows. The estimated time for this step is 3 minutes and 54 seconds.
5) We do an all-AMPs RETRIEVE step from PERSONNEL.Department by way of an all-rows scan with no residual conditions into  Spool 3 fanned out into 22 hash join partitions, which is redistributed by hash code to all AMPs. The size of Spool 3 is estimated to be 4,000,256 rows. The estimated time for this step is 3 minutes and 54 seconds.
6) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of an all-rows scan, which is joined to Spool 3 (Last Use).  Spool 2 and Spool 3 are joined using a hash join of 22 partitions, with a join condition of (“Spool_2.Location = Spool_3.Location”). The result goes into Spool 1, which is built locally on the AMPs. The result spool field will not be cached in memory. The size of Spool 1 is estimated to be 1,997,895,930 rows. The estimated time for this step is 4 hours and 42 minutes.
7) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 4 hours and 49 minutes.
DBS Control Record - Performance Fields:
HTMemAlloc       =  5%
SkewAllowance    = 75%

The partial EXPLAIN from the following example shows how the Optimizer handles an in-memory hash join:

EXPLAIN
SELECT pit103.c2,pit104.c3,cpt1.c2
FROM pit104,pit103,cpt1
WHERE cpt1.c4 = pit103.c4 AND pit103.c3 = pit104.c4 AND cpt1.c2 > 100;
...
  5) We execute the following steps in parallel.
       1) We do an all-AMPs RETRIEVE step from TEST.pit104 by way of an
          all-rows scan with a condition of ("NOT (TEST.pit104.c4 IS
          NULL)") into Spool 2 (all_amps), which is built locally on
          the AMPs. Spool 2 is built as in-memory optimized spool with 
           3 column partitions. The size of Spool 2 is estimated with
          low confidence to be 20 rows (500 bytes). The estimated time
          for this step is 0.37 seconds.
       2) We do an all-AMPs RETRIEVE step from TEST.pit103 by way of an
          all-rows scan with a condition of ("(NOT (TEST.pit103.c4 IS
          NULL )) AND (NOT (TEST.pit103.c3 IS NULL ))") into Spool 3
          (all_amps), which is duplicated on all AMPs. Spool 3 is
          built as  in-memory optimized spool with 3 column partitions.
          The size of Spool 3 is estimated with low confidence to be
          400 rows (11,600 bytes). The estimated time for this step is
          0.37 seconds.
  6) We do an all-AMPs JOIN step from Spool 2 (Last Use), which is
     joined to Spool 3 (Last Use). Spool 2 and Spool 3 are joined
     using a  single partition in-memory hash join, with a join
     condition of ("c3 = c4"). The result goes into Spool 4 (all_amps),
     which is built locally on the AMPs. Then we do a SORT to order
     Spool 4 by the hash code of (TEST.pit103.c4). The size of Spool 4
     is estimated with no confidence to be 90 rows (2,250 bytes). The
     estimated time for this step is 0.33 seconds.
  7) We do an all-AMPs BULK RETRIEVE step from 3 column partitions of
     TEST.cpt1 with a condition of ("(TEST.cpt1.c2 >= 101) AND (NOT
     (TEST.cpt1.c4 IS NULL ))") into Spool 5 (all_amps), which is
     duplicated on all AMPs. Then we do a SORT to order Spool 5 by the
     hash code of (TEST.cpt1.c4). The size of Spool 5 is estimated
     with no confidence to be 140 rows (2,940 bytes). The estimated
     time for this step is 0.11 seconds.
  8) We do an all-AMPs JOIN step from Spool 4 (Last Use) by way of a
     RowHash match scan, which is joined to Spool 5 (Last Use) by way
     of a RowHash match scan. Spool 4 and Spool 5 are joined using a
     merge join, with a join condition of ("c4 = c4"). The result goes
     into Spool 1 (group_amps), which is built locally on the AMPs.
     The size of Spool 1 is estimated with no confidence to be 67 rows
     (3,618 bytes). The estimated time for this step is 0.56 seconds.
...

Dynamic Hash Joins

Dynamic hash join provides the ability to do a binary or n-way equality join directly between one or more small tables (or relations) and a large table on non-primary index columns without placing the large table into a spool. For dynamic hash join to be used, each of the small tables must be small enough to fit in a single hash join partition.

Dynamic hash join can be used only when two tables are joined based on non-primary index columns, and small tables, referred to as the left tables, are very small compared to the large table, referred to as the right table.

The process is as follows:
  1. Duplicate the smaller tables.
  2. Place each of the smaller tables in a hash array
  3. Read a row from the right table
  4. Compute the row hash code
  5. Match each right row with all rows in each of the hash arrays having the same row hash.
  6. Join the matching rows.

The dynamic hash join can be applied to left, right, and full outer joins as well as inner joins, and can also take advantage of equality conditions for dynamic partition elimination.

The inclusion hash join and the exclusion hash join can also use dynamic partition elimination when you specify an equality condition.

Dynamic In-memory Hash Joins

All of the considerations described for dynamic hash joins also apply to in-memory hash joins. There are a few differences, one of which is that the amount of memory in an in-memory hash join can go up to the default value of 100 MB or up to the value set (see Controlling the Size of Hash Tables and In-memory Hash Tables).

The in-memory dynamic hash join algorithm is designed for bulk processing, and enhanced to operate on multiple rows. This algorithm helps to improve the memory access pattern and thus, cache efficiency.

The hash table data structure that is used for an in-memory dynamic hash join is the same as that used for the classic in-memory hash join.

AllRowsOneAMP In-Memory Hash Joins

This type of hash join is one in which the left table is duplicated to one AMP. The in-memory hash join step reads the rows from one AMP, builds the hash table, and duplicates the hash table segments to all AMPs. All the AMPs receive the hash table segments and continue with probe and output building. One copy of the hash table segments are shared by all the AMPs within a node.

The in-memory hash join is the only consumer of AllRowsOneAMP spool.

This improves I/O performance, compared to using duplicate spools for in-memory hash joins. The performance improvement increases as the number of AMPs in the system increases.

There are two modes of execution:
  • In default mode, a classic or dynamic in-memory hash join where the left table is duplicated will become an AllRowsOneAMP in-memory hash join. (This is not the case for DPE in-memory hash joins and in-memory hash joins that are part of PRPD joins.)
  • In resource consumption mode, in addition to default mode behavior, a classic hash join where both tables are redistributed will become an AllRowsOneAMP-Direct or AllRowsOneAMP-Local in-memory hash join. The decision between Direct or Local is cost-based.

In both modes, the Optimizer makes heuristic decisions. Resource consumption mode should be used only to improve resource consumption (I/O, CPU) at the expense of the elapsed time. Resource consumption mode is disabled by default. To enable it, contact the Teradata Support Center.

Hash Joins, In-Memory Hash Joins, and Performance

The Optimizer may use a hash join instead of a merge join of tables for better performance if at least one join key is not indexed.

The hash join eliminates the sort used prior to the merge join by using a hash table instead.

You can optimize system usage of hash joins with the following DBS Control fields:

DBS Control Field Best Initial Setting
HTMemAlloc
Changing HTMemAlloc does not affect in-memory hash joins.
  2
SkewAllowance 75

For information on setting hash join-related DBS Control fields, see the DBS Control fields “HTMemAlloc,” “HJ2IMHJ,” and “Skew Allowance” in Teradata Vantage™ - Database Utilities , B035-1102 .

For information on strategies for using hash join-related DBS Control fields, see “Adjusting Skew Allowance” in Teradata Vantage™ - Database Administration, B035-1093.

Recommendations for In-Memory Hash Joins

The following are recommendations when using in-memory hash joins:
  • Make the better selective column a fixed-length column and make the fixed-length column either a column partition by itself or a narrow partition.
  • Collect multicolumn statistics if join conditions involve multiple columns because incomplete statistics can make the cardinality of the table incorrect. This situation can lead the Optimizer to think that the table can fit in the available memory (by default, 100 MB) and therefore it does not cost hash table overflows, which eventually lead to a lesser estimated cost. This same situation may not occur with a hash join, since it uses less memory for building the hash table and it also uses fan-out, which performs better. Based on these facts, it is recommended that you collect multicolumn statistics on the join columns from the left table.
  • Keep your statistics updated to avoid the problem mentioned in the previous bullet.

Outer Hash Joins

Teradata Database supports left outer hash joins, right outer hash joins, and full outer hash joins. These types of joins support in-memory hash join methods. The processing for each of these join types is described in the topics that follow.

Exclusion Hash Join

An exclusion hash join is a hash join where only the rows that do not satisfy (meaning that they are NOT IN) any condition specified in the request are joined. In other words, exclusion hash join finds rows in the first table that do not have a matching row in the second table.

Exclusion hash join is an implicit form of outer join. Exclusion hash joins are not supported for in-memory hash joins.

Inclusion Hash Join

An inclusion hash join is a hash join where the first right relation row that matches the left relation row is joined to it.

Inclusion hash joins are not supported for in-memory hash join methods.