Merge Join | Join Planning/Optimization | Teradata Vantage - Merge Join - Analytics Database - Teradata Vantage

SQL Request and Transaction Processing

Deployment
VantageCloud
VantageCore
Edition
Enterprise
IntelliFlex
VMware
Product
Analytics Database
Teradata Vantage
Release Number
17.20
Published
June 2022
Language
English (United States)
Last Update
2024-10-04
dita:mapPath
zfm1628111633230.ditamap
dita:ditavalPath
qkf1628213546010.ditaval
dita:id
evd1472255317510
lifecycle
latest
Product Category
Teradata Vantageā„¢

A merge join retrieves rows from two tables by processing equality joins that probe one table using row key or row hash values from the qualified rows in the other table. A merge join assumes that the probed table is sorted either by row key or row hash. The operation then puts the joined rows onto a common AMP based on the row key or row hash of the join columns. The database sorts the rows into join column row key or row hash sequence, then joins those rows that have matching join column row key or row hash values.

In a merge join, the columns on which tables are matched are also the columns on which both tables, or redistributed spools of tables, are ordered. Merge join is generally more efficient than a product join (see Product Join) because merge join requires fewer comparisons and because blocks from both tables are read only once.

Merge Join Methods

Each of these general merge join algorithms, described in more detail in the sections that follow, can also be applied to the merge join methods.
  • Fast path inner merge join

    Fast path inner inclusion merge join

  • Slow path inner merge join

    Slow path inner inclusion merge join

  • Exclusion merge join
  • Fast path left outer merge join

    Fast path left outer inclusion merge join

  • Slow path left outer join

    Slow path left outer inclusion merge join

  • Fast path right outer merge join
  • Slow path right outer merge join
  • Full outer merge join
Another class of merge join methods is used only with row-partitioned tables.
  • Direct merge join
  • Rowkey-based merge join
  • Single window merge join
  • Sliding window merge join

Dynamic row partition elimination is not supported for merge joins of multilevel partitioned tables. For more information, see Product Joins With Dynamic Row Partition Elimination and Teradata Vantageā„¢ - Database Design, B035-1094.

Approaches to Merge Joining Row-partitioned PI Table to Unmatched Relation

The Optimizer has the following general approaches when joining a row-partitioned PI table to a nonpartitioned table or when joining two row-partitioned PI tables with different partitioning expressions:
  • Spool the row-partitioned PI table (or both row-partitioned PI tables) into a nonpartitioned spool in preparation for a traditional merge join.
  • Spool the nonpartitioned table (or one of the two row-partitioned PI tables) into a row-partitioned PI spool, with identical partitioning to the remaining table, in preparation for a rowkey-based merge join (see Rowkey-Based Merge Join).

    This option is not always available.

  • Use the sliding window merge join of the tables without spooling either one (see Sliding-Window Merge Join).

    In this case, both tables must have the same primary index and the join must be an equality join between the primary index columns and neither can be column-partitioned.

In all cases, the Optimizer considers all reasonable join strategies and selects the one with the least estimated cost.

Generic Merge Join Strategy

The following is the high level process applied by the merge join algorithm:

  1. Identify the smaller relation of the pair to be joined.
  2. The Optimizer pursues the following steps only if necessary to place qualified rows into a spool.
    1. Place the qualifying rows from one or both relations into separate spools.

      As the qualified rows are placed into spool, relocate those rows to their target AMPs based on the rowkey or hash of the join column set.

    2. Sort the qualified spool rows on their join column rowkey or row hash values.
  3. Compare the relocated row set with matching join column row hash values in the other relation.

Dynamic Row Partition Elimination for Merge Join with Equality Join Terms

When there are equality join terms on all primary index columns of a single-level row-partitioned PI table plus certain forms of range join terms on the partitioning column, then the Optimizer may use a merge join with dynamic row partition elimination to make the join.

You can think of this as a form of sliding window merge join (see Sliding-Window Merge Join) with the row-partitioned PI table in which the other table being joined is partitioned the same way as the row-partitioned PI table, and a subset of the partitions in one table is joined to a subset of the row partitions in the other table as determined by the specified range terms.

Dynamic row partition elimination-applicable join terms must be specified in one the following forms:

{ ppi_t1.partitioning_column_1 + C1 operator  ppi_t2.partitioning_column_2 |
  ppi_t2.partitioning_column_2 operator   ppi_t1.partitioning_column_1 + C2 |
  ppi_t1.partitioning_column_1 + C1 operator ppi_t2.partitioning_column_2 |
  ppi_t2.non_partitioning_column_2 operator ppi_t1.ppi_column_1 + C2
}
ppi_t1.partitioning_column_1
A partitioning column for table ppi_t1.
C1
C2
A constant.
operator
One of the following join operators:
  • <
  • <=
  • =>
  • >
ppi_t2.partitioning_column_2
A partitioning column for table ppi_t2.
ppi_t2.non_partitioning_column_2
A nonpartitioning column for table ppi_t2.
ppi_t1.ppi_column_1
A column for table ppi_t1.

Merge Join with Dynamic Row Partition Elimination Not Supported for Character-Partitioned Tables

Because you cannot arithmetically add strings together, and because string intervals are not defined, merge join dynamic row partition elimination is not supported for character-partitioned tables.

Slow Path Inner Merge Join

The process applied by the slow path merge join algorithm is as follows:

  1. Read each row from the left table.
  2. Join each left table row with the right table rows having the same hash value.

Fast Path Inner Merge Join

The process applied by the fast path merge join algorithm is as follows:

  1. Read a row from the left table and record its hash value.
  2. Read the next row from the right table that has a row hash >= to that of the left table row.
Row Hash Values Optimizer Action
Equal Join the two rows.
Not equal Use the larger row hash value to read the row from the other table.

The following graphic illustrates the generic fast path merge join process:



Merge Join Distribution Strategies

Merge joins use one of the following distribution strategies:

Strategy Number Strategy Name Stage Process
1 Hash Redistribution 1 Hash redistribute one or both sides to appropriate number of AMPs (depending on the primary indexes used in the join).

If a table has a Primary AMP on the join columns, the database can spool locally instead of using a hash redistribution, but stage 2 for the table is still needed.

2 Sort the rows into join column row hash sequence.
2 Duplication 1 Duplicate the smaller side on all AMPs.
2 Sort the rows into the row hash sequence of the join column.
3 Locally copy the other side and sort the rows into row hash sequence of the join column.
3 Index Matching 1 No redistribution is required if the primary indexes are the join columns and match.

Merge Join Examples

The following SELECT request determines who works in what location by merge joining the employee and department tables on the condition that the dept_no values in both tables are equal:

SELECT name, dept_name, loc
FROM employee INNER JOIN department
WHERE employee.dept_no = department.dept_no;

The following list presents the stages of processing this merge join.

  1. Because department rows are distributed according to the hash code for dept_no (the unique primary index of the department table), employee rows are themselves redistributed by the hash code of their own dept_no values.
  2. The redistributed employee rows are stored in a spool on each AMP and sorted by the hash value for dept_no.

    This puts the redistributed rows in the same order as the department rows on the AMP.

  3. The hash value of the first row from the department table is used to read the first row with the same or bigger row hash from the employee spool.

    That is, rows from either table of a merge join are skipped where possible.

  4. If there is a hash codes match, an additional test is performed to verify that the matched rows are equivalent in dept_no value as well as hash value.
  5. If there is no hash code match, then the larger of the two hash codes is used to position to the other table.

    The hash code and value comparisons continue until the end of one of the tables is reached.

  6. On each AMP the name, dept_no, and loc values from each of the qualifying rows are placed in a result spool.
  7. When the last AMP has completed its merge join, the contents of all result spools are merged and returned to the user.

When multiple rows fail to meet a constraint, the hash-match-reposition process may skip multiple rows. Skipping disqualified rows can speed up the merge join execution, especially if the tables are large.

The following graphic illustrates this merge join process:



The next merge join example uses the following table definitions:

Employee Table
e_num e_name dept
1 Brown 200
2 Smith 310
3 Jones 310
4 Clay 400
5 Peters 150
6 Foster 400
7 Gray 310
8 Baker 310

Column e_num is the UPI and PK of the table, and dept is a FK.

Department Table
dept dept_name
400 Education
150 Payroll
200 Finance
310 Mfg

Column dept is the UPI and PK of the table.

One of the merge join row redistribution methods is used if you perform the following SELECT request against these tables:

SELECT *
FROM employee INNER JOIN department
WHERE employee.dept = department.dept;

Merge Join Row Distribution (Hash Redistribution)

The following graphic shows a merge join row distribution using hash redistribution:


Merge join row distribution (hash redistribution)

Merge Join Row Distribution (Duplicate Table)

The following graphic shows a merge join row distribution by duplicating a table:


Merge join row distribution (duplicating a table)

This example uses the following employee and employee_phone table definitions:

Employee Table
e_num e_name
1 Brown
2 Smith
3 Jones
4 Clay
5 Peters
6 Foster
7 Gray
8 Baker

Column e_num is the UPI and PK of the table.

Employee_phone Table
e_num area_code phone
1 213 4950703
1 408 3628822
3 415 6347180
4 312 7463513
5 203 8337461
6 301 6675885
8 301 2641616
8 213 4950703

Column e_num is the NUPI and a FK of the table, and all three columns constitute the PK.

The matching indexes row distribution strategy is used if you perform the following SELECT request against these tables.

SELECT *
FROM employee INNER JOIN employee_phone
WHERE employee.e_num = employee_phone.e_num;