Join Geography | Join Planning/Optimization | Teradata Vantage - 17.10 - Join Geography - Advanced SQL Engine - Teradata Database

Teradata Vantage™ - SQL Request and Transaction Processing

Product
Advanced SQL Engine
Teradata Database
Release Number
17.10
Release Date
July 2021
Content Type
Programming Reference
User Guide
Publication ID
B035-1142-171K
Language
English (United States)

Join geography is a term that describes the relocation, if any, of rows required to perform a join. Remember that for two rows to be joined, they must be on the same AMP, so the Optimizer must determine the least costly row relocation strategy, and the relative cardinalities of the two relations in any join is the principal cost that must be considered. This is one of the main factors that compels having fresh statistics available for the Optimizer.

Join Distribution Strategies

The method the Optimizer chooses to use to distribute rows that are to be joined depends on several factors, but primarily on the availability of indexes and statistics. Keep the following facts about joins in mind:
  • Join costs increase as a function of the number of rows that must be relocated, sorted, or both.
  • Join plans for the same pairs of relations can change as a function of changes in the demographic properties of those relations.
There are four possible join distribution strategies the Optimizer can choose among, and more than one strategy can be used for any join depending on circumstances. The four join distribution strategies are the following:
  • Hash-based redistribution of one or both relations in the join.

    The EXPLAIN phrase to watch for is Redistribute.

  • Duplication of one or both relations in the join.

    The EXPLAIN phrase to watch for is Duplicate.

  • Locally building a relation.

    The EXPLAIN phrase to watch for is Locally Build.

  • Randomly redistributing one of the relations.

After relocating a relation, the relation may need to be sorted. The EXPLAIN phrase to watch for is Sort.

There is also a fifth option, which is not to redistribute any rows. This option represents the degenerate case of row redistribution where no redistribution is required to make the join, and it occurs only when the primary [AMP] indexes of the two relations are such that equal-valued rows of both relations hash to the same AMPs.

Examples of Different Row Redistribution Strategies

The following set of examples demonstrates four different row redistribution strategies, three using merge join and a fourth using product join.

Redistributing the Rows of One Table for a Merge Join

This example illustrates a merge join strategy, which includes hash redistributing the rows of one table and sorting them by the row hash value of the join column. A relocation strategy is pursued because the join condition in the query is on only one of the primary indexes of the joined tables.

The query that generates a merge join is the following:

SELECT *
FROM employee AS e INNER JOIN department AS d
ON e.dept = d.dept;
The tables below show the columns and data of the two tables to be joined.
Employee Table
enum

(PK, UPI)

name dept

(FK)

1 Higa 200
2 Kostamaa 310
3 Chiang 310
4 Korlapati 400
5 Sinclair 150
6 Kaczmarek 400
7 Eggers 310
8 Challis 310
Department Table
dept

(PK, UPI)

name
        150 Payroll
        200 Finance
        310 Manufacturing
        400 Engineering
The following graphic shows how individual rows would be relocated to make this join for a 4-AMP system:

Relocating individual rows to join a 4-AMP system

The example is run on a 4-AMP system. The query uses an equijoin condition on the dept columns of both tables. The system copies the employee table rows into spool and redistributes them on the row hash of e.dept. The merge join occurs after the rows to be joined have been relocated so they are on the same AMPs.

The relocation strategy occurs when one of the tables is already distributed on the join column row hash. The merge join is caused by the join column being the primary index of one (dept), but not both, of the tables.

Duplicating and Sorting the Rows of the Smaller Table on all AMPs, Building and Sorting a Local Copy of the Larger Table For a Merge Join

This example illustrates a different merge join strategy that consists of duplicating and sorting the smaller table in the join on all AMPs and locally building a copy of the larger table and sorting it on the employee table row hash.

This query and tables used for this example are the same as those used for Redistributing the Rows of One Table for a Merge Join, but the Optimizer pursues a different join geography strategy because the statistics for the tables are different. If the Optimizer determines from the available statistics that it would be less expensive to duplicate and sort the smaller table than to hash redistribute the larger table, it chooses the strategy followed by this scenario.

The following graphic shows how individual rows would be relocated to make this join for a 4-AMP system

Relocating individual rows to join a 4-AMP system

The system first duplicates the department table and then sorts it on the dept column for all AMPs. Next, the employee table is built locally and sorted on the row hash value of dept.

The final step in the process is to perform the actual merge join operation.

No Row Redistribution or Sorting for a Merge Join Because the Join Rows of Both Tables Are on the Same AMP

This example also illustrates a merge join strategy. This particular strategy does not require any duplication or sorting of rows because the joined rows of both tables hash to the same AMPs. This is a good example of using a NUPI for one table on the same domain as the UPI of the other table. This ensures that the rows having the same primary index values all hash to the same AMP. When the join condition is on those primary indexes, the rows to be joined are already on the same AMP, so no relocation or sorting need be done. All the system has to do is compare the rows that are already on the proper AMPs.

This example is run on a 4-AMP System.

The query that generates this merge join strategy is the following:

SELECT *
FROM employee AS e INNER JOIN employee_phone AS p
ON e.num = p.num;

The two tables to be joined are defined as follows:

Employee Table
enum

(PK, FK, NUPI)

name

(PK)

dept

(PK)

1 Higa 200
2 Kostamaa 310
3 Chiang 310
4 Korlapati 400
5 Sinclair 150
6 Kaczmarek 400
7 Eggers 310
8 Challis 310
Employee_Phone Table
enum area_code phone
1 213 5551576
2 213 5550703
3 408 5558822
4 415 5557180
5 312 5553513
6 203 5557461
7 301 5555885
8 301 5551616
The following graphic shows how individual rows would be relocated to make this join for a 4-AMP system:

Relocating individual rows to join a 4-AMP system

Duplicating the Smaller Table on Every AMP for a Product Join

This example illustrates a product join strategy, which includes duplicating the smaller table rows on every AMP of a 4-AMP system. The tables used for this example are the same as those used for Redistributing the Rows of One Table for a Merge Join.

The following query generates a product join:

SELECT *
FROM employee AS e INNER JOIN department AS d
ON e.dept > d.dept;

The product join is caused by the non-equijoin condition e.dept > d.dept (see Product Join). Based on the available statistics, the Optimizer determines that the department table is the smaller table in the join, so it devises a join plan that distributes copies of all those rows to each AMP. employee table rows, which are hash distributed on their UPI enum values, are not relocated. The product join operation returns only the rows that satisfy the specified join condition for the query after comparing every row in the employee table with every row in the duplicated department table.

The following graphic shows how individual rows are relocated to make this join for the 4-AMP system.

Relocating individual rows to join a 4-AMP system

Relocation Scenarios

The primary [AMP] index is the major consideration used by the Optimizer in determining how to join two tables and deciding which rows to relocate, though for tables that have no primary index, other criteria must be considered.

Three general scenarios can occur when two primary-indexed relations are joined using the merge join method (see Merge Join), as illustrated by the following table.

The following scenarios are ranked in order of best to worst, with 1 being best, where best means least costly:

Rank Scenario WHEN the join column set is … FOR these relations in the join operation …
   1        1 the primary [AMP] index both.
   2        2 the primary [AMP] index one.
   3        3 not the primary [AMP] index neither.