Join Geography | Join Planning/Optimization | Teradata Vantage - Join Geography - Analytics Database - Teradata Vantage

SQL Request and Transaction Processing

Deployment
VantageCloud
VantageCore
Edition
Enterprise
IntelliFlex
VMware
Product
Analytics Database
Teradata Vantage
Release Number
17.20
Published
June 2022
Language
English (United States)
Last Update
2024-10-04
dita:mapPath
zfm1628111633230.ditamap
dita:ditavalPath
qkf1628213546010.ditaval
dita:id
evd1472255317510
lifecycle
latest
Product Category
Teradata Vantageā„¢

Join geography describes the relocation, if any, of rows required to perform a join. Two rows to be joined must be on the same AMP, so the Optimizer must determine the least costly row relocation strategy, and the relative cardinalities of the two relations in any join is the principal cost that must be considered. This is one of the main factors that compels having fresh statistics available for the Optimizer.

Join Distribution Strategies

The method the Optimizer chooses to use to distribute rows that are to be joined depends on multiple factors, but primarily on the availability of indexes and statistics. Keep the following facts about joins in mind:
  • Join costs increase as a function of the number of rows that must be relocated, sorted, or both.
  • Join plans for the same pairs of relations can change as a function of changes in the demographic properties of those relations.

There are four possible join distribution strategies the Optimizer can choose among, and more than one strategy can be used for any join depending on circumstances. The four join distribution strategies are the following:

Join Distribution Strategy EXPLAIN Phrase That Reflects Strategy
Hash-based redistribution of one or both relations in the join Redistribute
Duplication of one or both relations in the join Duplicate
Locally building a relation Locally Build
Randomly redistributing one of the relations Sort

A fifth option is not to redistribute any rows. This option represents the degenerate case of row redistribution where no redistribution is required to make the join, and occurs only when the primary [AMP] indexes of the two relations are such that equal-valued rows of both relations hash to the same AMPs.

Examples of Different Row Redistribution Strategies

The following set of examples demonstrates four different row redistribution strategies, three using merge join and a fourth using product join.

Redistributing the Rows of One Table for a Merge Join

This example illustrates a merge join strategy, which includes hash redistributing the rows of one table and sorting those rows by the row hash value of the join column. A relocation strategy is pursued because the join condition in the query is on only one of the primary indexes of the joined tables.

The query that generates a merge join is the following:

SELECT *
FROM employee AS e INNER JOIN department AS d
ON e.dept = d.dept;
The following tables show the columns and data of the two tables to be joined.
Employee Table
enum

(PK, UPI)

name dept

(FK)

1 Higa 200
2 Kostamaa 310
3 Chiang 310
4 Korlapati 400
5 Sinclair 150
6 Kaczmarek 400
7 Eggers 310
8 Challis 310
Department Table
dept

(PK, UPI)

name
150 Payroll
200 Finance
310 Manufacturing
400 Engineering
The following graphic shows how individual rows are relocated to make this join for a 4-AMP system:


The example is run on a 4-AMP system. The query uses an equijoin condition on the dept columns of both tables. The system copies the employee table rows into spool and redistributes those rows on the row hash of e.dept. The merge join occurs after the rows to be joined have been relocated to the same AMPs.

The relocation strategy occurs when one of the tables is already distributed on the join column row hash. The merge join is caused by the join column being the primary index of one (dept), but not both, of the tables.

Duplicating and Sorting Smaller Table on All AMPs, Building and Sorting Local Copy of Larger Table for Merge Join

This example illustrates a different merge join strategy that consists of duplicating and sorting the smaller table in the join on all AMPs and locally building a copy of the larger table and sorting the copy on the employee table row hash.

This query and tables used for this example are the same as those used for Redistributing the Rows of One Table for a Merge Join, but the Optimizer pursues a different join geography strategy, because the statistics for the tables are different. If the Optimizer determines from the available statistics that duplicating and sort the smaller table is less expensive than hash-redistributing the larger table, the Optimizer chooses the strategy followed by this scenario.

The following graphic shows how individual rows are relocated to make this join for a 4-AMP system

Relocating individual rows to join a 4-AMP system

The system duplicates the department table and sorts on the dept column for all AMPs, then builds the employee table locally and sorted on the row hash value of dept.

The final step in the process is to perform the actual merge join operation.

No Row Redistribution or Sorting for a Merge Join Because the Join Rows of Both Tables Are on the Same AMP

This example illustrates a merge join strategy that requires no duplication or sorting, because the joined rows of both tables hash to the same AMPs. This is a good example of using a NUPI for one table on the same domain as the UPI of the other table. This practice makes sure that rows with the same primary index values hash to the same AMP. When the join condition is on those primary indexes, the rows to join are already on the same AMP. The system only has to compare the rows that are already on the proper AMPs.

This example is run on a 4-AMP System.

The query that generates this merge join strategy is the following:

SELECT *
FROM employee AS e INNER JOIN employee_phone AS p
ON e.num = p.num;

The two tables to be joined are defined as follows:

Employee Table
enum

(PK, FK, NUPI)

name

(PK)

dept

(PK)

1 Higa 200
2 Kostamaa 310
3 Chiang 310
4 Korlapati 400
5 Sinclair 150
6 Kaczmarek 400
7 Eggers 310
8 Challis 310
Employee_Phone Table
enum area_code phone
1 213 5551576
2 213 5550703
3 408 5558822
4 415 5557180
5 312 5553513
6 203 5557461
7 301 5555885
8 301 5551616
The following graphic shows how individual rows are relocated to make this join for a 4-AMP system:

Relocating individual rows to join a 4-AMP system

Duplicating the Smaller Table on Every AMP for a Product Join

This example illustrates a product join strategy, which includes duplicating the smaller table rows on every AMP of a 4-AMP system. The tables used for this example are the same as those used for Redistributing the Rows of One Table for a Merge Join.

The following query generates a product join:

SELECT *
FROM employee AS e INNER JOIN department AS d
ON e.dept > d.dept;

The product join is caused by the non-equijoin condition e.dept > d.dept (see Product Join). Based on the available statistics, the Optimizer determines that the department table is the smaller table in the join, and therefore devises a join plan that distributes copies of all those rows to each AMP. employee table rows, which are hash distributed on their UPI enum values, are not relocated. The product join operation returns only the rows that satisfy the specified join condition for the query after comparing every row in the employee table with every row in the duplicated department table.

The following graphic shows how individual rows are relocated to make this join for the 4-AMP system.

Relocating individual rows to join a 4-AMP system

Relocation Scenarios

The primary [AMP] index is the major consideration used by the Optimizer in determining how to join two tables and deciding which rows to relocate, though for tables that have no primary index, other criteria must be considered.

Three general scenarios can occur when two primary-indexed relations are joined using the merge join method (see Merge Join), as illustrated by the following table.

The following scenarios are ranked in order of best to worst, with 1 being best, where best means least costly:

Rank Scenario Join Column Set Relations in Join Operation
1 1 Primary [AMP] index Both
2 2 Primary [AMP] index One
3 3 Not primary [AMP] index Neither