Join geography describes the relocation, if any, of rows required to perform a join. Two rows to be joined must be on the same AMP, so the Optimizer must determine the least costly row relocation strategy, and the relative cardinalities of the two relations in any join is the principal cost that must be considered. This is one of the main factors that compels having fresh statistics available for the Optimizer.
Join Distribution Strategies
- Join costs increase as a function of the number of rows that must be relocated, sorted, or both.
- Join plans for the same pairs of relations can change as a function of changes in the demographic properties of those relations.
There are four possible join distribution strategies the Optimizer can choose among, and more than one strategy can be used for any join depending on circumstances. The four join distribution strategies are the following:
Join Distribution Strategy | EXPLAIN Phrase That Reflects Strategy |
---|---|
Hash-based redistribution of one or both relations in the join | Redistribute |
Duplication of one or both relations in the join | Duplicate |
Locally building a relation | Locally Build |
Randomly redistributing one of the relations | Sort |
A fifth option is not to redistribute any rows. This option represents the degenerate case of row redistribution where no redistribution is required to make the join, and occurs only when the primary [AMP] indexes of the two relations are such that equal-valued rows of both relations hash to the same AMPs.
Examples of Different Row Redistribution Strategies
The following set of examples demonstrates four different row redistribution strategies, three using merge join and a fourth using product join.
Redistributing the Rows of One Table for a Merge Join
This example illustrates a merge join strategy, which includes hash redistributing the rows of one table and sorting those rows by the row hash value of the join column. A relocation strategy is pursued because the join condition in the query is on only one of the primary indexes of the joined tables.
The query that generates a merge join is the following:
SELECT * FROM employee AS e INNER JOIN department AS d ON e.dept = d.dept;
enum (PK, UPI) |
name | dept (FK) |
---|---|---|
1 | Higa | 200 |
2 | Kostamaa | 310 |
3 | Chiang | 310 |
4 | Korlapati | 400 |
5 | Sinclair | 150 |
6 | Kaczmarek | 400 |
7 | Eggers | 310 |
8 | Challis | 310 |
dept (PK, UPI) |
name |
---|---|
150 | Payroll |
200 | Finance |
310 | Manufacturing |
400 | Engineering |
The example is run on a 4-AMP system. The query uses an equijoin condition on the dept columns of both tables. The system copies the employee table rows into spool and redistributes those rows on the row hash of e.dept. The merge join occurs after the rows to be joined have been relocated to the same AMPs.
The relocation strategy occurs when one of the tables is already distributed on the join column row hash. The merge join is caused by the join column being the primary index of one (dept), but not both, of the tables.
Duplicating and Sorting Smaller Table on All AMPs, Building and Sorting Local Copy of Larger Table for Merge Join
This example illustrates a different merge join strategy that consists of duplicating and sorting the smaller table in the join on all AMPs and locally building a copy of the larger table and sorting the copy on the employee table row hash.
This query and tables used for this example are the same as those used for Redistributing the Rows of One Table for a Merge Join, but the Optimizer pursues a different join geography strategy, because the statistics for the tables are different. If the Optimizer determines from the available statistics that duplicating and sort the smaller table is less expensive than hash-redistributing the larger table, the Optimizer chooses the strategy followed by this scenario.
The system duplicates the department table and sorts on the dept column for all AMPs, then builds the employee table locally and sorted on the row hash value of dept.
The final step in the process is to perform the actual merge join operation.
No Row Redistribution or Sorting for a Merge Join Because the Join Rows of Both Tables Are on the Same AMP
This example illustrates a merge join strategy that requires no duplication or sorting, because the joined rows of both tables hash to the same AMPs. This is a good example of using a NUPI for one table on the same domain as the UPI of the other table. This practice makes sure that rows with the same primary index values hash to the same AMP. When the join condition is on those primary indexes, the rows to join are already on the same AMP. The system only has to compare the rows that are already on the proper AMPs.
This example is run on a 4-AMP System.
The query that generates this merge join strategy is the following:
SELECT * FROM employee AS e INNER JOIN employee_phone AS p ON e.num = p.num;
The two tables to be joined are defined as follows:
enum (PK, FK, NUPI) |
name (PK) |
dept (PK) |
---|---|---|
1 | Higa | 200 |
2 | Kostamaa | 310 |
3 | Chiang | 310 |
4 | Korlapati | 400 |
5 | Sinclair | 150 |
6 | Kaczmarek | 400 |
7 | Eggers | 310 |
8 | Challis | 310 |
enum | area_code | phone |
---|---|---|
1 | 213 | 5551576 |
2 | 213 | 5550703 |
3 | 408 | 5558822 |
4 | 415 | 5557180 |
5 | 312 | 5553513 |
6 | 203 | 5557461 |
7 | 301 | 5555885 |
8 | 301 | 5551616 |
Duplicating the Smaller Table on Every AMP for a Product Join
This example illustrates a product join strategy, which includes duplicating the smaller table rows on every AMP of a 4-AMP system. The tables used for this example are the same as those used for Redistributing the Rows of One Table for a Merge Join.
The following query generates a product join:
SELECT * FROM employee AS e INNER JOIN department AS d ON e.dept > d.dept;
The product join is caused by the non-equijoin condition e.dept > d.dept (see Product Join). Based on the available statistics, the Optimizer determines that the department table is the smaller table in the join, and therefore devises a join plan that distributes copies of all those rows to each AMP. employee table rows, which are hash distributed on their UPI enum values, are not relocated. The product join operation returns only the rows that satisfy the specified join condition for the query after comparing every row in the employee table with every row in the duplicated department table.
Relocation Scenarios
The primary [AMP] index is the major consideration used by the Optimizer in determining how to join two tables and deciding which rows to relocate, though for tables that have no primary index, other criteria must be considered.
Three general scenarios can occur when two primary-indexed relations are joined using the merge join method (see Merge Join), as illustrated by the following table.
The following scenarios are ranked in order of best to worst, with 1 being best, where best means least costly:
Rank | Scenario | Join Column Set | Relations in Join Operation |
---|---|---|---|
1 | 1 | Primary [AMP] index | Both |
2 | 2 | Primary [AMP] index | One |
3 | 3 | Not primary [AMP] index | Neither |