Join geography is a term that describes the relocation, if any, of rows required to perform a join. Remember that for two rows to be joined, they must be on the same AMP, so the Optimizer must determine the least costly row relocation strategy, and the relative cardinalities of the two relations in any join is the principal cost that must be considered. This is one of the main factors that compels having fresh statistics available for the Optimizer.
Join Distribution Strategies
- Join costs increase as a function of the number of rows that must be relocated, sorted, or both.
- Join plans for the same pairs of relations can change as a function of changes in the demographic properties of those relations.
- Hash-based redistribution of one or both relations in the join.
The EXPLAIN phrase to watch for is Redistribute.
- Duplication of one or both relations in the join.
The EXPLAIN phrase to watch for is Duplicate.
- Locally building a relation.
The EXPLAIN phrase to watch for is Locally Build.
- Randomly redistributing one of the relations.
After relocating a relation, the relation may need to be sorted. The EXPLAIN phrase to watch for is Sort.
There is also a fifth option, which is not to redistribute any rows. This option represents the degenerate case of row redistribution where no redistribution is required to make the join, and it occurs only when the primary [AMP] indexes of the two relations are such that equal-valued rows of both relations hash to the same AMPs.
Examples of Different Row Redistribution Strategies
The following set of examples demonstrates four different row redistribution strategies, three using merge join and a fourth using product join.
Redistributing the Rows of One Table for a Merge Join
This example illustrates a merge join strategy, which includes hash redistributing the rows of one table and sorting them by the row hash value of the join column. A relocation strategy is pursued because the join condition in the query is on only one of the primary indexes of the joined tables.
The query that generates a merge join is the following:
SELECT * FROM employee AS e INNER JOIN department AS d ON e.dept = d.dept;
enum (PK, UPI) |
name | dept (FK) |
---|---|---|
1 | Higa | 200 |
2 | Kostamaa | 310 |
3 | Chiang | 310 |
4 | Korlapati | 400 |
5 | Sinclair | 150 |
6 | Kaczmarek | 400 |
7 | Eggers | 310 |
8 | Challis | 310 |
dept (PK, UPI) |
name |
---|---|
150 | Payroll |
200 | Finance |
310 | Manufacturing |
400 | Engineering |
The example is run on a 4-AMP system. The query uses an equijoin condition on the dept columns of both tables. The system copies the employee table rows into spool and redistributes them on the row hash of e.dept. The merge join occurs after the rows to be joined have been relocated so they are on the same AMPs.
The relocation strategy occurs when one of the tables is already distributed on the join column row hash. The merge join is caused by the join column being the primary index of one (dept), but not both, of the tables.
Duplicating and Sorting the Rows of the Smaller Table on all AMPs, Building and Sorting a Local Copy of the Larger Table For a Merge Join
This example illustrates a different merge join strategy that consists of duplicating and sorting the smaller table in the join on all AMPs and locally building a copy of the larger table and sorting it on the employee table row hash.
This query and tables used for this example are the same as those used for Redistributing the Rows of One Table for a Merge Join, but the Optimizer pursues a different join geography strategy because the statistics for the tables are different. If the Optimizer determines from the available statistics that it would be less expensive to duplicate and sort the smaller table than to hash redistribute the larger table, it chooses the strategy followed by this scenario.
The system first duplicates the department table and then sorts it on the dept column for all AMPs. Next, the employee table is built locally and sorted on the row hash value of dept.
The final step in the process is to perform the actual merge join operation.
No Row Redistribution or Sorting for a Merge Join Because the Join Rows of Both Tables Are on the Same AMP
This example also illustrates a merge join strategy. This particular strategy does not require any duplication or sorting of rows because the joined rows of both tables hash to the same AMPs. This is a good example of using a NUPI for one table on the same domain as the UPI of the other table. This ensures that the rows having the same primary index values all hash to the same AMP. When the join condition is on those primary indexes, the rows to be joined are already on the same AMP, so no relocation or sorting need be done. All the system has to do is compare the rows that are already on the proper AMPs.
This example is run on a 4-AMP System.
The query that generates this merge join strategy is the following:
SELECT * FROM employee AS e INNER JOIN employee_phone AS p ON e.num = p.num;
The two tables to be joined are defined as follows:
enum (PK, FK, NUPI) |
name (PK) |
dept (PK) |
---|---|---|
1 | Higa | 200 |
2 | Kostamaa | 310 |
3 | Chiang | 310 |
4 | Korlapati | 400 |
5 | Sinclair | 150 |
6 | Kaczmarek | 400 |
7 | Eggers | 310 |
8 | Challis | 310 |
enum | area_code | phone |
---|---|---|
1 | 213 | 5551576 |
2 | 213 | 5550703 |
3 | 408 | 5558822 |
4 | 415 | 5557180 |
5 | 312 | 5553513 |
6 | 203 | 5557461 |
7 | 301 | 5555885 |
8 | 301 | 5551616 |
Duplicating the Smaller Table on Every AMP for a Product Join
This example illustrates a product join strategy, which includes duplicating the smaller table rows on every AMP of a 4-AMP system. The tables used for this example are the same as those used for Redistributing the Rows of One Table for a Merge Join.
The following query generates a product join:
SELECT * FROM employee AS e INNER JOIN department AS d ON e.dept > d.dept;
The product join is caused by the non-equijoin condition e.dept > d.dept (see Product Join). Based on the available statistics, the Optimizer determines that the department table is the smaller table in the join, so it devises a join plan that distributes copies of all those rows to each AMP. employee table rows, which are hash distributed on their UPI enum values, are not relocated. The product join operation returns only the rows that satisfy the specified join condition for the query after comparing every row in the employee table with every row in the duplicated department table.
Relocation Scenarios
The primary [AMP] index is the major consideration used by the Optimizer in determining how to join two tables and deciding which rows to relocate, though for tables that have no primary index, other criteria must be considered.
Three general scenarios can occur when two primary-indexed relations are joined using the merge join method (see Merge Join), as illustrated by the following table.
The following scenarios are ranked in order of best to worst, with 1 being best, where best means least costly:
Rank | Scenario | WHEN the join column set is … | FOR these relations in the join operation … |
---|---|---|---|
1 | 1 | the primary [AMP] index | both. |
2 | 2 | the primary [AMP] index | one. |
3 | 3 | not the primary [AMP] index | neither. |