Join Strategies and Methods | Join Planning/Optimization | Teradata Vantage - Join Strategies and Methods - Advanced SQL Engine - Teradata Database

SQL Request and Transaction Processing

Product
Advanced SQL Engine
Teradata Database
Release Number
17.05
17.00
Published
June 2020
Language
English (United States)
Last Update
2021-01-24
dita:mapPath
ykx1561500561173.ditamap
dita:ditavalPath
lze1555437562152.ditaval
dita:id
B035-1142
lifecycle
previous
Product Category
Teradata Vantage™

The Optimizer has several strategies for joining tables, including those based on factors such as join geography and join order. The Optimizer also has many join methods, or modes, to choose from to ensure that any join operation is fully optimized. A cross section of the join methods available to the Teradata Optimizer is described in the sections that follow. Examples of join methods include product joins, merge joins, and hash joins.

The particular processing described for a given join method (for example, duplication or redistribution of spooled data) might not apply to all join methods.

Guidelines for Helping to Optimize Join Operations

The following procedures are key factors for optimizing your SQL queries:
  • Collect statistics regularly on all your regular join columns. Accurate table statistics are an absolute must if the Optimizer is to consistently choose the best join plan for a query.

    For more information on the Optimizer form of the COLLECT STATISTICS statement, see Teradata Vantage™ - SQL Data Definition Language Syntax and Examples, B035-1144.

  • To obtain an accurate description of the processing that will be performed for a particular join, always submit an EXPLAIN request modifier for the query containing the join expression. You can often reformulate a query in such a way that resource usage is more highly optimized.

    For more information on the EXPLAIN request modifier, see Teradata Vantage™ - SQL Data Manipulation Language, B035-1146.

Summary of the Most Commonly Used Join Algorithms

The following table summarizes some of the major characteristics of the most commonly used join algorithms:

Join Method Important Properties
Product
  • Always selected by the Optimizer for WHERE clause inequality conditions.
  • High cost because of the number of comparisons required.
  • Merge
  • Hash
  • When done on matching primary [AMP] indexes, does not require any data to be redistributed.
  • Hash joins are often better performers and are used whenever possible. They can be used for equijoins only.
Nested
  • Only join expression that generally does not require all AMPs.
  • Preferred join expression for OLTP applications.

Strategies for Joining Skewed Tables on an Equality Join Condition

When one or both tables in a join contain skewed values, the performance of the join operation is frequently degraded. Skew on certain values refers to the variation in the number of rows that contain those values among the various AMPs. If the variation is high, meaning that some AMPs have a high number of rows with those values and some AMPs have far fewer rows with the values, the values are described as being skewed and the base table is said to be skewed on these values.

For example, consider the 2 tables product and sales, which are joined on the condition product.product_id = sales.product_id, and column sales.product_id is skewed on the value 1.

The classic join strategies pursue various geographic methods such as those listed below, where the relative sizes of the individual tables play a major role in selecting the optimal strategy. For example, if the table being duplicated in the second or third plan is large or if both the tables are large, any plan chosen from the list can degrade performance.
  • Redistribute product on product.product_id and redistribute sales on sales.product_id.
  • Duplicate product and access sales locally/directly.
  • Access product locally/directly and duplicate sales.
A partial redistribution partial duplication (PRPD) join strategy helps to minimize the impact of skew on join performance by, for example, making the join between sales and product as two separate joins.
  • For the first join, PRPD keeps the rows with the skewed value 1 of sales.product_id locally on each AMP and duplicates rows from product.product_id that match the skewed value of sales.product_id to all AMPs and then joins only those rows.
  • For the second join, PRPD joins the non-skewed rows from the sales table and the rest of the rows from the product table using whatever join method the Optimizer determines to be best.

The database combines the results of these two joins as the final join result.

The process of dividing the rows in a single source into several subparts is referred to as a split. Using the PRPD join strategy, the database splits both the sales and product tables into two relations that participate in two regular joins. For the subparts of the two tables with skewed values, the Optimizer also selects the best join plan based on the costs of the different join plans, so the geographies are set according to the most optimal join plan.

The preceding case is just one possible example of PRPD when there is a single skewed value in a single column of one table. The Optimizer can also use PRPD when there are multiple skewed values and multiple join columns with skew in one or both of the tables. When the set of join conditions is on expressions of base table columns and statistics have been collected on the expressions, the Optimizer can also use PRPD to join skewed tables.

For the preceding example, the Optimizer selects a local geography for the skewed rows in sales and a duplication geography for the rows with skewed values in product. These geographies are not fixed in PRPD.

PRPD requires the same demographic support as any other join operation, such as accurate statistics, which it uses to determine the list of skewed values. PRPD uses skew detection logic to update existing statistics if the Optimizer determines that is necessary. The Optimizer selects a join plan using PRPD only if it is cheaper than other join methods. When both relations being joined are skewed on the same value, the selection of which relation is the skewed relation depends on the number of rows with skewed values and the row size of both relations. The Optimizer selects the relation with a higher cost based on those factors as the skewed relation.

The main challenge the Optimizer faces when it determines whether to use PRPD for a join is to detect the skewed values and their frequencies because the single-table conditions and previous joins, if any, might filter out some skewed values and alter the frequency of the surviving skewed values. The Optimizer does not always have accurate skewed value information for PRPD planning.

PRPD is designed to operate in one of 2 modes, depending on an internal parameter setting.

Mode PRPD plan is considered by the Optimizer
normal only when the surviving skewed values and their frequencies can be determined with some confidence.

This is the default setting.

aggressive even when there is no proper information about the surviving skewed values.

The skewed values are derived based on heuristics.

The following examples clarify when the Optimizer can try a PRPD plan with join operations.

Consider the following 3 tables for the example set:
  • t1(x1, y1, z1), primary index defined on (x1)
  • t2(x2, y2, z2), primary index defined on (x2)
  • t3(x3, y3, z3), primary index defined on (x3)
In the following examples, when it is said that a relation qualifies for PRPD, it means that the following items are true:
  • Statistics are collected on the hashed join column set.
  • The demographics of the join column set satisfy a set of conditions that are determined by an internal parameter setting.

Normal PRPD mode is assumed by default. If an example applies to aggressive mode only, that is explicitly stated.

Example: PRPD for Multiple Skewed Tables

In this example, if either t1 or t2 qualifies for PRPD, the Optimizer tries a PRPD plan by considering t1 or t2 as the skewed relation. In this case, there are two partial joins.

If both tables t1 and t2 are skewed, the Optimizer tries a PRPD plan using 3 joins.

SELECT *
FROM t1, t2
WHERE t1.y1 = t2.y2;

Example: PRPD When Derived Statistics Can Identify Surviving Skewed Values

In this example, the derived statistics logic can find the surviving skewed values from the histogram with some confidence so t1 qualifies for PRPD if statistics have been collected on y1.

SELECT *
FROM t1,t2
WHERE t1.y1 = t2.y2
AND   t1.y1 > 5;

Example: PRPD for Mixed Single-Column and Multicolumn Statistics on Skewed Values

In this example, the derived statistics logic can find the range of surviving values for t1.y1 after applying the single-table condition, so t1 qualifies for PRPD if statistics have been collected on (t1.y1) and (t1.z1, t1.y1).

If multicolumn statistics on (t1.z1, t1.y1) have not been collected, the Optimizer cannot find the surviving skewed values on column t1.y1 after it applies the single-table condition so t1 does not qualify for PRPD in normal mode. For this case, the Optimizer considers PRPD only if the internal parameters for PRPD are set for aggressive mode.

SELECT *
FROM t1,t2
WHERE t1.y1 = t2.y2
AND   t1.z1 > 5;

There are other scenarios as well, where the Optimizer can find the surviving skewed values on the join column with some confidence when there are single-table conditions. An example is when there is a sparse join index that covers the table and statistics have been collected on the join column in the join index. For this example, the join can qualify for PRPD if there is a join index with the following definition and statistics are collected on j1.y1, t1 can qualify for PRPD.

CREATE JOIN INDEX j1 AS
  SELECT *
  FROM t1
  WHERE z1 > 5;

Example: PRPD for Skewed Expression Statistics

The Optimizer evaluates t1 for PRPD only when statistics on the expression (t1.y1 + t1.z1) are available. If the Optimizer determines that t1 qualifies for PRPD based on (t1.y1 + t1.z1) expression statistics, it tries a PRPD plan similar to Example: PRPD for Multiple Skewed Tables.

SELECT *
FROM t1,t2
WHERE t1.y1 + t1.z1 = t2.y2;

Example: PRPD for Skewed Single Column Statistics

Suppose t1.y1 is skewed, statistics have been collected on t1.y1, and the first join (call it R4) is between tables t1 and t3. After the first join, the Optimizer cannot determine which skewed values from y1 survived the join condition t1.x1 = t3.x3, so for the R4 X R2 join, R4 does not qualify for PRPD.

For this case, the Optimizer considers PRPD only if the internal parameters for PRPD are set for aggressive mode.

SELECT *
FROM t1,t2, t3
WHERE t1.y1 = t2.y2
AND   t1.x1 = t3.x3;

Example: PRPD for Multicolumn Sets Skewed on Some Values

Suppose (t1.y1, t1.z1) and (t2.y2, t2.z2) are skewed on some values and you have collected multicolumn statistics on those column sets.

To process this request, the Optimizer normally picks an inclusion join with a pre-join sort that removes duplicates on (t1.y1, t1.z1). If there is skew on (t1.y1, t1.z1), the pre-join sort removes the duplicates and removes the skew as well. Because of this, the Optimizer does not evaluate a PRPD plan by considering t1 to be a skewed table. However, if there is skew on (t2.y2, t2.z2), the Optimizer does consider a PRPD plan.

SELECT *
FROM t2
WHERE t2.y2 IN (SELECT t1.y1
                FROM t1
                WHERE t2.z2 = t1.z1);

For More Information About Joins

For more information about product joins, see Product Join.

For more information about hash joins, see Hash Join.

For more information about rowID joins, see RowID Join.

For further information about column-partitioned tables and join indexes, see “CREATE TABLE” and “CREATE JOIN INDEX” in Teradata Vantage™ - SQL Data Definition Language Detailed Topics , B035-1184 , and the information about primary indexes and hash and join indexes in Teradata Vantage™ - Database Design, B035-1094.