Join Strategies and Methods | Join Planning/Optimization | Teradata Vantage - Join Strategies and Methods - Analytics Database - Teradata Vantage

SQL Request and Transaction Processing

Deployment
VantageCloud
VantageCore
Edition
Enterprise
IntelliFlex
VMware
Product
Analytics Database
Teradata Vantage
Release Number
17.20
Published
June 2022
Language
English (United States)
Last Update
2024-10-04
dita:mapPath
zfm1628111633230.ditamap
dita:ditavalPath
qkf1628213546010.ditaval
dita:id
evd1472255317510
lifecycle
latest
Product Category
Teradata Vantage™

The Optimizer has multiple strategies for joining tables, including those based on join geography and join order. The Optimizer also has multiple join methods, or modes, to make sure any join operation is fully optimized. A cross section of the join methods available to the Teradata Optimizer is described in the sections that follow. Examples of join methods are product joins, merge joins, and hash joins.

The particular processing described for a given join method (for example, duplication or redistribution of spooled data) may not apply to all join methods.

Guidelines for Helping to Optimize Join Operations

The following procedures are key factors for optimizing your SQL queries:
  • Collect statistics regularly on all your regular join columns. Accurate table statistics are an absolute must if the Optimizer is to consistently choose the best join plan for a query.

    See Teradata Vantage™ - SQL Data Definition Language Syntax and Examples, B035-1144.

  • To obtain an accurate description of the processing that is to be performed for a particular join, always submit an EXPLAIN request modifier for the query containing the join expression. You can often reformulate a query in such a way that resource usage is more highly optimized.

    See Teradata Vantage™ - SQL Data Manipulation Language, B035-1146.

Summary of the Most Commonly Used Join Algorithms

The following table summarizes the major characteristics of the most commonly used join algorithms:

Join Method Important Properties
Product
  • Always selected by the Optimizer for WHERE clause inequality conditions.
  • High cost because of the number of comparisons required.
  • Merge
  • Hash
  • When done on matching primary [AMP] indexes, does not require any data to be redistributed.
  • Hash joins are often better performers and are used whenever possible. Hash joins can be used for equijoins only.
Nested
  • Only join expression that generally does not require all AMPs.
  • Preferred join expression for OLTP applications.

Strategies for Joining Skewed Tables on an Equality Join Condition

When one or both tables in a join contain skewed values, the performance of the join operation is frequently degraded. Skew on values refers to the variation in the number of rows that contain those values among the AMPs. If the variation is high, meaning that different AMPs have significantly different number of rows with those values, the values are described as being skewed and the base table is said to be skewed on these values.

For example, consider the tables product and sales, which are joined on the condition product.product_id = sales.product_id, and column sales.product_id is skewed on the value 1.

The classic join strategies pursue geographic methods such as those listed here, where the relative sizes of the individual tables play a major role in selecting the optimal strategy. For example, if the table being duplicated in the second or third plan is large or if both the tables are large, any plan chosen from the list can degrade performance.
  • Redistribute product on product.product_id and redistribute sales on sales.product_id.
  • Duplicate product and access sales locally/directly.
  • Access product locally/directly and duplicate sales.
A partial redistribution partial duplication (PRPD) join strategy helps to minimize the impact of skew on join performance by, for example, making the join between sales and product as two separate joins.
  • For the first join, PRPD keeps the rows with the skewed value 1 of sales.product_id locally on each AMP and duplicates rows from product.product_id that match the skewed value of sales.product_id to all AMPs and then joins only those rows.
  • For the second join, PRPD joins the non-skewed rows from the sales table and the rest of the rows from the product table using whatever join method the Optimizer determines to be best.

The database combines the results of these two joins as the final join result.

The process of dividing the rows in a single source into subparts is called a split. Using the PRPD join strategy, the database splits the sales and product tables into two relations that participate in two regular joins. For the subparts of the two tables with skewed values, the Optimizer also selects the best join plan based on the costs of the different join plans, so the geographies are set according to the most optimal join plan.

The preceding case is just one possible example of PRPD when there is a single skewed value in a single column of one table. The Optimizer can also use PRPD when there are multiple skewed values and multiple join columns with skew in one or both of the tables. When the set of join conditions is on expressions of base table columns and statistics have been collected on the expressions, the Optimizer can also use PRPD to join skewed tables.

For the preceding example, the Optimizer selects a local geography for the skewed rows in sales and a duplication geography for the rows with skewed values in product. These geographies are not fixed in PRPD.

PRPD requires accurate statistics to determine the list of skewed values. PRPD uses skew detection logic to update existing statistics if the Optimizer determines that is necessary. The Optimizer selects a join plan using PRPD only if that is cheaper than other join methods. When both relations being joined are skewed on the same value, the selection of skewed relation depends on the number of rows with skewed values and the row size of the relations. The Optimizer selects the relation with a higher cost based on those factors as the skewed relation.

The main challenge the Optimizer faces when determining whether to use PRPD for a join is to detect the skewed values and their frequencies, because the single-table conditions and previous joins, if any, may filter out skewed values and alter the frequency of the surviving skewed values. The Optimizer does not always have accurate skewed value information for PRPD planning.

PRPD is designed to operate in one of 2 modes, depending on an internal parameter setting.

Mode When Optimizer Considers PRPD Plan
Normal When the surviving skewed values and their frequencies can be determined with confidence.

This is the default setting.

Aggressive When there is no proper information about the surviving skewed values.

The skewed values are derived based on heuristics.

The following examples clarify when the Optimizer can try a PRPD plan with join operations.

Consider the following 3 tables for the example set:
  • t1(x1, y1, z1), primary index defined on (x1)
  • t2(x2, y2, z2), primary index defined on (x2)
  • t3(x3, y3, z3), primary index defined on (x3)
In the following examples, saying a relation qualifies for PRPD means the following items are true:
  • Statistics are collected on the hashed join column set.
  • The demographics of the join column set satisfy a set of conditions that are determined by an internal parameter setting.

Normal PRPD mode is assumed by default. If an example applies to aggressive mode only, that is explicitly stated.

Example: PRPD for Multiple Skewed Tables

In this example, if either t1 or t2 qualifies for PRPD, the Optimizer tries a PRPD plan by considering t1 or t2 as the skewed relation. In this case, there are two partial joins.

If both tables t1 and t2 are skewed, the Optimizer tries a PRPD plan using 3 joins.

SELECT *
FROM t1, t2
WHERE t1.y1 = t2.y2;

Example: PRPD When Derived Statistics Can Identify Surviving Skewed Values

In this example, the derived statistics logic can find the surviving skewed values from the histogram with confidence so t1 qualifies for PRPD if statistics have been collected on y1.

SELECT *
FROM t1,t2
WHERE t1.y1 = t2.y2
AND   t1.y1 > 5;

Example: PRPD for Mixed Single-Column and Multicolumn Statistics on Skewed Values

In this example, the derived statistics logic can find the range of surviving values for t1.y1 after applying the single-table condition, so t1 qualifies for PRPD if statistics have been collected on (t1.y1) and (t1.z1, t1.y1).

If multicolumn statistics on (t1.z1, t1.y1) have not been collected, the Optimizer cannot find the surviving skewed values on column t1.y1 after applying the single-table condition so t1 does not qualify for PRPD in normal mode. For this case, the Optimizer considers PRPD only if the internal parameters for PRPD are set for aggressive mode.

SELECT *
FROM t1,t2
WHERE t1.y1 = t2.y2
AND   t1.z1 > 5;

There are other scenarios where the Optimizer can find the surviving skewed values on the join column with confidence when there are single-table conditions. An example is when there is a sparse join index that covers the table and statistics have been collected on the join column in the join index. For this example, the join can qualify for PRPD if there is a join index with the following definition and statistics are collected on j1.y1, t1 can qualify for PRPD.

CREATE JOIN INDEX j1 AS
  SELECT *
  FROM t1
  WHERE z1 > 5;

Example: PRPD for Skewed Expression Statistics

The Optimizer evaluates t1 for PRPD only when statistics on the expression (t1.y1 + t1.z1) are available. If the Optimizer determines that t1 qualifies for PRPD based on (t1.y1 + t1.z1) expression statistics, the Optimizer tries a PRPD plan similar to Example: PRPD for Multiple Skewed Tables.

SELECT *
FROM t1,t2
WHERE t1.y1 + t1.z1 = t2.y2;

Example: PRPD for Skewed Single Column Statistics

Suppose t1.y1 is skewed, statistics have been collected on t1.y1, and the first join (called R4) is between tables t1 and t3. After the first join, the Optimizer cannot determine which skewed values from y1 survived the join condition t1.x1 = t3.x3, so for the R4 X R2 join, R4 does not qualify for PRPD.

For this case, the Optimizer considers PRPD only if the internal parameters for PRPD are set for aggressive mode.

SELECT *
FROM t1,t2, t3
WHERE t1.y1 = t2.y2
AND   t1.x1 = t3.x3;

Example: PRPD for Multicolumn Sets with Skewed Values

Suppose (t1.y1, t1.z1) and (t2.y2, t2.z2) have skewed values and you have collected multicolumn statistics on those column sets.

To process this request, the Optimizer normally picks an inclusion join with a pre-join sort that removes duplicates on (t1.y1, t1.z1). If there is skew on (t1.y1, t1.z1), the pre-join sort removes the duplicates and removes the skew as well. Because of this, the Optimizer does not evaluate a PRPD plan by considering t1 to be a skewed table. However, if there is skew on (t2.y2, t2.z2), the Optimizer does consider a PRPD plan.

SELECT *
FROM t2
WHERE t2.y2 IN (SELECT t1.y1
                FROM t1
                WHERE t2.z2 = t1.z1);

Related Information

Product joins, see Product Join.

Hash joins, see Hash Join.

rowID joins, see RowID Join.

Column-partitioned tables and join indexes, see the information about CREATE TABLE and CREATE JOIN INDEX in Teradata Vantage™ - SQL Data Definition Language Detailed Topics, B035-1184 and the information about primary indexes and hash and join indexes in Teradata Vantage™ - Database Design, B035-1094.