15.10 - Join Strategies and Methods - Teradata Database

Teradata Database SQL Request and Transaction Processing

prodname
Teradata Database
vrm_release
15.10
category
Programming Reference
User Guide
featnum
B035-1142-151K

The Optimizer has several general strategies for joining tables, including those based on factors such as join geography and join order. The Optimizer also has many join methods, or modes, to choose from to ensure that any join operation is fully optimized. Each type of join method available to the Teradata Optimizer is described in the sections that follow. Examples of join methods include product joins, merge joins, and hash joins.

The particular processing described for a given type of join (for example, duplication or redistribution of spooled data) might not apply to all joins of that type.

The following two procedures are key factors for optimizing your SQL queries:

  • Collect statistics regularly on all your regular join columns. Accurate table statistics are an absolute must if the Optimizer is to consistently choose the best join plan for a query.
  • For further information on the Optimizer form of the COLLECT STATISTICS statement, see SQL Data Definition Language.

  • To obtain an accurate description of the processing that will be performed for a particular join, always submit an EXPLAIN request modifier or perform the Visual Explain utility for the query containing the join expression. You can often reformulate a query in such a way that resource usage is more highly optimized.
  • For further information on the EXPLAIN request modifier, see SQL Data Manipulation Language.

    For further information on the Visual Explain utility, see Teradata Visual Explain User Guide.

    The following table summarizes some of the major characteristics of the most commonly used join algorithms.

     

    Join Method

                                                  Important Properties

    Product

  • Always selected by the Optimizer for WHERE clause inequality conditions.
  • High cost because of the number of comparisons required.
  • Merge
  • Hash
  • When done on matching primary [AMP] indexes, does not require any data to be redistributed.
  • Hash joins are often better performers and are used whenever possible. They can be used for equijoins only.
  • Nested

  • Only join expression that generally does not require all AMPs.
  • Preferred join expression for OLTP applications.
  • When one or both tables in a join contain skewed values, the performance of the join operation is frequently degraded. Skew on certain values refers to the variation in the number of rows that contain those values among the various AMPs. If the variation is high, meaning that some AMPs have a high number of rows with those values and some AMPs have far fewer rows with the values, the values are described as being skewed and the base table is said to be skewed on these values.

    For example, consider the 2 tables product and sales, which are joined on the condition product.product_id = sales.product_id, and column sales.product_id is skewed on the value 1.

    The classic join strategies pursue various geographic methods such as those listed below, where the relative sizes of the individual tables play a major role in selecting the optimal strategy. For example, if the table being duplicated in the second or third plan is large or if both the tables are large, any plan chosen from the list can degrade performance.

  • Redistribute product on product.product_id and redistribute sales on sales.product_id.
  • Duplicate product and access sales locally/directly.
  • Access product locally/directly and duplicate sales.
  • A partial redistribution partial duplication (PRPD) join strategy helps to minimize the impact of skew on join performance by, for example, making the join between sales and product as 2 separate joins.

  • For the first join, PRPD keeps the rows with the skewed value 1 of sales.product_id locally on each AMP and duplicates rows from product.product_id that match the skewed value of sales.product_id to all AMPs and then joins only those rows.
  • For the second join, PRPD joins the non-skewed rows from the sales table and the rest of the rows from the product table using whatever join method the Optimizer determines to be best.
  • Teradata Database combines the results of these 2 joins as the final join result.

    The process of dividing the rows in a single source into several subparts is referred to as a split. Using the PRPD join strategy, Teradata Database splits both the sales and product tables into 2 relations that participate in 2 regular joins later in the process. For the subparts of the 2 tables with skewed values, the Optimizer also selects the best join plan based on the costs of the different join plans, so the geographies are set according to the most optimal join plan.

    The preceding case is just one possible example of PRPD when there is a single skewed value in a single column of one table. The Optimizer can also use PRPD when there is more than one skewed values and multiple join columns with skew in one or both of the tables. When the set of join conditions is on expressions of base table columns and statistics have been collected on the expressions, the Optimizer can also use PRPD to join skewed tables.

    For the preceding example, the Optimizer selects a local geography for the skewed rows in sales and a duplication geography for the rows with skewed values in product. These geographies are not fixed in PRPD.

    PRPD requires the same demographic support as any other join operation, such as accurate statistics, which it uses to determine the list of skewed values. PRPD uses skew detection logic to update existing statistics if the Optimizer determines that is necessary. The Optimizer selects a join plan using PRPD only if it is cheaper than other join methods. When both relations being joined are skewed on the same value, the selection of which relation is the skewed relation depends on the number of rows with skewed values and the row size of both relations. The Optimizer selects the relation with a higher cost based on those factors as the skewed relation.

    The main challenge the Optimizer faces when it determines whether to use PRPD for a join is to detect the biased values and their frequencies because the single table conditions and previous joins, if any, might filter out some loners and alter the frequency of the surviving loners. The Optimizer does not always have accurate loner information for PRPD planning.

    PRPD is designed to operate in one of 2 modes, depending on an internal parameter setting.

     

    In this mode …

    The Optimizer considers a PRPD plan …

    normal

    only when the surviving biased values and their frequencies can be determined with some confidence.

    This is the default setting.

    aggressive

    even when there is no proper information about the surviving biased values.

    The biased values are derived based on heuristics.

    Following are some examples to clarify when the Optimizer can try a PRPD plan with join operations.

    Consider the following 3 tables for the example set.

  • t1(x1, y1, z1), primary index defined on (x1)
  • t2(x2, y2, z2), primary index defined on (x2)
  • t3(x3, y3, z3), primary index defined on (x3)
  • In the following examples, when it is said that a relation qualifies for PRPD, it means that the following items are true.

  • Statistics are collected on the hashed join column set.
  • The demographics of the join column set satisfy a set of conditions that are determined by an internal parameter setting.
  • Normal PRPD mode is assumed by default. If an example applies to aggressive mode only, that is explicitly stated.

    In this example, if either t1 or t2 qualifies for PRPD, the Optimizer tries a PRPD plan by considering t1 or t2 as the skewed relation. In this case, there are two partial joins.

    If both tables t1 and t2 are skewed, the Optimizer tries a PRPD plan using 3 joins.

         SELECT * 
         FROM t1, t2 
         WHERE t1.y1 = t2.y2;

    In this example, the derived statistics logic can find the surviving loners from the histogram with some confidence so t1 qualifies for PRPD if statistics have been collected on y1.

         SELECT * 
         FROM t1,t2 
         WHERE t1.y1 = t2.y2 
         AND   t1.y1 > 5;

    In this example, the derived statistics logic can find the range of surviving values for t1.y1 after applying the single‑table condition, so t1 qualifies for PRPD if statistics have been collected on (t1.y1) and (t1.z1, t1.y1).

    If multicolumn statistics on (t1.z1, t1.y1) have not been collected, the Optimizer cannot find the surviving loners on column t1.y1 after it applies the single-table condition so t1 does not qualify for PRPD in normal mode. For this case, the Optimizer considers PRPD only if the internal parameters for PRPD are set for aggressive mode.

         SELECT * 
         FROM t1,t2 
         WHERE t1.y1 = t2.y2 
         AND   t1.z1 > 5;

    There are other scenarios as well where the Optimizer can find the surviving values on the join column with some confidence when there are single‑table conditions. An example is when there is a sparse join index that covers the table and statistics have been collected on the join column in the join index. For this example, the join can qualify for PRPD if there is a join index with the following definition and statistics are collected on j1.y1, t1 can qualify for PRPD.

         CREATE JOIN INDEX j1 AS 
           SELECT * 
           FROM t1 
           WHERE z1 > 5;

    The Optimizer evaluates t1 for PRPD only when statistics on the expression (t1.y1 + t1.z1) are available. If the Optimizer determines that t1 qualifies for PRPD based on (t1.y1 + t1.z1) expression statistics, it tries a PRPD plan similar to “Example 1: PRPD for Multiple Skewed Tables” on page 395.

         SELECT * 
         FROM t1,t2 
         WHERE t1.y1 + t1.z1 = t2.y2;

    Suppose t1.y1 is skewed, statistics have been collected on t1.y1, and the first join (call it R4) is between tables t1 and t3. After the first join, the Optimizer cannot determine which loners from y1 survived the join condition t1.x1 = t3.x3, so for the R4 X R2 join, R4 does not qualify for PRPD.

    For this case, the Optimizer considers PRPD only if the internal parameters for PRPD are set for aggressive mode.

         SELECT *
         FROM t1,t2, t3 
         WHERE t1.y1 = t2.y2 
         AND   t1.x1 = t3.x3;

    Suppose (t1.y1, t1.z1) and (t2.y2, t2.z2) are skewed on some values and you have collected multicolumn statistics on those column sets.

    To process this request, the Optimizer normally picks an inclusion join with a pre-join sort that removes duplicates on (t1.y1, t1.z1). If there is skew on (t1.y1, t1.z1), the pre-join sort removes the duplicates and removes the skew as well. Because of this, the Optimizer does not evaluate a PRPD plan by considering t1 to be a skewed table. However, if there is skew on (t2.y2, t2.z2), the Optimizer does consider a PRPD plan.

         SELECT * 
         FROM t2 
         WHERE t2.y2 IN (SELECT t1.y1 
                         FROM t1 
                         WHERE t2.z2 = t1.z1);

    The individual join methods each have their own specific processes for joining primary‑indexed relations with NoPI relations. This topic provides a very general, high‑level look at how such joins are made.

    NoPI relation joins with primary‑indexed relations fall roughly into the groups in the bulleted list at the bottom of this topic. If the NoPI relation is spooled, it is treated exactly like a primary‑indexed relation that is spooled, so the regular primary‑indexed join strategies apply.

    The specific explanations of the various join methods that apply to joining PPI relations with NoPI relations are described in the following topics.

  • “Product Joins With Dynamic Row Partition Elimination” on page 400
  • “Merge Join Between Tables With Different Primary Index and Join Equality Columns” on page 408
  • “Supported Join Methods for Rowkey-Based Merge Join” on page 417
  • “Dynamic Row Partition Elimination for a Sliding‑Window Merge Join Between a Primary‑Indexed Table and a NoPI Table” on page 432
  • “Dynamic Row Partition Elimination for a Sliding‑Window Merge Join Between a Primary‑Indexed Table and a NoPI Table” on page 432
  • Teradata Database can directly access a column‑partitioned table or column‑partitioned join index using a hash join or product join. This means the column‑partitioned table or join index must be joining to a duplicated spool, and that spool must be relatively small for the join to be efficient.

    Teradata Database can also directly access a column‑partitioned table or join index using a rowID join. In this case, the row IDs must come from a secondary index or from a join index if the relation is column‑partitioned, on the column‑partitioned relation, or from a previous retrieve or join to the column‑partitioned relation.

    Other joins methods are possible, but to use those methods, Teradata Database must first construct the selected rows from the column partitions and then spool them, possibly with a redistribution operation and local AMP sort or duplication to all AMPs. This might be a reasonable plan if only a few rows are selected or if only a few columns are needed from the column‑partitioned relation.

    The Optimizer also considers spooling only the columns that are needed for making the join. Teradata Database writes the row IDs of the rows that qualify the join to a row ID spool, which is then used for a rowid join to the column- partitioned table or join index for the remaining columns.

    The Optimizer also considers spooling only the columns that are needed for doing the join. In this case, the row IDs of the rows that qualify the join are written to a rowid spool, which is used for a RowID join to the column- partitioned table or join index for the remaining columns. A 2-step join plan can be more efficient if only a few rows are qualified by the join. There must be statistics on the join columns for the Optimizer to consider a 2-step join with a column-partitioned table or join index.

    If a join index is applicable, the Optimizer can make use of the index without actually having to join the relations.

    For further information about product joins, see “Product Join” on page 398.

    For further information about hash joins, see “Hash Join” on page 434.

    For further information about rowID joins, see “RowID Join” on page 477.

    For further information about column‑partitioned tables and join indexes, see “CREATE TABLE” and “CREATE JOIN INDEX” in SQL Data Definition Language Detailed Topics and the chapters on primary indexes and hash and join indexes in Database Design.