Optimizer Join Plans | Join Planning/Optimization | Teradata Vantage - Optimizer Join Plans - Advanced SQL Engine - Teradata Database

SQL Request and Transaction Processing

Product
Advanced SQL Engine
Teradata Database
Release Number
17.10
Published
July 2021
Language
English (United States)
Last Update
2021-07-28
dita:mapPath
uqf1592445067244.ditamap
dita:ditavalPath
uqf1592445067244.ditaval
dita:id
B035-1142
lifecycle
previous
Product Category
Teradata Vantage™

Selecting the optimum method to join relations is as critical to the performance of a query as the selection of table access methods.

There are many ways to join relations. The method the Optimizer ultimately chooses depends on multiple factors, including the comparison on which the join is made, the estimated cardinality of the relations being joined, the configuration of the system on which the join is to be made, and so on.

This section describes join processing at several levels.

Components of a Join Plan

Join planning involves the following core components:
  • Selecting a join method.

    There are often several possible methods that can be used to make the same join. For example, it is usually, but not always, less expensive to use a merge join rather than a product join. The choice of join method often has a major effect on the overall cost of processing a query.

  • Determining an optimal join geography.

    Different methods of relocating rows to be joined can have very different costs. For example, depending on the size of the relations in a join operation, it might be less costly to duplicate one of the relations rather than redistributing it.

  • Determining an optimal join order.

    The sequence in which relations are joined can have a powerful impact on join cost. In this context, a table could be joined with a spool rather than another table. The term table is used in the most generic sense of the word, and the logical term relation is often used as a substitute for any table, view, or spool.

  • Determining the optimal spool format.

    Teradata can select regular format or in-memory format, depending on the cost. The cost model uses various parameters, including the production of an in-memory spool. In some cases the optimizer converts a regular table to an in-memory spool format table.

The following table outlines some of the terminology introduced by join planning:

Term Type Definition Example
Bind An equality constraint between two columns of different relations.

Normally used to determine whether there are conditions on an index.

(t1.a= t2.a)
Cross A predicate with expressions on different relations.

A cross term can be generated for a condition of the form column_name=expression, which is referred to as a half cross term.

The Optimizer can use a half cross term as a constraint if it is specified on an index.

The form conversion(column)=expression can be used for the same purpose if conversion(column) and column are hashed the same.

(t1.a= t2.a+1)
Exists A predicate that specifies an EXISTS condition.  
Explicit A predicate defined on a constant.

Normally used to determine whether there are conditions on an index.

(t1.a=1)
Minus A predicate that specifies a MINUS condition.  
Miscellaneous Literally a group of terms that do not belong to any of the other categories.
The set of miscellaneous terms includes the following:
  • Inequality terms.
  • Equality terms on either the same set of relations or on non-disjoint sets of relations.
  • ANDed terms.
  • ORed terms.
  • Other terms (for example, terms expressed over more than 2 relations).

A miscellaneous term can be generated for a conversion(column) = constant condition. If conversion(column) and (column) hash the same, then the miscellaneous term points to an explicit term.

For example, if the operation undertaken by the conversion is a simple renaming of the column. In this case, the miscellaneous term is used to determine whether a constraint is specified on an index column.

 
Outer join An outer join term.  

The first thing the Optimizer looks for when planning a join is connecting conditions, which are predicates, that connect an outer query and a subquery. The following are all examples of such connecting conditions:

(t1.x, t1.y) IN (SELECT t2.a, t2.b FROM t2)
--> (t1.x IN spool1.a AND (t1.y IN spool1.b)

(t1.x, t1.y) IN (SELECT t2.a, constant FROM t2)
--> (t1.x IN spool1.a) AND (t1.y=spool1.constant)

(t1.x, constant) NOT IN (SELECT t2.a, t2b FROM t2)
--> (t1.x NOT IN spool1.a) AND (constant NOT IN spool1.b)
Additionally, the Optimizer analyzes conditions to determine the connections between relations in a join operation:
  • There is a direct connection between two relations if either of the following conditions is found:
    • An ANDed bind, miscellaneous, cross, outer join, or minus term that satisfies the dependent info between the two relations.
    • A spool of a noncorrelated subquery EXIST condition connects with any outer relation.
  • An ANDed miscellaneous or explicit term on a single relation is pushed to the relation.
  • A term on no relation is pushed to a relation.
  • An ORed term that references some subqueries and a single relation is associated with that relation as a complex term.
  • All relations that are referenced in an ORed term that specifies subqueries and more than one relation are put into complex set.
  • All relations that are specified in some join condition are marked as connected.
  • Assume selection and projection are done if a relation is spooled before join, for example:
    SELECT t1.x1
    FROM t1, t2
    WHERE y1=1
    AND   x1= x2;
  • Find the following information about all relations in the set of input relations:
    • Its row size after applying projection.
    • Its cardinality after applying selection conditions.
    • The cost to read it based on the previously determined row size and cardinality.
    • Its output row cost.
    • Its primary index, if it has one.
    • The set of connected relations such as the following:
      SELECT t1.x1
      FROM t1, t2
      WHERE y1=1
      AND   x1= x2;
  • Find the following information about all base table relations in the set of input relations:
    • The relation row size.
    • The relation cardinality.
    • The selection condition list.
    • The projection list.
    • The best possible access paths (using calls to access planning functions).
  • Find the following information about all spool relations in the set of input relations:
    • The spool cardinality
    • The selection condition list.
    • Its assignment list.
    • Its spool number.

Join Processing Methods

Depending on the indexes defined for the tables involved and whether statistics are available for the indexes, the Optimizer processes a join using a join method such as one of the following:
  • Product join
  • Hash join
  • Merge join
  • Nested join
  • Exclusion join
  • Inclusion join
  • RowID join
  • Correlated join
  • Minus all join

See Join Strategies and Methods and the pages following for more information about the various join methods.

Limit on the Number of Tables or Single-Table Views That Can Be Joined

Excluding self-joins, as many as 128 tables or single-table views can be joined per query block. The maximum number of tables and single-table views that can be joined per query block is determined by the value of the MaxJoinTables performance field of the DBS Control record, which ranges from a minimum of 64 to a maximum of 128. See Teradata Vantage™ - Database Utilities, B035-1102 for details.

Loosely defined, a query block is a unit for which the Optimizer attempts to build a join plan. The following list notes a few of the more frequently occurring query blocks:
  • Noncorrelated subqueries
  • Derived tables
  • Complicated views
  • Portions of UNION and INTERSECT operations

Each reference to a relation, including those using correlation names, counts against the limit of 128 tables. This limit includes implicit joins such as the join of a hash or single-table join index to a base table to process a partial cover.

For example, consider a query with a single noncorrelated subquery. The subquery is limited to 128 tables and the outer query is limited to 127 tables, the 128th table for the outer query being the spooled result of the inner query that must be joined with it.

If the noncorrelated subquery were an outer query to an additional noncorrelated subquery, then the deepest subquery would be limited to referencing 128 tables, its outer query limited to 127 (127 plus the result of its inner query), and the parent outer query to 127 plus the result of its inner query.

In summary, while the number of tables, including intermediate spool relations, that can be joined is limited to 128 per query block, the cumulative number of tables referenced in the course of optimizing the query can be considerably greater than 128.

There is no way to determine a priori how many query blocks will be created and processed by the Optimizer in the course of producing a join plan, but the factors listed here are all candidates to evaluate if your queries terminate because they exceed the 128 table join limit.

Recommendation for Joins and Domains

Always join tables and views on columns that are defined on the same domain.

If the join columns are not defined on the same domain, then the system must convert their values to a common data type before it can process the join (see Teradata Vantage™ - Data Types and Literals, B035-1143 for information about implicit data type conversions). The conversion process is resource-intensive and thus a performance burden.

Distinct user-defined data types (UDTs) are a good method of defining domains that also bring strong typing into the picture, eliminating the possibility of implicit type conversions unless the UDT designer explicitly codes them.

This positive is balanced by the negative that you cannot define constraints on UDT columns. Because of this limiting factor, whether distinct UDTs are a good way for you to define your domains or not is a matter of balancing several factors and determining which method, if either, best suits domain definitions in your development environment.

See Teradata Vantage™ - SQL External Routine Programming, B035-1147 and Teradata Vantage™ - SQL Data Definition Language Syntax and Examples, B035-1144 for information about UDTs. See Teradata Vantage™ - Database Design, B035-1094 for information about using UDTs to define domains.

Amount of Parser Memory Required for Large Joins

If you plan to join more than 32 relations in a request, use the DBS Control Utility to increase the value for the MaxParseTreeSegs field (in the Performance group) to something greater than the default number of parse tree memory segments allocated to Parser memory for code generation. For more information about the DBS Control utility, see Teradata Vantage™ - Database Utilities, B035-1102

The MaxParseTreeSegs field defines the maximum number of parse tree memory segments the Parser allocates when parsing an SQL request. The database allocates the parse tree segments dynamically as they are needed, so this field does not unnecessarily preallocate memory that might not be used.

Evaluating Join Costing

Because that the Optimizer is cost-based, it evaluates the relative costs of the available join methods (see Join Strategies and Methods) to determine the least expensive method of joining two relations.

As an example, consider an equality product join. In this case, the smaller relation is duplicated on all AMPs and then every row in the smaller relation is compared with every row in the larger relation. The total number of comparisons to do the join is the product of the number of rows in the smaller relation and the number of rows in the larger relation.

If the estimated cardinality of the small relation is 5 rows, then each row in the right relation is estimated to make 5 comparisons. If the estimate is 500 rows, the estimated number of comparisons is 500. This is a difference of two orders of magnitude. In the former case, the product may be the least costly method, while in the later case, another join method may be less costly.

Note that the ordering of joins is a separate process from the selection of methods to use for those joins. See Determining the Order of Joins for information about how the Optimizer evaluates join orders.

Planning n-way Joins

If more than two relations are specified to be joined, the Optimizer considers joining the relations as a series of binary joins and then attempts to determine the most cost-effective join order. the Optimizer, in some cases, also considers joining a subset of the relations using an n-way join in a single step.

For example, assume you submitted the following multitable request:

SELECT ...
FROM A, B, C, D
WHERE ...;

The Optimizer generates join plans like the three diagrammed in the following graphic. Because the Optimizer uses column statistics to choose the least costly join plan from the candidate set it generates, the plans it generates are dependent on the accuracy of those numbers.


Examples of three join plans

Column projection and row selection may be done prior to doing the join. If selection criteria are not supplied, then all rows and columns participate in the join process.

It is always advantageous to reduce the number of rows and columns that must be duplicated or redistributed. However, if a join can be done directly with the base table, it may be more efficient for column projection and row selection to be done in the join step after determining if two rows qualify to be joined.

Join Order Search Trees

Query optimizers use trees to build and analyze optimal join orders. The join search tree types used most frequently by commercial relational database optimizers are the left-deep tree and the bushy tree.

Left-Deep Search Trees

When a left-deep search tree is used to analyze possible join orders, the number of possibilities produced is relatively small, as characterized by the following equation, where n is the number of relations being joined.


Number of join orders equation

For a 4-relation join, the number of join orders using a left-deep search tree is only 4!, or 24. Left-deep join trees are used by many commercial relational systems, but there are other methods that can produce many more possible join sequences, making it more likely to find a better join plan.

The following diagram illustrates a left-deep join tree for a 4-relation join, reading from the bottom up:



Bushy Search Trees

Bushy trees are an optimal method for generating more join order combinations. At the same time, the number of combinations generated can be prohibitive, so the Optimizer uses several heuristics to prune the search space. For example, with the exception of star joins, bushy search trees are not considered for unconstrained joins.

Ignoring the pruning of unconstrained joins, this method produces an order of magnitude more join possibilities that can be evaluated than the left-deep tree method. Bushy trees also provide the capability of performing some joins in parallel. For example, consider the following four0relation case:

((A JOIN B) JOIN (C JOIN D)

(A JOIN B) and (C JOIN D) can be dispatched for processing at the same time. A system that uses only left-deep trees cannot perform this kind of parallel join execution.

Consider the case of four relations. The number of permutations of four relations is 4! or 24. If the relations are named a, b, c, and d, then those 24 permutations are as follows:

abcd, abdc, acbd, acdb ... dcba

Because the system can only join two relations at a time, these 24 permutations must be further partitioned into all their possible binary join orders as follows:

abcd => (((ab)c)d)
        ((ab)(cd))
        (a(b(cd)))
        ((a(bc))d)
        ((a(b(cd))
            .
            .
            .

and so on.

The following graphic illustrates one possible sequence of binary joins of relations a, b, c, and d. Note the following about this description:
  • The graphic is read from the bottom up.
  • Intermediate join results are referred to as relations, not tables.

The illustrated process can be described as follows:

  1. Join table a with table b to create the intermediate relation ab.
  2. Join table c with table d to create the intermediate relation cd.
  3. Join relation ab with relation cd to create the final joined result, relation abcd.


Depending on table demographics and environmental costs, the Optimizer could just as likely produce the following join sequence:

  1. Join table a with table b to create the intermediate relation ab.
  2. Join relation ab with table c to create the intermediate relation abc.
  3. Join relation abc with table d to create the final joined result, relation abcd.

The Optimizer is very intelligent about looking for plans and uses numerous field-proven heuristics to ensure that more costly plans are eliminated from consideration early in the costing process in order to optimize the join plan search space.