Selecting the optimum method to join relations is as critical to the performance of a query as the selection of table access methods.
There are multiple ways to join relations. The method the Optimizer ultimately chooses depends on multiple factors, including the comparison on which the join is made, the estimated cardinality of the relations being joined, the configuration of the system on which the join is to be made, and so on.
This section describes join processing at different levels.
Components of a Join Plan
- Selecting a join method.
There are often multiple methods that can be used to make the same join. For example, typically, a merge join is less expensive than a product join. The choice of join method often has a major effect on the overall cost of processing a query.
- Determining an optimal join geography.
Different methods of relocating rows to be joined can have different costs. For example, depending on the size of the relations in a join operation, duplicating one of the relations may be less costly rather than redistributing it.
- Determining an optimal join order.
The sequence in which relations are joined can have a powerful impact on join cost. In this context, a table could be joined with a spool rather than another table. The term table is used in the most generic sense of the word, and the logical term relation is often used as a substitute for any table, view, or spool.
- Determining the optimal spool format.
Teradata can select regular format or in-memory format, depending on the cost. The cost model uses multiple parameters, including the production of an in-memory spool. The optimizer may convert a regular table to an in-memory spool format table.
The following table summarizes join planning terminology.
Term Type | Definition | Example |
---|---|---|
Bind | An equality constraint between two columns of different relations. Normally used to determine whether there are conditions on an index. |
(t1.a= t2.a) |
Cross | A predicate with expressions on different relations. A cross term can be generated for a condition of the form column_name=expression, which is referred to as a half cross term. The Optimizer can use as a constraint a half cross term that is specified on an index. The form conversion(column)=expression can be used for the same purpose if conversion(column) and column are hashed the same. |
(t1.a= t2.a+1) |
Exists | A predicate that specifies an EXISTS condition. | |
Explicit | A predicate defined on a constant. Normally used to determine whether there are conditions on an index. |
(t1.a=1) |
Minus | A predicate that specifies a MINUS condition. | |
Miscellaneous | Literally a group of terms that do not belong to any of the other categories. The set of miscellaneous terms includes the following:
A miscellaneous term can be generated for a conversion(column) = constant condition. If conversion(column) and (column) hash the same, then the miscellaneous term points to an explicit term. For example, if the operation undertaken by the conversion is a simple renaming of the column. In this case, the miscellaneous term is used to determine whether a constraint is specified on an index column. |
|
Outer join | An outer join term. |
The first thing the Optimizer looks for when planning a join is connecting conditions, which are predicates, that connect an outer query and a subquery. The following are all examples of such connecting conditions:
(t1.x, t1.y) IN (SELECT t2.a, t2.b FROM t2) --> (t1.x IN spool1.a AND (t1.y IN spool1.b) (t1.x, t1.y) IN (SELECT t2.a, constant FROM t2) --> (t1.x IN spool1.a) AND (t1.y=spool1.constant) (t1.x, constant) NOT IN (SELECT t2.a, t2b FROM t2) --> (t1.x NOT IN spool1.a) AND (constant NOT IN spool1.b)
- There is a direct connection between two relations if either of the following conditions is found:
- An ANDed bind, miscellaneous, cross, outer join, or minus term that satisfies the dependent info between the two relations.
- A spool of a noncorrelated subquery EXIST condition connects with any outer relation.
- An ANDed miscellaneous or explicit term on a single relation is pushed to the relation.
- A term on no relation is pushed to a relation.
- An ORed term that references subqueries and a single relation is associated with that relation as a complex term.
- All relations that are referenced in an ORed term that specifies subqueries and more than one relation are put into complex set.
- All relations that are specified in a join condition are marked as connected.
- Assume selection and projection are done if a relation is spooled before join, for example:
SELECT t1.x1 FROM t1, t2 WHERE y1=1 AND x1= x2;
- Find the following information about all relations in the set of input relations:
- Row size after applying projection.
- Cardinality after applying selection conditions.
- Cost to read, based on the previously determined row size and cardinality.
- Output row cost.
- Primary index, if any.
- The set of connected relations such as the following:
SELECT t1.x1 FROM t1, t2 WHERE y1=1 AND x1= x2;
- Find the following information about all base table relations in the set of input relations:
- The relation row size.
- The relation cardinality.
- The selection condition list.
- The projection list.
- The best possible access paths (using calls to access planning functions).
- Find the following information about all spool relations in the set of input relations:
- The spool cardinality
- The selection condition list.
- Its assignment list.
- Its spool number.
Join Processing Methods
- Product join
- Hash join
- Merge join
- Nested join
- Exclusion join
- Inclusion join
- RowID join
- Correlated join
- Minus all join
See Join Strategies and Methods and the pages following for more information about the join methods.
Limit on the Number of Tables or Single-Table Views That Can Be Joined
Excluding self-joins, up to 128 tables or single-table views can be joined per query block. The maximum number of tables and single-table views that can be joined per query block is determined by the value of the MaxJoinTables performance field of the DBS Control record, which ranges from a minimum of 64 to a maximum of 128. See Teradata Vantage™ - Database Utilities, B035-1102 for details.
- Noncorrelated subqueries
- Derived tables
- Complicated views
- Portions of UNION and INTERSECT operations
Each reference to a relation, including those using correlation names, counts against the limit of 128 tables. This limit includes implicit joins such as the join of a hash or single-table join index to a base table to process a partial cover.
For example, consider a query with a single noncorrelated subquery. The subquery is limited to 128 tables and the outer query is limited to 127 tables, the 128th table for the outer query being the spooled result of the inner query that must be joined with it.
If the noncorrelated subquery is an outer query to an additional noncorrelated subquery, the deepest subquery is limited to referencing 128 tables, its outer query limited to 127 (127 plus the result of its inner query), and the parent outer query to 127 plus the result of its inner query.
In summary, while the number of tables, including intermediate spool relations, that can be joined is limited to 128 per query block, the cumulative number of tables referenced in the course of optimizing the query can be considerably greater than 128.
Recommendation for Joins and Domains
Always join tables and views on columns that are defined on the same domain.
If the join columns are not defined on the same domain, the system must convert their values to a common data type before processing the join (see Teradata Vantage™ - Data Types and Literals, B035-1143 for information about implicit data type conversions). The conversion process is resource-intensive and thus a performance burden.
Distinct user-defined data types (UDTs) are a good method of defining domains that also bring strong typing into the picture, eliminating the possibility of implicit type conversions unless the UDT designer explicitly codes them.
This advantage is balanced by the disadvantage that you cannot define constraints on UDT columns.
See Teradata Vantage™ - SQL External Routine Programming, B035-1147 and Teradata Vantage™ - SQL Data Definition Language Syntax and Examples, B035-1144 for information about UDTs. See Teradata Vantage™ - Database Design, B035-1094 for information about using UDTs to define domains.
Amount of Parser Memory Required for Large Joins
If you plan to join more than 32 relations in a request, use the DBS Control Utility to increase the value for the MaxParseTreeSegs field (in the Performance group) to something greater than the default number of parse tree memory segments allocated to Parser memory for code generation. For more information about the DBS Control utility, see Teradata Vantage™ - Database Utilities, B035-1102
The MaxParseTreeSegs field defines the maximum number of parse tree memory segments the Parser allocates when parsing an SQL request. The database allocates the parse tree segments as needed, so this field does not unnecessarily preallocate memory that may not be used.
Evaluating Join Costing
The Optimizer is cost-based, and therefore evaluates the relative costs of the available join methods (see Join Strategies and Methods) to determine the least expensive method of joining two relations.
As an example, consider an equality product join. In this case, the smaller relation is duplicated on all AMPs and then every row in the smaller relation is compared with every row in the larger relation. The total number of comparisons to do the join is the product of the number of rows in the smaller relation and the number of rows in the larger relation.
If the estimated cardinality of the small relation is 5 rows, then each row in the right relation is estimated to make 5 comparisons. If the estimate is 500 rows, the estimated number of comparisons is 500. This is a difference of two orders of magnitude. In the former case, the product may be the least costly method, while in the later case, another join method may be less costly.
Note that the ordering of joins is a separate process from the selection of methods to use for those joins. See Determining the Order of Joins for information about how the Optimizer evaluates join orders.
Planning n-way Joins
If more than two relations are specified to be joined, the Optimizer considers joining the relations as a series of binary joins and then attempts to determine the most cost-effective join order.The Optimizer may also consider joining a subset of the relations using an n-way join in a single step.
For example, assume you submitted the following multitable request:
SELECT ... FROM A, B, C, D WHERE ...;
The Optimizer generates join plans like those in the following graphic. Because the Optimizer uses column statistics to choose the least costly join plan from the generated candidate set, its plans depend on the accuracy of those numbers.
Column projection and row selection may be done prior to doing the join. If selection criteria are not supplied, then all rows and columns participate in the join process.
Reducing the number of rows and columns that must be duplicated or redistributed is always advantageous. However, joining directly with the base table may be more efficient for column projection and row selection to be done in the join step after determining if two rows qualify to be joined.
Join Order Search Trees
Query optimizers use trees to build and analyze optimal join orders. The join search tree types used most frequently by commercial relational database optimizers are the left-deep tree and the bushy tree.
Left-Deep Search Trees
When a left-deep search tree is used to analyze possible join orders, the number of possibilities produced is relatively small, as characterized by the following equation, where n is the number of relations being joined.
For a 4-relation join, the number of join orders using a left-deep search tree is 4!, or 24. Left-deep join trees are typically used by commercial relational systems, but other methods can produce other join sequences, increasing the likelihood of finding a better join plan.
The following diagram illustrates a left-deep join tree for a 4-relation join, reading from the bottom up:
Bushy Search Trees
Bushy trees are an optimal method for generating more join order combinations. However, the number of combinations generated can be prohibitive, so the Optimizer uses heuristics to prune the search space. For example, with the exception of star joins, bushy search trees are not considered for unconstrained joins.
Ignoring the pruning of unconstrained joins, this method produces an order of magnitude more join possibilities that can be evaluated than the left-deep tree method. Bushy trees also provide the capability of performing joins in parallel. For example, consider the following four-relation case:
((A JOIN B) JOIN (C JOIN D)
(A JOIN B) and (C JOIN D) can be processed at the same time. A system that uses only left-deep trees cannot perform parallel join execution.
Consider the case of four relations. The number of permutations of four relations is 4! or 24. If the relations are named a, b, c, and d, then those 24 permutations are as follows:
abcd, abdc, acbd, acdb ... dcba
Because the system can only join two relations at a time, these 24 permutations must be further partitioned into all their possible binary join orders as follows:
abcd => (((ab)c)d) ((ab)(cd)) (a(b(cd))) ((a(bc))d) ((a(b(cd)) . . .
and so on.
- The graphic is read from the bottom up.
- Intermediate join results are referred to as relations, not tables.
- Join table a with table b to create the intermediate relation ab.
- Join table c with table d to create the intermediate relation cd.
- Join relation ab with relation cd to create the final joined result, relation abcd.
Depending on table demographics and environmental costs, the Optimizer could just as likely produce the following join sequence:
- Join table a with table b to create the intermediate relation ab.
- Join relation ab with table c to create the intermediate relation abc.
- Join relation abc with table d to create the final joined result, relation abcd.
The Optimizer is intelligent about looking for plans and uses field-proven heuristics to eliminate more costly plans from consideration early in the costing process, to optimize the join plan search space.