Heuristics for Determining a Reasonable Answer Set - Advanced SQL Engine - Teradata Database

SQL Data Manipulation Language

Product
Advanced SQL Engine
Teradata Database
Release Number
17.00
Published
September 2020
Language
English (United States)
Last Update
2021-01-23
dita:mapPath
qtb1554762060450.ditamap
dita:ditavalPath
lze1555437562152.ditaval
dita:id
B035-1146
lifecycle
previous
Product Category
Teradata Vantage™
Before you can judge the accuracy of the results returned from an outer join query, you need to estimate a reasonable guess about the likelihood that an answer set is correct. This section provides two simple methods for obtaining the approximate numbers of rows you should expect from a correctly structured outer join query:
  • Example 1: A single table SELECT of an outer join used for this study.
  • Example 2: An inner join that explains the remaining queries and results in the case study. It starts with the same base of customer rows but matches them with revenue rows for a particular month.

Single Table Cardinality Estimate

This query is a single table SELECT that provides the cardinality of the customer table as restricted by the conditions used for the example outer join queries.

From the report returned by this query, we know how many rows should be returned by the outer join.

     SELECT  c.cust_num
     FROM sampdb.customer c
     WHERE c.district = 'K'
     AND  (c.service_type = 'ABC'
     OR    c.service_type = 'XYZ')
     ORDER BY 1;

This query returns 18,034 rows.

Inner Join Cardinality Estimate

This query is an inner join that helps to explain the example outer join queries and their results. The query starts with the same customer rows found in Single Table Cardinality Estimate, but then matches them with revenue rows for the month of July, 1997.

     SELECT c.custnum, b.monthly_revenue
     FROM sampdb.customer AS c, sampdb2.revenue AS b
     WHERE c.custnum = b.custnum
     AND   c.district = 'K'
     AND   b.data_year_month = 199707
     AND  (c.service_type = 'ABC'
     OR    c.service_type = 'XYZ')
     ORDER BY 1;

The query returns 13,010 rows.

Note that all customer table rows are matched by a monthly_revenue table row.

Example 1: Outer Join That Is Really an Inner Join

This example 1 makes an explicit outer join request, but its result might surprise you.

     SELECT c.custnum, b.monthly_revenue
     FROM sampdb.customer AS c
     LEFT OUTER JOIN sampdb2.revenue AS b ON c.custnum = b.custnum
     WHERE c.district ='K'
     AND   b.data_date = 199707
     AND  (c.service_type = 'ABC'
     OR    c.service_type = 'XYZ')
     ORDER BY 1;

This query returns 13,010 rows, the same as the simple inner join of Inner Join Cardinality Estimate.

Analysis of the Result

Although the request specifies a LEFT OUTER JOIN, the Optimizer treats it as a simple inner join because all the selection criteria are in the WHERE clause, so they are logically applied only after the outer join processing has been completed. Examples 1 and 3 are logically identical, so they produce the same result.

An EXPLAIN shows that the Optimizer recognizes this query to be an inner join and executes it as such (“...joined using a merge join...” rather than “...left outer joined using a merge join...”). Therefore, it executes with the speed of an inner join.

The Optimizer selects the best join algorithm for this inner join, a merge join, and applies the conditions from the WHERE clause. For information about merge join, see Teradata Vantage™ - SQL Request and Transaction Processing, B035-1142.

Logically, the WHERE clause terms are applied after the terms in the ON clause of the LEFT OUTER JOIN have been applied, which results in an intermediate step of 18,034 rows. However, the WHERE clause includes B.DATA_YEAR_MONTH=199707, eliminating all rows where the value for B.DATA_YEAR_MONTH is null and returning the incorrect result of 13,010 rows.

The Optimizer recognizes that the WHERE clause references an inner table that does not evaluate to TRUE for nulls and is therefore equivalent to an inner join. As a result, it generates a plan for a simple inner join, ignoring the request for a left outer join.

The EXPLAIN shows “...SAMPDB.CUSTOMER and SAMPDB2.REVENUE are joined using a merge join...” Had the plan used an outer join, the EXPLAIN text would have said “...SAMPDB.CUSTOMER and SAMPDB2.REVENUE are left outer joined using a merge join...”

The result of the query, 13,010 rows, is not the correct answer to our business question.

The Coding Error

The SELECT statement errs by placing the outer table in the query (customer) in the WHERE clause rather than the ON clause, resulting in the query being processed as a simple inner join rather than the intended left outer join.

You can examine this from two different perspectives, both of which explain why the expected result was not returned.

Logical explanation

The WHERE clause appears later in the query than the ON clause. Because of this, the WHERE clause predicates are evaluated after the ON clause predicates.

The result table for the ON clause has 18,034 rows. When the WHERE clause predicate b.data_year_month=199707 is applied, you should expect the ensuing result table to have fewer rows because this condition eliminates any results rows where b.data_year_month is null, and an outer join, by definition, produces results nulls on selection predicates.

Physical explanation

The EXPLAIN text for this query does not follow the logical explanation.

One of the fundamental tasks of the Optimizer is to reorganize queries into an algebraic form that can be analyzed quickly. The Optimizer recognizes that a predicate referencing the inner table of an outer join does not evaluate to TRUE when a column referenced in that predicate expression contains nulls.

The Optimizer recognizes the join to be an inner join, not an outer join. As a result, it generates a plan to perform an inner join. The EXPLAIN includes “...are joined using a merge join...” The wording would have been “...are joined using an outer merge join...” if the query had been formulated correctly as an outer join.

While this is the correct interpretation of the SELECT statement as written, the query does not formulate the business question correctly.

Example 2: Outer Join But Does Not Return the Correct Answer

Example 2 is a correctly formed outer join, but it neither addresses the question posed nor returns the correct answer.

     SELECT c.custnum, b.monthly_revenue
     FROM sampdb.customer AS c
     LEFT OUTER JOIN sampdb2.revenue AS b ON c.custnum = b.custnum
     AND  c.district ='K'
     AND  b.data_year_month = 199707
     AND (c.service_type ='ABC'
     OR   c.service_type ='XYZ')
     ORDER BY 1;

This query returns 17713502 rows.

Analysis of the Result

The query specifies a left outer join without qualifying the results table with a WHERE clause. The Optimizer constructs a plan for a left outer join as specified by the query. The result is a join of every row in the outer table (17,713,502 rows) with the qualified rows of the inner table, those service types of either ABC or XYZ in district K in the month of July, 1997 (13,010 rows).

The cardinality of this result, 17,713,502 rows, is three orders of magnitude larger than the correct answer to the business question.

Without a WHERE clause to qualify the result, an outer join result always contains at least 1 row for every row in the outer table.

Qualifying predicates placed in the ON clause do not eliminate rows from the result. Rather, they are treated as rows that are not matched with the inner table irrespective of whether join terms that reference both the inner and outer tables evaluate to TRUE. In other words, the selection criteria of an ON clause only define the rows to which nulls are to be appended for nonmatching rows. They do not restrict the result to rows matching the conditions they specify.

The Optimizer does qualifies the result to a degree, returning only those rows for which the predicate b.data_year_month = 199707 evaluates to TRUE.

The Optimizer selects the best join algorithm for this left outer join (a merge join) and applies the conditions from the ON clause.

The result of the query, 17,713,502 rows, which is not the correct answer to our business question.

The Coding Error

The SELECT statement errs by placing all of the selection criteria for the query in the ON clause. The ON clause should only include selection criteria for the inner table in the outer join, that is, the criteria that define the nullable nonmatching rows for the outer join.

Selection criteria for the outer table (the customer table in this case) in the outer join must be placed in the WHERE clause.

Example 3: Outer Join That Is Really an Inner Join

This example is another query where an explicit outer join is specified, but that join is logically an inner join, so the Optimizer transforms it into a simple inner join before performing the query.

     SELECT c.custnum, b.monthly_revenue
     FROM sampdb.customer c
     LEFT OUTER JOIN sampdb2.revenue b ON c.custnum = b.custnum
     AND  c.district ='K'
     AND (c.service_type = 'ABC'
     OR   c.service_type = 'XYZ')
     WHERE b.data_year_month = 199707
     ORDER BY 1;

This query returns 13,010 rows.

Analysis of the Result

This example is similar to Example 1. The Optimizer treats this query as an inner join even though the request explicitly specified an outer join. The WHERE clause on the inner table logically changes this query from an outer join to an inner join.

As in previous examples, the WHERE clause is logically applied after the outer join processing completes, removing all rows that were nulled in the process: nonmatching rows between the left and right tables. The Optimizer knows to execute this as an inner join to improve the performance of the query.

The EXPLAIN text matches the EXPLAIN text for Example 1. As expected, the answer set also matches.

The Optimizer selects the best join algorithm for this inner join (a merge join) and applies the conditions from the WHERE clause.

Logically, the WHERE clause term is applied after the terms in the ON clause of the LEFT OUTER JOIN have been applied. That would result in an intermediate step of 18,034 rows. However, the term b.data_year_month=199707 from the WHERE clause would then be applied, eliminating all rows where the value for b.data_year_month is null, and returning the final, incorrect, result of 13,010 rows.

The Optimizer recognizes that the WHERE clause references an inner table that does not evaluate to TRUE for nulls and is therefore equivalent to an inner join. As a result, it generates a plan for a simple inner join, ignoring the request for a left outer join.

The EXPLAIN text says “...SAMPDB.CUSTOMER and SAMPDB2.REVENUE are joined using a merge join...” Had the plan used an outer join, the EXPLAIN text would have said “...SAMPDB.CUSTOMER and SAMPDB2.REVENUE are left outer joined using a merge join...”

The result of the query, 13,010 rows, which is not the correct answer to our business question.

The Coding Error

The SELECT statement errs by placing:

    b.data_date = 199707

in the WHERE clause rather than the ON clause, resulting in the query being processed as a simple inner join rather than the left outer join as intended.

Example 4: Outer Join, the Correct Answer

Finally, we have the correct answer. This example is an outer join that provides the desired answer to the original business question.

     SELECT c.custnum, b.monthly_revenue
     FROM sampdb.customer AS c
     LEFT OUTER JOIN sampdb2.revenue AS b ON c.custnum = b.custnum
     AND   b.data_year_month = 199707
     WHERE c.district ='K'
     AND  (c.service_type = 'ABC'
     OR    c.service_type = 'XYZ')
     ORDER BY 1;

This query returns 18,034 rows.

Analysis of the Result

The cardinality of this query result, 18,034 rows, reconciles with the expected number of returned rows. See Single Table Cardinality Estimate.

13,010 rows have values (non-nulls) for monthly_revenue.

The EXPLAIN shows that the system performs a left outer join.

The Optimizer selects the best join algorithm for this outer join (a Merge Join) and applies the conditions from the WHERE and ON clauses, respectively.

The left (outer) table is limited by the search conditions in the WHERE clause, and the search condition in the ON clause for the right (inner) table defines the nullable nonmatching rows.

The EXPLAIN text confirms that this is a true outer join (“...SAMPDB.c and SAMPDB2.b are left outer joined...”).

The result of the query, 18,034 rows, the correct answer to our business question.