16.10 - Sliding-Window Merge Join - Teradata Database

Teradata Database SQL Request and Transaction Processing

prodname
Teradata Database
vrm_release
16.10
created_date
June 2017
category
Programming Reference
User Guide
featnum
B035-1142-161K

About the Sliding-Window Merge Join

A direct merge join cannot be used when one table is partitioned and the other is not, or when both tables are partitioned, but not partitioned identically, because the rows of the two tables are not ordered in the same way. The sliding-window merge join, which is PPI-aware, can be applied by the Optimizer to cover situations that the standard direct merge join cannot. Sliding-window joins can be slower than a merge join or rowkey-based merge join when there are many noneliminated nonempty combined partitions; however, a sliding-window merge join can provide roughly similar elapsed time performance (but with greater CPU utilization and memory consumption) when the number of noneliminated nonempty combined partitions is small.

Sliding-window merge joins follow the general principle of structuring each of the left and right relations into windows of appropriate sizes, appropriate in terms of the number of partitions required as determined by the Optimizer based on the available memory for making the join. The join is done as a product join between each of these left and right window pairs. The operation uses the identical algorithm to that used for a regular merge join within each window pair with the exception that the rows are not necessarily in row hash order across the multiple partitions within a window.

The obvious way to join a nonpartitioned table to a row-partitioned PI table would be to make one pass over the nonpartitioned table for each noneliminated nonempty partition of the row-partitioned PI table, executing the join as a series of subjoins. It turns out that this is an inefficient way to make the join, especially for a large nonpartitioned table. To escape the inefficiency of this method, the sliding-window merge join uses a similar concept, but minimizes the number of disk reads.

  1. The system reads the first data block for the nonpartitioned table, and then reads the first data block of each noneliminated nonempty combined partition of the row-partitioned PI table into memory.
  2. The rows from the nonpartitioned data block are compared to the rows of each row-partitioned PI data block.

    The join routines present the rows in row hash order, but you can think of the process as visiting the data block for each combined partition in turn.

  3. As the rows of a data block are exhausted, the system reads the next data block for that combined partition.

    This results in each data block of each table being read only once. This is because of partitioning. Merge joins usually have to read some number of rows in one of the tables multiple times, either from disk or from cache. There is some additional overhead to manage the pool of data blocks, but join performance is not badly degraded when the window covers all noneliminated nonempty combined partitions.

If a non-trivial fraction of the combined partitions can be eliminated because of query conditions or because they are empty, overall performance can be improved, perhaps dramatically, over a traditional merge join, depending on the percentage of combined partitions that can be eliminated or are empty.

A limiting factor for the sliding-window merge join algorithm is the number of data blocks that can be contained in memory at the same time. The file system cache memory provides memory for the data blocks.

The DBS Control performance field PPICacheThrP controls memory usage for this purpose. The value is a number expressed in tenths of a percent of the file system cache available for each query step that needs a sliding window. If the cost profile constant PPICacheThrP is set to any value other than 0, then its setting overrides the setting of the PPICacheThrP DBS Control field. Only Teradata support personnel can modify the setting of the PPICacheThrP cost profile constant.

The default value for PPICacheThrP is 10, which represents 1%.

A significant degradation of join performance can occur when there are more noneliminated nonempty combined partitions than data block buffers. Assume enough memory has been allocated for 20 data blocks and a table with 100 partitions. In this case, the sliding window method is appropriate. The first 20 combined partitions are processed as they should be, and then the system reads the nonpartitioned table again as the join window slides down to combined partitions 21 through 40. A total of five passes through the nonpartitioned table are required, and, assuming the nonpartitioned table is roughly the same size as the row-partitioned PI table, the join can conceptually take five times longer than a join for which the window covers the entire table. The actual performance degradation is not as bad as a factor of five, because the output spool has exactly the same number of rows in either case, and each smaller window is more sparse with respect to the nonpartitioned table than a larger window would be. There can be an offsetting performance gain from cache usage by the nonpartitioned table, especially when it is fairly small.

An even more expensive situation occurs in the following cases:

  • Both tables are partitioned, but have different partitioning expressions
  • There are no join conditions specified on the partitioning columns

In both cases, there can be a sliding-window advancing through both tables. The EXPLAIN text for such a query neither indicates that a sliding-window merge join is being used, nor indicates the number of contexts used for each table.

Equation element Number Specified
LDR ntp logical data reads when neither table is partitioned.
LDR t2p logical data reads when the second table is partitioned.
LDR btp logical data reads when both tables are partitioned.
d 1 data blocks for the first table.
d 2 data blocks for the second table.
p 1 uneliminated combined partitions in the first table.
p 2 uneliminated combined partitions in the second table.
k 1 data blocks that can be contained in memory for the first table.
k 2 data blocks that can be contained in memory for the second table.
This allocation, in units of data blocks … Is this value …
minimum 8 or less, to match the number of noneliminated combined partitions, even if PPICacheThrP is set to 0.
maximum the smallest of the following:
  • 256
  • the largest number which does not exceed the percentage specified in the PPICacheThrP setting
  • the number of noneliminated combined partitions

The maximum allocation can never be less than the minimum allocation.

A larger number in PPICacheThrP allocates relatively more memory, when needed, to partition windowing steps, at the expense of other steps that might use cache for other purposes.

The default PPICacheThrP setting is low enough that steps unrelated to partition windowing are not likely to be short on cache. In theory, there could be a maximum of 50 concurrent sliding window joins on a given AMP, each operating on tables with many combined partitions, which could consume up to about 50% of the cache memory allocated for windowing. Even in that worst case scenario, the few nonpartitioned steps running on the AMP would have the remaining half of the cache.

A direct PI merge join requires equality conditions on all the primary index columns of the two relations.

A sliding-window merge join processes a set of populated row partitions against each of multiple sets of populated row partitions in the other relation. This stage of the process is followed by processing the next set of populated row partitions against each of multiple sets of populated partitions in the other relation. This process is repeated until all sets of populated row partitions have been joined with the other relation.

When joining two sets together, a sliding-window merge join operation handles the row sets logically as if they are in hash order. Only those row partitions retained after the row partition elimination process are considered for join processing, and the number of combined partitions processed is determined in the same manner as is used for the single-partition case.

The sliding windows are based on the row partitions defined by the combined partitioning expression. For a sliding-window merge join to be cost effective, only a limited number of populated combined partitions can participate in the join; however, multilevel row partitioning usually defines many combined partitions for the combined partitioning expressions. This means that the Optimizer does not often have the option of choosing to use a sliding-window merge join in a multilevel row partitioning situation.

The final cost of a sliding-window merge join is the merge join cost of a window pair multiplied by the number of window pairs involved in the join. The number of window pairs involved is a function of the number of combined partitions in the PPI relation set and the window size.

The Optimizer always estimates the cost of the join plans it evaluates to process a query; however, if an estimate differs from the actual data on the AMPs as it is revealed during processing, the system might then choose to substitute a sliding-window merge join. Alternatively, the query plan developed by the Optimizer might specify a sliding-window merge join, but in the final result, the AMP database software might instead dynamically reoptimize the request and use a single-window merge join if there are few populated row partitions (see Product Joins With Dynamic Row Partition Elimination).

In cases where there are conditions on a partitioning column that permit row partition elimination, the Optimizer uses the number of active row partitions rather than the total number of row partitions.

In cases where there is a range constraint between the partitioning column of a PI relation and the other table that can be used to generate a partition-level constraint, the AMP software applies dynamic row partition elimination to the operation.

Sliding-window merge joins are generally not feasible for join operations on a PI table when the number of combined partitions in a window is significantly fewer than the number of populated, noneliminated combined partitions that must be joined.

Sliding-window merge joins can be used for both non-character partitioned tables and for character-partitioned tables.

  • Dynamic Row Partition Elimination for Equality Join Terms on the Primary Index Columns of a Character-Partitioned Table:

    A merge join with dynamic row partition elimination is not supported for character-partitioned joins. The Optimizer never selects this join method as a direct join to a character-partitioned PI table.

  • Dynamic Row Partition Elimination for a Sliding-Window Merge Join Between a Primary-Indexed Table and a NoPI Table:

    The Optimizer cannot use dynamic row partition elimination for a sliding-window merge join between a primary-indexed table and a NoPI table without first spooling the NoPI table.

  • Dynamic Row Partition Elimination for a Sliding-Window Merge Join When One of the Relations is Column-Partitioned:

    Sliding-window merge join with dynamic partition elimination must first spool and then sort the column-partitioned relation when one of the relations is column-partitioned.

Example of a Sliding-Window Merge Join

In the following example, the 2 tables are joined on their primary indexes. The WHERE conditions eliminate all but 2 row partitions for partitioning level 1, and 15 row partitions for partitioning level 2 of orders. The WHERE conditions eliminate all but 1 row partition for partitioning level 1, and 19 row partitions for partitioning level 2 of lineitem.

After Teradata Database applies these conditions, it must join 30 combined partitions of the combined partitioning expression for orders to 19 combined partitions of the combined partitioning expression of lineitem, making a total of 49 combined partitions.

Assume the following properties for this specific query:

  • The remaining combined partitions are all populated
  • Only 18 combined partitions can be joined at a time based on the PPICacheThrP setting for your system

Because of this situation, Teradata Database is able to join the 2 tables using a sliding-window merge join, assuming the Optimizer estimates it to be the most cost effective join method.

Assume that the Optimizer decides that 8 combined partitions from orders are to be joined to 10 combined partitions from lineitem at a time.

Also assume that the Optimizer clusters the combined partitions from the respective tables into the following sets of windows for making the join.

  • The Optimizer divides the combined partitions of orders into 4 sets of 8, 8, 8, and 6 combined partitions.
  • The Optimizer divides the combined partitions of lineitem into 2 sets of combined partitions of 10 and 9 combined partitions.

Teradata Database directly merge joins each set of combined partitions from orders to each set of combined partitions from lineitem, making a total of 8 single-window merge joins.

The definition DDL text for the 2 tables, orders and lineitem, to be joined in this example query is as follows:

     CREATE TABLE orders (
       o_orderkey      INTEGER NOT NULL,
       o_custkey       INTEGER,
       o_orderstatus   CHARACTER(1) CASESPECIFIC,
       o_totalprice    DECIMAL(13,2) NOT NULL,
       o_orderdate     DATE FORMAT 'yyyy-mm-dd' NOT NULL,
       o_orderpriority CHARACTER(21),
       o_clerk         CHARACTER(16),
       o_shippriority  INTEGER,
       o_comment       VARCHAR(79))
     PRIMARY INDEX (o_orderkey)
     PARTITION BY (
     RANGE_N(o_custkey   BETWEEN 0
                         AND 49999
                         EACH 100),
     RANGE_N(o_orderdate BETWEEN DATE '2000-01-01'
                         AND     DATE '2006-12-31'
                         EACH INTERVAL '1' MONTH))
     UNIQUE INDEX (o_orderkey);
     CREATE TABLE lineitem (
       l_orderkey      INTEGER NOT NULL,
       l_partkey       INTEGER NOT NULL,
       l_suppkey       INTEGER,
       l_linenumber    INTEGER,
       l_quantity      INTEGER NOT NULL,
       l_extendedprice DECIMAL(13,2) NOT NULL,
       l_discount      DECIMAL(13,2),
       l_tax           DECIMAL(13,2),
       l_returnflag    CHARACTER(1),
       l_linestatus    CHARACTER(1),
       l_shipdate      DATE FORMAT 'yyyy-mm-dd',
       l_commitdate    DATE FORMAT 'yyyy-mm-dd',
       l_receiptdate   DATE FORMAT 'yyyy-mm-dd',
       l_shipinstruct  VARCHAR(25),
       l_shipmode      VARCHAR(10),
       l_comment       VARCHAR(44))
     PRIMARY INDEX (l_orderkey)
     PARTITION BY (
     RANGE_N(l_suppkey  BETWEEN 0
                        AND  4999
                        EACH 10),
     RANGE_N(l_shipdate BETWEEN DATE '2000-01-01' 
                        AND     DATE '2006-12-31'
                        EACH INTERVAL '1' MONTH));

Given these table definitions, the Optimizer selects the direct sliding-window merge join method for the join plan to join the 49 qualifying combined partitions when it processes the following query:

     SELECT *
     FROM orders INNER JOIN lineitem
     WHERE o_orderkey = l_orderkey
     AND   o_orderdate BETWEEN DATE '2005-04-01'
                       AND     DATE '2006-06-30'
     AND   o_custkey IN (618, 973)
     AND   l_shipdate  BETWEEN DATE '2005-04-01'
                       AND     DATE '2006-10-31'
     AND   l_suppkey = 4131;