Teradata Database SQL Data Definition Language Detailed Topics

Product: Teradata Database
Release Number: 15.00
Content Type: Programming Reference
Publication ID: B035-1184-015K
Language: English (United States)

Rules and Restrictions for PPI Join Indexes

Join indexes can improve query response times in the following ways:

  • Multitable join indexes prejoin tables so that the join result is fetched directly from the join index to answer applicable queries instead of calculating the joins dynamically.
  • Single‑table join indexes effectively redistribute the rows of their underlying base table by choosing a different partitioning or primary index than the base table, which makes joins between the join index and other tables or indexes more efficient. You can also use a single‑table join index to establish an alternate access path.
  • Aggregate join indexes preaggregate results so they can be used to answer aggregation queries directly.
  • Sparse join indexes store a subset of the rows from their underlying base table set, based on a WHERE clause condition. This makes it possible for a smaller join index to be used in answering queries when applicable.

A join index can also combine these individual types. For example, a join index can have all of the following properties at the same time (an illustrative sketch follows this list):

  • Sparse
  • Aggregate
  • Multitable
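
For illustration, the following is a minimal sketch of a join index that is sparse, aggregate, and multitable at the same time, written against the sales and product tables of the star schema in Scenario A below. The index name and the year chosen for the sparse WHERE condition are hypothetical.

     CREATE JOIN INDEX sparse_sales_2005 AS                  -- hypothetical index name
       SELECT prod_category, SUM(amount) AS yearly_sales     -- aggregate
       FROM product AS p, sales AS s                         -- multitable
       WHERE p.prod_id = s.prod_id
       AND   s.sale_date BETWEEN DATE '2005-01-01'           -- sparse: keeps only 2005 rows
                         AND     DATE '2005-12-31'
       GROUP BY prod_category;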

You can create a partitioned primary index for any kind of uncompressed join index. Therefore, a PPI join index can improve query performance by leveraging both join index and PPI benefits. The following two scenarios indicate some possibilities in which an uncompressed join index with a PPI might be beneficial.

    Scenario A: Assume a star schema with the following tables in the physical data model:

     

    Dimensional Model Table Type    Corresponding Physical Table and Column Sets
    ----------------------------    --------------------------------------------
    Fact                            sales (sale_date, store_id, prod_id, amount)
    Dimension                       calendar (dayofmth, mth, yr)
                                      PRIMARY INDEX(yr, mth)
                                    product (prod_id, prod_category)
                                    org (store_id, area, division, business_unit)

    You might want to create a single‑level PPI aggregate join index with daily summary data to answer some ad hoc queries for this database schema, such as the following:

         CREATE JOIN INDEX sales_summary AS
           SELECT sale_date, prod_category, division, 
                  SUM(amount) AS daily_sales
           FROM calendar AS c, product AS p, org AS o, sales AS s
           WHERE c.dayofmth = s.sale_date
           AND   p.prod_id = s.prod_id
           AND   o.store_id = s.store_id 
           GROUP BY sale_date, prod_category, division
         PRIMARY INDEX(sale_date, prod_category, division)
         PARTITION BY RANGE_N(sale_date BETWEEN DATE '1990-01-01' 
                                        AND     DATE '2005-12-31' 
                                        EACH INTERVAL '1' MONTH);

    A wide range of queries can make use of this join index, especially when a foreign key‑to‑primary key relationship exists between the fact table and the dimension tables, because that relationship enables the join index to broadly cover queries that reference only a subset of the dimension tables (see “Query Coverage by Join Indexes” on page 382, “Restriction on Coverage by Join Indexes When a Join Index Definition References More Tables Than a Query” on page 386, and “Rules for Whether Join Indexes With Extra Tables Cover Queries” on page 390). Also see Database Design and SQL Request and Transaction Processing for information about broad join indexes.

    A significant performance gain is achieved with this PPI join index when it is used to answer the queries that have equality or range constraints on the sale_date column.

    Optimizations for PPI base tables, such as row partition elimination, are applied to a join index in the same way as they are applied to a base table.

    For example, the following query only needs to access 12 of the 192 partitions (the 16‑year range from 1990 through 2005, partitioned at 1‑month intervals, yields 192 partitions), which saves up to 93.75 percent of the disk reads, with a proportional reduction in elapsed time, compared with a corresponding unpartitioned join index, which must perform a full‑table scan to answer the same query:

         EXPLAIN SELECT prod_category, SUM(amount) AS daily_sales
         FROM calendar AS c, product AS p, sales AS s
         WHERE c.dayofmth = s.sale_date
         AND   p.prod_id = s.prod_id 
         AND   sale_date BETWEEN DATE '2005-01-01' AND DATE '2005-12-31'
         GROUP BY prod_category;

    Important phrases in the following report are highlighted in boldface type.

         Explanation
         ----------------------------------------------------------------------
         1) First, we lock a distinct HONG."pseudo table" for read on a
            RowHash to prevent global deadlock for HONG.SALES_SUMMARY.
         2) Next, we lock HONG.SALES_SUMMARY for read.
         3) We do an all-AMPs SUM step to aggregate from 12 partitions of
            HONG.SALES_SUMMARY with a condition of (
            "(HONG.SALES_SUMMARY.sale_date >= DATE '2005-01-01') AND
            ((HONG.SALES_SUMMARY.sale_date <= DATE '2005-12-31') AND
            ((HONG.SALES_SUMMARY.sale_date >= DATE '2005-01-01') AND
            (HONG.SALES_SUMMARY.sale_date <= DATE '2005-12-31')))"), and the
            grouping identifier in field 1.  Aggregate Intermediate Results
            are computed globally, then placed in Spool 3.  The size of Spool
            3 is estimated with no confidence to be 1 row.  The estimated time
            for this step is 0.04 seconds.
         4) We do an all-AMPs RETRIEVE step from Spool 3 (Last Use) by way of
            an all-rows scan into Spool 1 (group_amps), which is built locally
            on the AMPs.  The size of Spool 1 is estimated with no confidence
            to be 1 row.  The estimated time for this step is 0.04 seconds.
         5) Finally, we send out an END TRANSACTION step to all AMPs involved
            in processing the request.
         -> The contents of Spool 1 are sent back to the user as the result of
            statement 1.  The total estimated time is 0.07 seconds.

    Scenario B: With the same schema used in Scenario A, you might sometimes choose to create a single‑level PPI single‑table join index on the fact table that includes the foreign key of one or more dimension tables so it can be used to join to the corresponding dimension table set and roll up to higher aggregate levels. The following is an example of one such single‑table join index:

         CREATE JOIN INDEX sji AS 
           SELECT sale_date, prod_id, SUM(amount) AS sales_amount
           FROM Sales
           GROUP BY sale_date, prod_id
         PRIMARY INDEX(sale_date, prod_id)
         PARTITION BY RANGE_N(sale_date BETWEEN DATE '1990-01-01' 
                                        AND     DATE '2005-12-31' 
                                        EACH INTERVAL '1' MONTH);

    For a query asking for the total sales amount at the month level, the Optimizer can join this join index to the calendar table and roll up:

         EXPLAIN SELECT yr, mth, SUM(amount)
         FROM sales, Calendar
         WHERE sale_date=dayofmth
         AND sale_date BETWEEN DATE '2005-01-01' AND DATE '2005-12-31'
         GROUP BY yr, mth;

    Important phrases in the following report are highlighted in boldface type.

         Explanation
         ----------------------------------------------------------------------
           1) First, we lock a distinct HONG."pseudo table" for read on a
              RowHash to prevent global deadlock for HONG.Calendar.
           2) Next, we lock a distinct HONG."pseudo table" for read on a RowHash
              to prevent global deadlock for HONG.SJI.
           3) We lock HONG.Calendar for read, and we lock HONG.SJI for read.
           4) We execute the following steps in parallel.
                1) We do an all-AMPs RETRIEVE step from HONG.Calendar by way of
                   an all-rows scan with a condition of (
                   "(HONG.Calendar.dayofmth <= DATE '2005-12-31') AND
                   (HONG.Calendar.dayofmth >= DATE '2005-01-01')") into Spool 4
                   (all_amps), which is duplicated on all AMPs.  Then we do a
                   SORT to order Spool 4 by row hash.  The size of Spool 4 is
                   estimated with no confidence to be 2 rows.  The estimated
                   time for this step is 0.01 seconds.
                2) We do an all-AMPs RETRIEVE step from 12 partitions of
                   HONG.SJI with a condition of ("(HONG.SJI.sale_date >= DATE
                   '2005-01-01') AND ((HONG.SJI.sale_date <= DATE '2005-12-31')
                   AND (NOT (HONG.SJI.sale_date IS NULL )))") into Spool 5
                   (all_amps), which is built locally on the AMPs.  Then we do a
                   SORT to order Spool 5 by row hash.  The size of Spool 5 is
                   estimated with no confidence to be 1 row.  The estimated time
                   for this step is 0.00 seconds.
         5) We do an all-AMPs JOIN step from Spool 4 (Last Use) by way of a
            RowHash match scan, which is joined to Spool 5 (Last Use) by way
            of a RowHash match scan.  Spool 4 and Spool 5 are joined using a
            merge join, with a join condition of ("sale_date = dayofmth").
            The result goes into Spool 3 (all_amps), which is built locally on
            the AMPs.  The size of Spool 3 is estimated with no confidence to
            be 1 row.  The estimated time for this step is 0.04 seconds.
         6) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) by
            way of an all-rows scan, and the grouping identifier in field 1.
            Aggregate Intermediate Results are computed globally, then placed
            in Spool 6.  The size of Spool 6 is estimated with no confidence
            to be 1 row.  The estimated time for this step is 0.05 seconds.
         7) We do an all-AMPs RETRIEVE step from Spool 6 (Last Use) by way of
            an all-rows scan into Spool 1 (group_amps), which is built locally
            on the AMPs.  The size of Spool 1 is estimated with no confidence
            to be 1 row.  The estimated time for this step is 0.04 seconds.
         8) Finally, we send out an END TRANSACTION step to all AMPs involved
            in processing the request.
         -> The contents of Spool 1 are sent back to the user as the result of
            statement 1.  The total estimated time is 0.13 seconds.

    Notice that row partition elimination is applied to sji, which reduces the number of row partitions accessed to 12 of 192, making the join much faster than it would be with an otherwise identical NPPI single‑table join index, which requires a full‑table scan. Teradata Database can apply other PPI‑related join optimizations, such as dynamic row partition elimination and rowkey‑based merge join, to PPI join indexes just as it applies them to PPI base tables when they are found to be applicable.
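
    For example, the following query shape, written against the same schema, joins the fact table to the calendar dimension on the partitioning column but, unlike the previous query, constrains only the calendar table. If the Optimizer substitutes the sji join index for sales, any partition elimination on sji would have to happen dynamically from the qualifying calendar rows. This is an illustrative sketch only; whether these optimizations are applied depends on Optimizer costing.

         SELECT c.yr, c.mth, SUM(s.amount)
         FROM sales AS s, calendar AS c
         WHERE s.sale_date = c.dayofmth
         AND   c.yr  = 2005
         AND   c.mth IN (1, 2, 3)
         GROUP BY c.yr, c.mth;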

    While a join index can greatly improve query response time, you must also consider the overhead of maintaining it when rows are inserted into, deleted from, or updated in the base tables on which the index is defined. An uncompressed PPI join index can see significant improvement in its maintenance performance from PPI‑related optimizations. For example, for the join indexes defined in Scenarios A and B, Teradata Database can apply row partition elimination when the data maintenance statement specifies an equality or range constraint on the row partitioning column.

    For example:

         EXPLAIN DELETE sales 
                 WHERE sale_date BETWEEN DATE '1999-01-01' 
                                 AND     DATE '1999-12-31';

    Important phrases in the following report are highlighted in boldface type.

         Explanation
         ------------------------------------
           1) First, we lock a distinct HONG."pseudo table" for write on a
              RowHash to prevent global deadlock for HONG.SALES_SUMMARY.
           2) Next, we lock a distinct HONG."pseudo table" for write on a
              RowHash to prevent global deadlock for HONG.SJI.
           3) We lock a distinct HONG."pseudo table" for write on a RowHash to
              prevent global deadlock for HONG.sales.
           4) We lock HONG.SALES_SUMMARY for write, we lock HONG.SJI for write,
              and we lock HONG.sales for write.
           5) We execute the following steps in parallel.
                1) We do an all-AMPs DELETE from 12 partitions of
                   HONG.SALES_SUMMARY with a condition of (
                   "(HONG.SALES_SUMMARY.sale_date >= DATE '1999-01-01') AND
                   (HONG.SALES_SUMMARY.sale_date <= DATE '1999-12-31')").
                2) We do an all-AMPs DELETE from 12 partitions of HONG.SJI with
                   a condition of ("(HONG.SJI.sale_date >= DATE '1999-01-01')
                   AND (HONG.SJI.sale_date <= DATE '1999-12-31')").
                3) We do an all-AMPs DELETE from HONG.sales by way of an
                   all-rows scan with a condition of ("(HONG.sales.sale_date >=
                   DATE '1999-01-01') AND (HONG.sales.sale_date <= DATE
                   '1999-12-31')").
           6) Finally, we send out an END TRANSACTION step to all AMPs involved
              in processing the request.
           -> No rows are returned to the user as the result of statement 1.

    Partitioning the join index on the DATE column helps insert performance if the transaction data is inserted into sales in time sequence. Note that MultiLoad and FastLoad are not supported for base tables with join indexes, so you must load data into sales using other batch methods, such as Teradata Parallel Data Pump, or FastLoad into a staging table followed by either an INSERT … SELECT or MERGE batch request with error logging.
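
    The following is a minimal sketch of the staging‑table approach, assuming a hypothetical staging table named sales_stage with the same column layout as sales. FastLoad populates sales_stage, and the INSERT … SELECT with error logging then moves the rows into sales, at which point the join indexes are maintained.

         CREATE TABLE sales_stage AS sales WITH NO DATA;    -- hypothetical staging table

         /* FastLoad the source data into sales_stage here. */

         CREATE ERROR TABLE FOR sales;                      -- target for LOGGING ERRORS

         INSERT INTO sales
           SELECT sale_date, store_id, prod_id, amount
           FROM sales_stage
           LOGGING ERRORS WITH NO LIMIT;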

    Because the inserted rows are clustered in the data blocks that correspond to the appropriate partitions, the number of data blocks the system must read and write is reduced compared with the NPPI join index case, in which the inserted rows are scattered among all the data blocks.

    Be aware that you cannot drop a join index to enable MultiLoad or FastLoad batch loads until any requests that access that index complete processing. Requests place READ locks on any join indexes they access, and Teradata Database defers processing of any DROP JOIN INDEX requests against locked indexes until their READ locks have all been released.
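
    For example, to run a MultiLoad or FastLoad job against sales, you could drop the join indexes defined in Scenarios A and B, run the load, and then re-create them. A minimal sketch follows.

         DROP JOIN INDEX sales_summary;
         DROP JOIN INDEX sji;

         /* Run the MultiLoad or FastLoad job against sales here, then
            resubmit the CREATE JOIN INDEX requests shown earlier in this
            section to re-create sales_summary and sji.                   */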

    It is also possible for the maintenance of a PPI join index to be less efficient than that of an otherwise identical NPPI join index. For example, if the primary index of the join index does not include the partitioning column set, and a DELETE or UPDATE statement specifies a constraint on that primary index, maintenance of the PPI join index must probe all of the row partitions on the AMP, as opposed to the fast primary index access that is possible for the NPPI join index.
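
    The following sketch illustrates that case with a hypothetical variant of the Scenario B join index whose primary index omits the partitioning column. For the DELETE, Teradata Database can hash on prod_id to locate the affected join index rows, but because those rows can fall in any partition, every row partition on the AMP must be probed.

         CREATE JOIN INDEX sji_by_prod AS          -- hypothetical index name
           SELECT prod_id, sale_date, SUM(amount) AS sales_amount
           FROM sales
           GROUP BY prod_id, sale_date
         PRIMARY INDEX(prod_id)                    -- does not include sale_date
         PARTITION BY RANGE_N(sale_date BETWEEN DATE '1990-01-01'
                                        AND     DATE '2005-12-31'
                                        EACH INTERVAL '1' MONTH);

         DELETE FROM sales WHERE prod_id = 1001;   -- constraint on the join index primary index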