Aggregate Join Indexes | Database Design | Teradata Vantage - Aggregate Join Indexes - Advanced SQL Engine

Aggregate Join Indexes | Database Design | Teradata Vantage - Aggregate Join Indexes - Advanced SQL Engine - Teradata Database

Database Design

Product

Advanced SQL Engine

Teradata Database

Release Number

17.10

Published

July 2021

Language

English (United States)

Last Update

2021-07-27

dita:mapPath

kko1591750222108.ditamap

dita:ditavalPath

kko1591750222108.ditaval

dita:id

B035-1094

lifecycle

Product Category

Teradata Vantage™

An aggregate join index is a database object created using the CREATE JOIN INDEX statement, but specifying one or more columns that are derived from an aggregate expression. An aggregate join index is a join index that specifies MIN, MAX, SUM, COUNT, or that extracts a DATE value aggregate operations. No other aggregate functions are permitted in the definition of a join index; however, most of the other simple aggregate functions can be derived from these using column expressions. You can create aggregate join indexes as either single-table or as multitable join indexes. Aggregate join indexes can also be sparse (see Sparse Join Indexes).

Functions of Aggregate Join Indexes

The primary function of an aggregate join index is to provide the Optimizer with a high-performing, cost-effective means for satisfying any query that specifies a frequently made aggregation operation on one or more columns.

In other words, aggregate join indexes permit you to define a persistent summary table without violating the normalization of the database schema. This allows a join index to precompute an aggregate value that would otherwise potentially require a table scan and sort operation.

Aggregate join indexes can be especially helpful for queries that roll up values for dimensions other that the primary key dimension, which would otherwise require redistribution.

An aggregate join index can be used to cover aggregate queries that only consider a subset of groups contained in the join index or have more join tables than the join index. In order to allow the aggregate join index to be used in this way, its definition must satisfy the following conditions:

The grouping clause must include all columns that are specified in the grouping clause of the query.
All columns in the query WHERE clause that join to tables not in the aggregate join index must be part of the join index definition.
If you define row partitioning for an aggregate join index, its partitioning columns must be members of the column set specified in the GROUP BY clause of the index definition.
In other words, you cannot specify an aggregated column as a partitioning column.
An aggregate join index cannot be column partitioned.

An aggregate join index can also be used to cover the following:

Requests that specify COUNT(DISTINCT) and extended grouping such as CUBE, ROLLUP, and GROUPING SETS.
Requests that specify subqueries or spooled derived tables.
Both the outer and the inner tables in a request that specifies an outer join.

Related Strategies

Other functionally similar strategies for solving this problem can also be used. In general, only prototyping can determine which among the possible choices is best for a particular application environment and hardware configuration.

You can create a global temporary table definition and then populate a materialized instance of it with aggregated result sets. This is not so much an alternate strategy as it is an entirely different strategy designed for an entirely different application.
Global temporary tables are private to the session that materializes them and their data does not persist beyond the end of that session. Unless you specify the ON COMMIT PRESERVE option when the definition for the temporary table is created, its contents do not persist even across individual database transactions. Unless provisions are made to write their contents to a persistent base table when a session ends, their data is not saved. Without numerous safeguards built into the process, this is not a method that provides any assurances about the integrity of the data it produces because while the contents of any global temporary table are private to the session in which it is created, the definition of the table is global within a database and any number of different sessions and users could materialize a different version of the table, populate it, and write the results to the same base table as any other session.

Note that the containing database or user for a global temporary table must have a minimum of 512 bytes of available PERM space to contain the table header for the GTT.
You can create a volatile table with aggregate expressions defined on some or all its columns. The drawbacks of global temporary tables for this application apply equally to volatile tables.
You can create a denormalized base table and populate it with an aggregated result set.
Denormalization always reduces the generality of the database for any ad hoc queries or data mining operations you might want to undertake as well as introducing various problematic update anomalies. While a relatively mild degree of denormalization is standard in physically implemented databases, the sort of denormalization called for by this solution is probably beyond what most DBAs would find acceptable, the enthusiasm of dimensional modeling theorists notwithstanding.

Because there is no mechanism for keeping such a table synchronized with its base table, it can become quickly outdated.

Nonetheless, for some applications this approach might be the only high-performing solution.

Example

This example shows how a simple aggregate join index can be used to create prejoins with aggregation on one or more of its columns while retaining full normalization of the physical database. Strictly speaking, the term normalization applies only to the logical schema for a database, not to its physical schema; however, the term has unfortunately come to be used equally for both logical and physical database schemas. See The Normalization Process for details.

Table Definitions

This example set uses the following table definitions.

     CREATE TABLE customer (
       c_custkey    INTEGER NOT NULL,
       c_name       CHARACTER(26) CASESPECIFIC NOT NULL,
       c_address    VARCHAR(41),
       c_nationkey  INTEGER,
       c_phone      CHARACTER(16),
       c_acctbal    DECIMAL(13,2),
       c_mktsegment CHARACTER(21),
       c_comment    VARCHAR(127))
     UNIQUE PRIMARY INDEX (c_custkey);

     CREATE TABLE orders (
       o_orderkey      INTEGER NOT NULL,
       o_custkey       INTEGER,
       o_orderstatus   CHARACTER(1) CASESPECIFIC,
       o_totalprice    DECIMAL(13,2) NOT NULL,
       o_orderdate     DATE FORMAT 'YYYY-MM-DD' NOT NULL,
       o_orderpriority CHARACTER(21),
       o_clerk         CHARACTER(16),
       o_shippriority  INTEGER,
       o_comment       VARCHAR(79))
     UNIQUE PRIMARY INDEX (o_orderkey);

Example Query Statement

Consider the following aggregate join query.

     SELECT COUNT(*), SUM(o_totalprice)
     FROM orders, customer
     WHERE o_custkey = c_custkey
     AND   o_orderdate > DATE ‘2004-09-20’
     AND   o_orderdate < DATE ‘2004-10-15’
     GROUP BY c_nationkey;

Query Plan: No Aggregate Join Index Defined

Without an aggregate join index, a typical execution plan for this query might involve the following stages:

Redistribute orders into spool.
Sort the spool on o_custkey.
Merge join the sorted spool and the customer file.
Aggregate the result of the merge join.

Aggregate Join Index Definition

Suppose you define the following aggregate join index, which aggregates o_totalprice over a join of orders and customer.

     CREATE JOIN INDEX ord_cust_idx AS
      SELECT c_nationkey, SUM(o_totalprice(FLOAT))
             AS  price, o_orderdate
      FROM orders, customer
      WHERE o_custkey = c_custkey
      GROUP BY c_nationkey, o_orderdate
      ORDER BY o_orderdate;

Query Plan for Aggregate Join Index

The execution plan produced by the Optimizer for this query includes an aggregate step on the aggregate join index, which is much smaller than either one of the join tables. You can confirm this by performing an EXPLAIN on the query.

    EXPLAIN SELECT COUNT(*), SUM(o_totalprice)
            FROM orders, customer
            WHERE o_custkey = c_custkey
            AND   o_orderdate > DATE ‘2005-09-20’
            AND   o_orderdate < DATE ‘2005-10-15’
            GROUP BY c_nationkey;

    *** Help information returned. 18 rows.
     *** Total elapsed time was 3 seconds.

   Explanation
   ---------------------------------------------------------------------
     1) First, we lock TPCD.ord_cust_idx for read on a
        reserved RowHash to prevent a global deadlock.
     2) Next, we do a SUM step to aggregate from join index table
        TPCD.ordcustidx by way of an all-rows scan with a condition of (
        "(TPCD.ord_cust_idx.O_ORDERDATE > DATE '2005-09-20') AND
        (TPCD.ord_cust_idx.O_ORDERDATE < DATE '2005-10-15')"), and the
        grouping identifier in field 1. Aggregate Intermediate Results
        are computed globally, then placed in Spool 2. The size of Spool
        2 is estimated to be 1 row.
     3) We do an all-AMPs RETRIEVE step from Spool 2 (Last Use) by way of
        an all-rows scan into Spool 1, which is built locally on the AMPs.
        The size of Spool 1 is estimated with no confidence to be 1 row.
        The estimated time for this step is 0.17 seconds.
     4) Finally, we send out an END TRANSACTION step to all AMPs involved
        in processing the request.
   -> The contents of Spool 1 are sent back to the user as the result of
      statement 1.

Aggregate Join Index With EXTRACT Function

Join index definitions, both simple and aggregate, support the EXTRACT function. This example illustrates the use of the EXTRACT function in the definition of an aggregate join index.

Aggregate Join Index Definition

The index is defined as follows.

     CREATE JOIN INDEX ord_cust_idx_2 AS
       SELECT c_nationkey, SUM(o_totalprice(FLOAT)) AS  price, 
              EXTRACT(YEAR FROM o_orderdate) AS o_year
       FROM orders, customer
       WHERE o_custkey = c_custkey
       GROUP BY c_nationkey, o_year
       ORDER BY o_year;

The aggregation is based only on the year of o_orderdate, which has fewer groups than the entire o_orderdate, so ord_cust_idx_2 is much smaller than ord_cust_idx.

On the other hand, the use for ord_cust_idx_2 is more limited than ord_custidx. In particular, ord_cust_idx_2 can only be used to satisfy queries that select full years of orders.

Example Query Statement

While the join index defined for this example, ord_cust_idx_2, cannot be used for the query analyzed in Query Plan for Aggregate Join Index above, the following query does profit from its use because dates are on year boundaries.

     SELECT COUNT(*), SUM(o_totalprice)
     FROM orders, customer
     WHERE o_custkey = c_custkey
     AND   o_orderdate > DATE ‘2004-01-01’
     AND   o_orderdate < DATE ‘2004-12-31’
     GROUP BY c_nationkey;