An aggregate join index is a database object created using the CREATE JOIN INDEX statement, but specifying one or more columns that are derived from an aggregate expression. An aggregate join index is a join index that specifies MIN, MAX, SUM, COUNT, or that extracts a DATE value aggregate operations. No other aggregate functions are permitted in the definition of a join index; however, most of the other simple aggregate functions can be derived from these using column expressions. You can create aggregate join indexes as either single-table or as multitable join indexes. Aggregate join indexes can also be sparse (see Sparse Join Indexes).
Functions of Aggregate Join Indexes
The primary function of an aggregate join index is to provide the Optimizer with a high-performing, cost-effective means for satisfying any query that specifies a frequently made aggregation operation on one or more columns.
In other words, aggregate join indexes permit you to define a persistent summary table without violating the normalization of the database schema. This allows a join index to precompute an aggregate value that would otherwise potentially require a table scan and sort operation.
Aggregate join indexes can be especially helpful for queries that roll up values for dimensions other that the primary key dimension, which would otherwise require redistribution.
An aggregate join index can be used to cover aggregate queries that only consider a subset of groups contained in the join index or have more join tables than the join index. In order to allow the aggregate join index to be used in this way, its definition must satisfy the following conditions:
- The grouping clause must include all columns that are specified in the grouping clause of the query.
- All columns in the query WHERE clause that join to tables not in the aggregate join index must be part of the join index definition.
- If you define row partitioning for an aggregate join index, its partitioning columns must be members of the column set specified in the GROUP BY clause of the index definition.
In other words, you cannot specify an aggregated column as a partitioning column.
- An aggregate join index cannot be column partitioned.
An aggregate join index can also be used to cover the following:
- Requests that specify COUNT(DISTINCT) and extended grouping such as CUBE, ROLLUP, and GROUPING SETS.
- Requests that specify subqueries or spooled derived tables.
- Both the outer and the inner tables in a request that specifies an outer join.
Other functionally similar strategies for solving this problem can also be used. In general, only prototyping can determine which among the possible choices is best for a particular application environment and hardware configuration.
- You can create a global temporary table definition and then populate a materialized instance of it with aggregated result sets. This is not so much an alternate strategy as it is an entirely different strategy designed for an entirely different application.
Global temporary tables are private to the session that materializes them and their data does not persist beyond the end of that session. Unless you specify the ON COMMIT PRESERVE option when the definition for the temporary table is created, its contents do not persist even across individual database transactions. Unless provisions are made to write their contents to a persistent base table when a session ends, their data is not saved. Without numerous safeguards built into the process, this is not a method that provides any assurances about the integrity of the data it produces because while the contents of any global temporary table are private to the session in which it is created, the definition of the table is global within a database and any number of different sessions and users could materialize a different version of the table, populate it, and write the results to the same base table as any other session.
Note that the containing database or user for a global temporary table must have a minimum of 512 bytes of available PERM space to contain the table header for the GTT.
- You can create a volatile table with aggregate expressions defined on some or all its columns. The drawbacks of global temporary tables for this application apply equally to volatile tables.
- You can create a denormalized base table and populate it with an aggregated result set.
Denormalization always reduces the generality of the database for any ad hoc queries or data mining operations you might want to undertake as well as introducing various problematic update anomalies. While a relatively mild degree of denormalization is standard in physically implemented databases, the sort of denormalization called for by this solution is probably beyond what most DBAs would find acceptable, the enthusiasm of dimensional modeling theorists notwithstanding.
Because there is no mechanism for keeping such a table synchronized with its base table, it can become quickly outdated.
Nonetheless, for some applications this approach might be the only high-performing solution.
This example shows how a simple aggregate join index can be used to create prejoins with aggregation on one or more of its columns while retaining full normalization of the physical database. Strictly speaking, the term normalization applies only to the logical schema for a database, not to its physical schema; however, the term has unfortunately come to be used equally for both logical and physical database schemas. See The Normalization Process for details.
This example set uses the following table definitions.
CREATE TABLE customer ( c_custkey INTEGER NOT NULL, c_name CHARACTER(26) CASESPECIFIC NOT NULL, c_address VARCHAR(41), c_nationkey INTEGER, c_phone CHARACTER(16), c_acctbal DECIMAL(13,2), c_mktsegment CHARACTER(21), c_comment VARCHAR(127)) UNIQUE PRIMARY INDEX (c_custkey); CREATE TABLE orders ( o_orderkey INTEGER NOT NULL, o_custkey INTEGER, o_orderstatus CHARACTER(1) CASESPECIFIC, o_totalprice DECIMAL(13,2) NOT NULL, o_orderdate DATE FORMAT 'YYYY-MM-DD' NOT NULL, o_orderpriority CHARACTER(21), o_clerk CHARACTER(16), o_shippriority INTEGER, o_comment VARCHAR(79)) UNIQUE PRIMARY INDEX (o_orderkey);
Example Query Statement
Consider the following aggregate join query.
SELECT COUNT(*), SUM(o_totalprice) FROM orders, customer WHERE o_custkey = c_custkey AND o_orderdate > DATE ‘2004-09-20’ AND o_orderdate < DATE ‘2004-10-15’ GROUP BY c_nationkey;
Query Plan: No Aggregate Join Index Defined
Without an aggregate join index, a typical execution plan for this query might involve the following stages:
- Redistribute orders into spool.
- Sort the spool on o_custkey .
- Merge join the sorted spool and the customer file.
- Aggregate the result of the merge join.
Aggregate Join Index Definition
Suppose you define the following aggregate join index, which aggregates o_totalprice over a join of orders and customer .
CREATE JOIN INDEX ord_cust_idx AS SELECT c_nationkey, SUM(o_totalprice(FLOAT)) AS price, o_orderdate FROM orders, customer WHERE o_custkey = c_custkey GROUP BY c_nationkey, o_orderdate ORDER BY o_orderdate;
Query Plan for Aggregate Join Index
The execution plan produced by the Optimizer for this query includes an aggregate step on the aggregate join index, which is much smaller than either one of the join tables. You can confirm this by performing an EXPLAIN on the query.
EXPLAIN SELECT COUNT(*), SUM(o_totalprice) FROM orders, customer WHERE o_custkey = c_custkey AND o_orderdate > DATE ‘2005-09-20’ AND o_orderdate < DATE ‘2005-10-15’ GROUP BY c_nationkey; *** Help information returned. 18 rows. *** Total elapsed time was 3 seconds. Explanation --------------------------------------------------------------------- 1) First, we lock a distinct TPCD."pseudo table" for read on a RowHash to prevent global deadlock for TPCD.ord_cust_idx. 2) Next, we lock TPCD.ord_cust_idx for read. 3) We do a SUM step to aggregate from join index table TPCD.ordcustidx by way of an all-rows scan with a condition of ( "(TPCD.ord_cust_idx.O_ORDERDATE > DATE '2005-09-20') AND (TPCD.ord_cust_idx.O_ORDERDATE < DATE '2005-10-15')"), and the grouping identifier in field 1. Aggregate Intermediate Results are computed globally, then placed in Spool 2. The size of Spool 2 is estimated to be 1 row. 4) We do an all-AMPs RETRIEVE step from Spool 2 (Last Use) by way of an all-rows scan into Spool 1, which is built locally on the AMPs. The size of Spool 1 is estimated with no confidence to be 1 row. The estimated time for this step is 0.17 seconds. 5) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> The contents of Spool 1 are sent back to the user as the result of statement 1.
Aggregate Join Index With EXTRACT Function
Join index definitions, both simple and aggregate, support the EXTRACT function. This example illustrates the use of the EXTRACT function in the definition of an aggregate join index.
Aggregate Join Index Definition
The index is defined as follows.
CREATE JOIN INDEX ord_cust_idx_2 AS SELECT c_nationkey, SUM(o_totalprice(FLOAT)) AS price, EXTRACT(YEAR FROM o_orderdate) AS o_year FROM orders, customer WHERE o_custkey = c_custkey GROUP BY c_nationkey, o_year ORDER BY o_year;
The aggregation is based only on the year of o_orderdate, which has fewer groups than the entire o_orderdate , so ord_cust_idx_2 is much smaller than ord_cust_idx .
On the other hand, the use for ord_cust_idx_2 is more limited than ord_custidx . In particular, ord_cust_idx_2 can only be used to satisfy queries that select full years of orders.
Example Query Statement
While the join index defined for this example, ord_cust_idx_2 , cannot be used for the query analyzed in Query Plan for Aggregate Join Index above, the following query does profit from its use because dates are on year boundaries.
SELECT COUNT(*), SUM(o_totalprice) FROM orders, customer WHERE o_custkey = c_custkey AND o_orderdate > DATE ‘2004-01-01’ AND o_orderdate < DATE ‘2004-12-31’ GROUP BY c_nationkey;