Table Structure and Update Schedule
Assume that a retail enterprise has a large, nonpartitioned primary index sales table containing the details of each transaction for the previous 24 full months plus the current month-to-date. Once a month, the transactions from the oldest month are deleted from the table. Current transactions are loaded into the table nightly using Teradata Parallel Transporter. Most transactions are added on the date they occur, but a small percentage of transactions might be reported a few days after they occur. The number of transactions per month is roughly the same for all months.
Each row contains, among other things, the product code for the item, the transaction date, an identifier for the sales agent, and the quantity sold. The rows are short, and the data blocks are large. The primary index is a composite of product code, transaction date, and the agent identifier.
Query Workload
The query workload is a mix of tactical and strategic requests:
- There is a modest volume of short-running single-AMP (primary index) queries.
- There are many ad hoc queries follow a general pattern of comparing current-month-to-date sales to the same days of the previous month, or to the same days of the same month of the previous year for a few product code values.
- Some queries analyze agent performance, usually over an interval of a calendar quarter or less.
- Some queries examine sales trends over the previous 24 full months, usually for most or all product code values.
All the tables that support the workload have different primary indexes.
The sales table is frequently joined to relatively small tables containing information about each product code and each sales agent.
Problem Statement
The current definition of sales_table does not use row partitioning.
CREATE TABLE sale_stable ( product_code CHARACTER(8), sales_date DATE, agent_id CHARACTER(8), quantity_sold INTEGER, product_description VARCHAR(50)) PRIMARY INDEX (product_code,sales_date,agent_id);
The DBA has been told to speed up the ad hoc queries and agent analysis queries. He considers two possible optimizations, neither of which uses row partitioning.
- Define either a UPI or a join index on the transaction date column.
The DBA then sets up tests for both scenarios. Unfortunately, the EXPLAIN reports show that the optimizer finds neither index to be selective enough to improve performance over a full-table scan, and does not use them.
- Split the table into 25 separate tables, each containing transactions for a calendar month, and then define a view that UNIONs all the tables. This view is intended to be used by the applications that analyze 24 months of sales history.
After some analysis, he concludes that this solution could indeed speed up the targeted queries, but that it also adds too much complexity for his end users, who would now have to understand the view structure and change the table names in their queries, code more complicated UNION statements, and select appropriate date and product code ranges. The requirement to know the right table name also applies to the short-running single-AMP queries that specify primary index values. This proposed solution also complicates the nightly load jobs, especially in the first few days of a month when a small number of the transactions would be from the prior month, as well as complicating his long successful archive strategy. The DBA ultimately rejects this alternative as being too complicated and error-prone.
With these negative results in hand, the DBA next considers redefining the sales table with row partitioning.
Analysis of Row Partitioning Benefits
Thinking that row partitioning might be an easy way to optimize his workloads for this situation, the DBA converts the sales table into a partitioned table partitioned by transaction month and finds that many of his critical queries run faster with no significant negative tradeoffs.
The DDL for the redefined sales table looks like this:
CREATE TABLE ppi_sales_table ( product_code CHARACTER(8), sales_date DATE, agent_id CHARACTER(8), quantity_sold INTEGER, product_description VARCHAR(50)) PRIMARY INDEX (product_code,sales_date,agent_id) PARTITION BY RANGE_N(sales_date BETWEEN DATE ’2001-06-01’ AND DATE ’2003-06-31’ EACH INTERVAL ’1’ MONTH);
The following description examines each element of the workload as it relates to the newly redefined sales table, ppi_sales_table.
- The monthly deletes are faster because instead of a monthly Teradata Parallel Transporter delete job, the DBA can now perform a simple monthly ALTER TABLE statement to do the following things:
- Drop the entire partition that contains the oldest data.
- Create a few new partitions to hold data for the next few future months.
Deleting all the rows in a partition is optimized in much the same way that deleting all the rows in a nonpartitioned primary index table is optimized. For example.
- There is no need to record the individual rows in the transient journal as they are deleted.
- The rows for the month being deleted are stored contiguously on each AMP instead of being scattered more or less evenly among all the data blocks of the table, so there are fewer blocks to read.
- Most of the deletes are full-block deletes, so the block does not need to be rewritten.
- There is no need to touch any of the rows for the other months.
Dropping the oldest partition set is a nearly instantaneous operation, assuming there are no USIs or join indexes that must be updated.
The DDL for this ALTER TABLE statement looks like this for a selected monthly update:
ALTER TABLE sales_table MODIFY PRIMARY INDEX (product_code, sales_date, agent_id) DROP RANGE BETWEEN DATE ’2001-06-01’ AND DATE ’2001-06-30’ ADD RANGE BETWEEN DATE ’2003-07-01’ AND DATE ’2003-07-31’ WITH DELETE;
- The nightly Teradata Parallel Transporter insert job should run somewhat faster than before. Instead of the inserted rows distributing more or less evenly among all the data blocks of the table, they are concentrated in data blocks that correspond to the proper month. This increases the average hits per block count, which is a key measure of Teradata Parallel Transporter efficiency, as well as reducing the number of blocks that must be rewritten.
- The short-running, single-AMP queries are not affected by partitioning. Because the partitioning column is a component of the primary index, primary index access to the table is not changed.
- The largest gain is seen in the ad hoc queries comparing current month sales to a prior month. Only two of the 25 partitions need to be read, instead of the full-table scan required for a nonpartitioned table. This means that the number of disk reads will be reduced by roughly 92%, with a corresponding reduction in elapsed query response time. The 92% figure applies to the step that reads the sales table, not to the sum of all the steps used to implement the query.
Given the stated assumptions, the other steps should take roughly the same amount of time to complete as they did previously.
- The same considerations apply to the agent analysis queries. The number of partitions that must be read is determined by the time period specified in the query is known in most cases to be three or fewer. Even if the analysis is over twelve full months, there is roughly a 50% gain in reading twelve of 25 partitions for the step that reads the sales table.
- The decision support queries that analyze 24 months of sales data take roughly the same time and resources as they did for a nonpartitioned primary index table, although there is a marginal gain realized by reading 24 instead of 25 partitions. If the analysis is for 24 months plus the current month (the entire table), then the resource usage is the same as with the nonpartitioned primary index incarnation of the table because full-table scans are not affected by partitioning the primary index.
- The joins take the same amount of time. In the defined workload, the EXPLAIN reports indicate that there are no direct joins against the sales table. Instead, spool is always created from the sales table and the spool is then joined to the smaller tables. In the partitioned table case, the spool might be created more efficiently, but once created it is exactly the same as it is for the nonpartitioned table.
Conclusions
The results show there are few disadvantages, and some significant advantages, to row partitioning this table given the workloads that access it. The partitioned table requires somewhat more disk space than its nonpartitioned counterpart. There is a 2-byte or 8-byte partition number recorded in each row that consumes additional storage space; however, the percentage increase seen for most row sizes does not exceed about 5%, and is often considerably less than that.
The following table summarizes the improvement opportunities for this case study:
Activity | Nonpartitioned Table | Partitioned Table | Improvement | Comments |
---|---|---|---|---|
Monthly delete of one month of data | Teradata Parallel Transporter job reads most blocks, updates most blocks | ALTER TABLE statement deletes partition | Much faster performance | Easier maintenance |
Nightly inserts | Inserted rows scattered throughout table | Inserted rows concentrated in one partition | Faster performance | No changes to load script needed |
Primary index access | 1 block read | 1 block read | No change | No SQL changes needed |
Comparison of current month to prior month | All blocks read | 2 partitions read | Step is 12 times faster (only 2 of 25 partitions read) | No SQL changes needed |
Trend analysis over entire table | All blocks read | All blocks read | Little change |
|
Joins | No direct Merge Joins | No direct Merge Joins | Little change | No direct merge joins in this example because of the choice of primary index. |