Row Partition Elimination With Multilevel Partitioning
The following examples taken from various industries present examples of how multilevel partitioned primary indexes and row partition elimination can greatly enhance the performance of a query workload.
Example From the Insurance Industry
There are many cases for which row partition elimination using multiple expressions for WHERE clause predicate filtering can enhance query performance (see SQL Request and Transaction Processing for details about the various forms of row partition elimination). For example, consider an insurance company that frequently performs an analysis for a specific state and within a date range that constitutes a relatively small percentage of the many years of claims history in its data warehouse.
If an analysis is being performed only for claims filed in Connecticut, only for claims filed in all states in June, 2005, or only for Connecticut claims filed during the month of June, 2005, a partitioning of the data that allows elimination of all but the desired claims should deliver a dramatic performance advantage.
The following example shows how a claims table could be partitioned by a range of claim dates and subpartitioned by a range of state identifiers using multilevel partitioning.
Consider the following multilevel PPI table definition:
CREATE TABLE claims (
claim_id INTEGER NOT NULL,
claim_date DATE NOT NULL,
state_id BYTEINT NOT NULL,
claim_info VARCHAR(20000) NOT NULL)
PRIMARY INDEX (claim_id)
PARTITION BY (RANGE_N(claim_date BETWEEN DATE '1999-01-01'
AND DATE '2005-12-31'
EACH INTERVAL '1' MONTH),
RANGE_N(state_id BETWEEN 1
AND 75
EACH 1))
UNIQUE INDEX (claim_id);
Eliminating all but one month out of their many years of claims history would facilitate scanning of less than 5% of the claims history (because of business growth, there are many more recent than past claims) for satisfying the following query:
SELECT *
FROM claims
WHERE claim_date BETWEEN DATE '2005-06-01' AND DATE '2005-06-30';
Similarly, eliminating all but the Connecticut claims from the many states in which this insurance company does business would make it possible to scan less than 5% of the claims history to satisfy the following query:
SELECT *
FROM claims, states
WHERE claims.state_id = states.state_id
AND states.state = 'Connecticut';
State selectivity varies by the density of their business book. The company has more business in Connecticut than, for example Oregon and, therefore, a correspondingly larger number of claims for Connecticut. Note that the partitioning in the example specifies up to 75 state_id values to allow for any of the 50 US states plus US territories and protectorates such as Puerto Rico, Guam, American Samoa, the Marshall Islands, Federated States of Micronesia, Northern Mariana Islands, Palau, and the American Virgin Islands in the future.
Combining both of these predicates for row partition elimination makes it possible to scan less than 0.5% of the claims history to satisfy the following query.
SELECT *
FROM claims, states
WHERE claims.state_id = states.state_id
AND states.state = 'Connecticut'
AND claim_date BETWEEN DATE '2005-06-01' AND DATE '2005-06-30';
For evenly distributed data, you would expect scanning of less than 0.25% of the data; however, the data for this company is not evenly distributed among states and dates, leading to a higher percentage of data to scan for the query.
Clearly, combining both predicates for row partition elimination has a significant performance advantage. Row partition elimination by both of these columns, as described, provides higher performance, more space efficiency, and more maintenance efficiency than a composite NUSI or join index for most of the queries run at this insurance company.
In fact, the performance advantage described previously could theoretically be achieved with the current single‑level partitioning by using a single partitioning expression that combines both state and monthly date ranges. There are issues with using such a complex, single partitioning expression in this scenario but it is possible to do so.
The bigger problem is that a significant portion of the workload at this insurance company does not specify equality conditions on both partitioning columns. Though theoretically possible, Teradata Database is currently unable to evaluate more complex conditions with such a partitioning expression. With single‑level partitioning, you would have to choose between partitioning by state or partitioning by date ranges.
If you chose partitioning by state, and a user submits a query that specifies a narrow date range without a state filter in the WHERE clause, the system does not perform row partition elimination. At this insurance company, both state‑level (and, in some cases, also for a specific date range) analysis and enterprise‑level (all states, but for a specific date range) analyses are common.
The users performing state‑level analysis do not want to be penalized by having their data combined with all other states, but the users performing enterprise‑level analyses also want their queries to be reasonably high-performing, at least by getting date range elimination, and be easy to construct.
Example From the Retail Industry
An example for a large retailer is similar to the insurance example, substituting division for state. In this case, the differences in size of partitions are even more exaggerated. For this company, division 1 is all of the United States, and represents greater than 85% of the entire business. But there are a number of divisions in other countries. The retailer currently replicates the data model for each country in order to provide reasonably efficient access to the data for an individual country. However, this is difficult to maintain, and limits the ability to do analyses across the entire company.
With multilevel partitioning, all the data can be stored in the same table, and analyses can be performed efficiently across either all of the data, or with subsets of the data. This example is also applicable to manufacturers with multiple divisions and date associated data.
Example From the Telecommunications Industry
Now consider the telecommunications industry. An example might be providing access to corporate billing information while also using the data for wider analysis. The difference is that there are many more customers than there are states or divisions. For example, suppose the company wants to start with 3,000 business customers, but they expect that number to grow considerably. This in turn results in more finely grained partitions. It also puts a great deal more pressure on the number of partitions and, therefore, limits choices in the granularity of date ranges or the next level partitioning.
You can use multilevel partitioning to improve query performance via row partition elimination, either at each of the partition levels or by combining all of them. Multilevel partitioning provides multiple access paths to the rows in the base table. As with other indexes, the Optimizer determines if the index is usable for a query and, if usable, whether its use provides the estimated least costly plan for executing the query.
You create multilevel partitioning by specifying two or more partitioning expressions, where each expression must be defined using either a RANGE_N function or a CASE_N function exclusively. The system combines the individual partitioning expressions internally into a single partitioning expression that defines how the data is partitioned on an AMP.
For example, assume there are three partitioning expressions, p1, p2, and p3, defined as follows:
The first partitioning expression is the highest level partitioning. Within each of those partitions, the second partitioning expression defines how each of the highest‑level partitions is subpartitioned. Within each of those second‑level partitions, the third‑level partitioning expression defines how each of the second level partitions is subpartitioned.
Within each of these lowest level partitions, rows are ordered by the row hash value of their primary index and their assigned uniqueness value.
Assume that the number of partitions defined at each level is d1, d2, and d3, respectively. Multilevel partitioning organizes the rows of a table in a similar manner as the single-level partitioning expression (referred to as the combined partitioning expression) defined as follows:
where:
Equation element … |
Specifies … |
d1 |
the number of partitions defined at level 1. |
d2 |
the number of partitions defined at level 2. |
d3 |
the number of partitions defined at level 3. |
p1 |
partitioning expression 1. |
p2 |
partitioning expression 2. |
p3 |
partitioning expression 3. |
dd1 |
d2 * d3 |
dd2 |
d3 |
The product of d1 * d2 * d3 cannot exceed 65,535, nor can it be less than 8. This limit restricts the number of levels and the number of partitions that you can define at each level.
Advantages of multilevel partitioning include the following:
For the above single-level partitioning expression, Teradata Database cannot enforce that d1, d2, and d3 are the same as the corresponding number of partitions defined for p1, p2, and p3, respectively, at creation time; it can only enforce that the final result of the expression for a row must be between 1 and 65,535, inclusively, when rows are inserted or updated.
Note that the number of levels of partitioning cannot exceed 15 for 2‑byte partitioning or 62 for 8‑byte partitioning because each level must define at least two partitions. This is because 216 = 65,536 and 263 = 9,223,372,036,854,775,808, both of which exceed the maximum number of partitions that can be defined for a given partitioning by 1. The number of levels of partitioning might be further restricted by other limits such as the maximum size of the table header, data dictionary entry sizes, and so on.