Table Structure and Update Schedule
While a transaction date or timestamp is frequently a good choice for the partitioning column, other choices present themselves for consideration for other categories of workloads and data. Consider a table with detailed information about telephone calls maintained by a telecommunications company. This table stores rows that contain the originating telephone number, a timestamp for the beginning of the call, and the duration of the call, among other things, for each outgoing call. Rows are retained for a variable length of time, but rarely for more than six weeks. Retention is based on the call date and the monthly bill preparation date. The primary index is a composite of the telephone number and the call-start timestamp. This column set implies that the index was chosen to provide good distribution, not to facilitate data access, and also that the likelihood of any direct primary index joins is remote.
Some queries analyze all calls from a particular telephone number, while others analyze all calls for a particular period of time, perhaps for as long as a month, for customers meeting certain criteria.
The current definition of the call detail table does not use a partitioned primary index:
CREATE TABLE calldetail ( phone_number DECIMAL(10) NOT NULL, call_start TIMESTAMP, call_duration INTEGER, call_description VARCHAR(30)) PRIMARY INDEX (phone_number, call_start);
Can the standard query workloads against this table be optimized by partitioning its primary index?
Analysis of Partitioning Benefits
There are two candidate schemes for partitioning this table. The first partitioning scheme is to partition the rows by call_start, probably defining one partition for each day. The results of this partitioning scheme for this table and query workload look like this.
- The timestamp partitioning scheme helps with inserting new transaction activity in the same way as the previous scenario (Scenario 1).
- Little performance gain is realized for row deletion because the delete operations are not performed strictly by call date. Because of this, the deleted rows are not clustered in a single partition.
- The analysis queries based on call_start benefit from this partitioning scheme, with those queries that specify a range of a few days realizing the greatest gain.
The second possibility for partitioning this table is to use the telephone number. Telephone numbers contain too many digits to assign each number its own partition, but a subset of the digits can be used profitably to optimize telephone number-based row access. If, for example, the table is partitioned on the high order three digits of the telephone number, there are 1,000 partitions, some of which are always be empty because of the way telephone companies assign numbers. The results of this partitioning scheme for this table and query workload look like this:
- This partitioning scheme does not improve the performance of Teradata Parallel Transporter bulk inserts or deletes because they are scattered across all partitions.
- The scheme does not facilitate date-based queries.
- The scheme allows queries that specify a telephone number to run much faster than they otherwise would because only one partition out of 500 or more partitions must be read to access the rows having that number.
- The scheme benefits geographic area analysis, at least in North America, because the first three digits of North American telephone numbers uniquely identify a narrowly defined geographic region.
If 1,000 partitions improve performance, then defining 10,000 partitions using the first four digits of the telephone number would probably improve performance even more. If 10,000 partitions are good, then 50,000 partitions might be better yet.
The DDL for the redefined call_detail table looks like this:
CREATE TABLE ppi_call_detail ( phone_number DECIMAL(10) NOT NULL, call_start TIMESTAMP, call_duration INTEGER, call_description VARCHAR(30)) PRIMARY INDEX (phone_number, call_start) PARTITION BY RANGE_N(phone_number/100000 (INTEGER) BETWEEN 0 AND 99999 EACH 2);
If mapping a geographic area to one or more partitions does not solve an application problem, then another potential solution is to maximize the number of partitions by using telephone number modulo 65,535 as the partitioning expression. Assuming the cardinality of the table is roughly 3.276 billion rows, then the average partition contains roughly 50,000 rows with this scheme. If the system has 100 AMPS, then each AMP contains roughly 500 rows per partition, a number that fits into one data block if the row width is fairly narrow. The decrease in response time of a single-partition scan for all activity for a particular telephone number is dramatic compared to the full-table scan that would be required for a nonpartitioned table.
A query to return activity for one telephone number from this table is a best case scenario for single-table response time improvement by using row partitioning. Disregarding the overhead cost of initiating the query and returning the answer set, the elapsed time could be reduced to 1/65535 of the time using a nonpartitioned table. Including the query initiation and termination overhead, the total query time improvement would be somewhat less than a factor of 65,535, but could be less than 1/1,000 of the nonpartitioned primary index time. The DDL for the ppi_call_detail table using this scheme is as follows.
CREATE TABLE ppi_call_detail ( phone_number DECIMAL(10)NOT NULL, call_start TIMESTAMP, call_duration INTEGER, other_columns CHARACTER(30)) PRIMARY INDEX (phone_number, call_start) PARTITION BY phone_number MOD 65535 +1;
The workload mix determines which, if any, of these proposed partitioning schemes is best. You can begin your analysis with the extended logical data model, but actual testing of the various proposed scenarios is often required.
The partitioning scheme you use should be tied directly to the application workload accessing its data. For this scenario, the proposed partitioning schemes affect geographic localization applications differently than they do call date-based applications. To decide which scheme is better, you need to know all the various ways the table is currently accessed and have a solid concept of how it might be accessed in the future.
You need to examine the relative efficiencies of different partitioning schemes for the same table must from that perspective.