Basing Partitioning on a Numeric Column
This is the form that is the easiest to code, but is rarely the optimal choice.
If a table has a division_number column defined as an integer with values between 1 and 65535, the following CREATE TABLE fragment can be used to establish partitioning by the division number:
CREATE TABLE … PRIMARY INDEX (invoice_number) PARTITION BY division_number;
If the division_number column for an attempted INSERT or UPDATE request has a value of 0 or less, is null, or is greater than 65535, then the request aborts and the system returns an error to the requestor. The row is not inserted or updated.
This clause can also be used if division_number is defined with a CHARACTER data type, provided the values are always digits so that the column can be cast to INTEGER, because the system performs an implicit cast to the INTEGER type whenever necessary.
You can specify as many as 65,535 partitions if there are that many distinct values for the column. If you assume the more likely situation in which a company has, perhaps, four divisions, then there are only four partitions with rows. The table is not any larger because of the empty partitions, though the default Optimizer assumption that there are 65,535 partitions might sometimes mislead it into making suboptimal plan choices.
A disadvantage of this form is that the partitioning cannot be altered unless the table is empty.
Basing Partitioning on Modulo Partitioning of a Numeric Column
This form uses a single column with a wide range of values, and maximizes the number of partitions. As an example, assume that a telephone company has a table with detailed information about each outgoing telephone call. One of the columns is the originating phone number, defined with the DECIMAL data type. A possible, and perhaps useful, partitioning expression can be defined as follows:
CREATE TABLE … PRIMARY INDEX (phone_number, call_start) PARTITION BY phone_number MOD 65535 + 1;
This partitioning expression, assuming millions of subscribers, populates each of the 65,535 partitions. Some partitions might have more rows than others, because some customers make more phone calls than others, but the distribution among the partitions should be somewhat even. This partitioning expression can improve the performance of a query that examines all phone calls made from a particular number by orders of magnitude by scanning only one partition out of 65,535 instead of the entire table.
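The arithmetic behind this expression can be sketched in Python (an illustrative sketch only, not Teradata SQL; the function name is an assumption):

```python
# Sketch of the modulo partitioning arithmetic (illustrative, not Teradata
# SQL): MOD 65535 yields a value from 0 through 65534, and the + 1 shifts
# the result into the valid partition-number range 1 through 65535.

def partition_for(phone_number):
    return phone_number % 65535 + 1
```

Because the modulo wraps, every possible phone number maps to one of the 65,535 partitions.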
Some disadvantages of this form are that the partitioning cannot be altered unless the table is empty, a maximum of 65,535 partitions can be defined, and row partition elimination for queries is usually limited to constant or USING value equality conditions on the partitioning column.
Basing the Partitioning Expression on Two or More Numeric Columns
This form uses arithmetic operations, typically multiplication and addition, on two or more numeric columns with suitably small value ranges. Assume a table with a three-digit product code and a two-digit store number. The store numbers count consecutively from 0, and there are fewer than 65 stores. This table can be partitioned as follows:
CREATE TABLE … PRIMARY INDEX (store_number, product_code, sales_date) PARTITION BY store_number * 1000 + product_code;
If many queries specify both store_number and product_code, this might be a useful partitioning expression. One downside is that it fails if the number of products grows to the point that a four-digit product code is required (the code would spill into the next store's block of partition numbers), or if a store number greater than 64 is needed (65 * 1000 + 999 = 65,999 exceeds the 65,535 maximum).
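The combined-column arithmetic and its failure mode can be sketched in Python (illustrative only, not Teradata SQL; the function name is an assumption):

```python
# Sketch of the combined-column arithmetic (illustrative, not Teradata SQL).
# Reserving three digits for the product code gives each store a block of
# 1,000 partition numbers. The scheme breaks if a product code needs four
# digits (it would spill into the next store's block) or if a store number
# above 64 is used, because 65 * 1000 + 999 = 65,999 exceeds 65,535.

MAX_PARTITIONS = 65535

def partition_for(store_number, product_code):
    p = store_number * 1000 + product_code
    if not (1 <= p <= MAX_PARTITIONS):
        raise ValueError("partitioning expression out of range")
    return p
```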
Note that the Optimizer assumes 65,535 partitions even though some might be empty. The table is not any larger because of the empty partitions, though the Optimizer default assumption that there are 65,535 partitions based on the specification might sometimes mislead it into making suboptimal plan choices.
Some disadvantages of this form are that the partitioning cannot be altered unless the table is empty and row partition elimination for queries is usually limited to constant or USING value equality conditions on both of the partitioning columns.
An alternative is to use multilevel partitioning, as demonstrated by the following CREATE TABLE request:
CREATE TABLE … PRIMARY INDEX (store_number, product_code, sales_date) PARTITION BY (RANGE_N(store_number BETWEEN 0 AND 64 EACH 1), RANGE_N(product_code BETWEEN 1 AND 999 EACH 1));
Basing the Partitioning Expression on a CASE_N Function
You can use the CASE_N function to concisely define a partitioning expression for which each partition contains data based on an associated condition. When you specify CASE_N for a single-level partitioning expression, two partition numbers, NO CASE and UNKNOWN, are automatically reserved for specific uses.
The CASE_N function is patterned after the SQL CASE expression (see SQL Functions, Operators, Expressions, and Predicates). The function returns an INTEGER value numbered from 1, indicating which CASE_N condition first evaluated to TRUE for the particular value. The returned value can map directly to a partition number or be further modified to calculate the partition number.
Assume a table has a total_revenue column, defined as DECIMAL. The table can be partitioned on that column, so that low revenue products are separated from high revenue products. The partitioning expression can be written as follows:
CREATE TABLE … PRIMARY INDEX (store_id, product_id, sales_date) PARTITION BY CASE_N (total_revenue < 10000, total_revenue < 100000, total_revenue < 1000000, NO CASE, UNKNOWN);
This request defines 5 partitions, conceptually numbered from 1 to 5 in the order they are specified in the partitioning expression.
|This partition number …||Represents …|
|1||products with total_revenue less than 10,000.|
|2||products with total_revenue of at least 10,000, but less than 100,000.|
|3||products with total_revenue of at least 100,000, but less than 1,000,000.|
|4 (NO CASE)||any value that does not evaluate to TRUE or UNKNOWN for any previous CASE_N condition, which in this case is total_revenue equal to or greater than 1,000,000.|
|5 (UNKNOWN)||values for which it is not possible to determine the truth value of a previous CASE_N condition.|
For this partitioning condition, a row with a null for total_revenue is assigned to the UNKNOWN partition, because by definition, it is not possible to evaluate whether a null is less than 10,000.
The rows in the NO CASE and UNKNOWN partitions are valid, and the system accesses those partitions to process queries unless the query conditions exclude them. By defining NO CASE and UNKNOWN partitions, you ensure that any possible value maps to a partition. In the absence of those partitions, some values cause errors and so are not inserted into the table. In practice, it is probably better to have more than three revenue ranges unless the queries against this table rarely specify narrower revenue ranges.
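The CASE_N mapping for the total_revenue example can be sketched in Python (illustrative only, not Teradata SQL; the function name is an assumption):

```python
# Sketch of the CASE_N mapping for the total_revenue example (illustrative,
# not Teradata SQL). None models SQL NULL: comparing NULL yields UNKNOWN,
# so such rows land in the UNKNOWN partition (5); values that satisfy no
# condition land in the NO CASE partition (4).

def case_n_partition(total_revenue):
    if total_revenue is None:
        return 5                                    # UNKNOWN
    for number, limit in enumerate((10000, 100000, 1000000), start=1):
        if total_revenue < limit:
            return number                           # first TRUE condition
    return 4                                        # NO CASE
```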
This example demonstrates that CASE_N can be used to define complicated partitioning expressions tailored to a specific table and specific query workloads:
CREATE TABLE … PRIMARY INDEX (col1, col2) PARTITION BY CASE_N (col3 < 6, col3 >= 8 AND col3 < 10 AND col4 <> 12, col5 <> 10 OR col3 = 20, NO CASE OR UNKNOWN);
Without knowing the meaning and data demographics of the columns, there is no way of knowing whether this partitioning expression is useful.
Note that the NO CASE and UNKNOWN partitions are combined into a single partition in this example.
Unlike the case for the previous partitioning expression examples, the Optimizer knows how many partitions are defined when CASE_N is used as the partitioning expression and does not have to assume a default number of 65,535 partitions.
There are several disadvantages to this form:
- The partitioning of the table cannot be altered unless it is empty.
- Row partition elimination for queries is often limited to constant or USING value equality conditions on the partitioning columns.
- The Optimizer might not eliminate some row partitions that it possibly could if the partitioning were better conceived.
- As the number of conditions increases, evaluating a CASE_N function can be costly in terms of CPU cycles, and a CASE_N might also contribute to causing the table header to exceed its maximum size limit. Therefore, you may need to limit the number of conditions you define in the CASE_N function to a relatively small number.
Considerations for Basing a Partitioning Expression on a CASE_N Function
Building a partitioning expression on CASE_N is a reasonable thing to do only if the following items are all true:
- The partitioning expression defines a mapping between conditions and INTEGER numbers.
- There are a limited number of conditions (too many conditions can lead to excessive CPU usage or cause the table header to exceed its maximum size limit).
- Your query workloads against the table use equality conditions on the partitioning columns to specify a single partition.
- You have no need to alter the partitioning of the table.
- You get the plans and data maintenance you need.
Basing a Partitioning Expression on a RANGE_N Function
The RANGE_N function simplifies the specification of common partitioning expressions in which each partition contains a range of numeric data, and it is especially useful when the column contains a date or timestamp. RANGE_N returns an INTEGER, numbered from 1, determined by the first range that includes the column value; the returned value can map directly to a partition number or can be further modified to calculate the partition number. When you use a RANGE_N function to define a single-level partitioning expression, two partition numbers, NO RANGE and UNKNOWN, are reserved for specific uses.
Assume a table with 7 years of order data, ranging from 2001 through 2007. The next partitioning expression creates 84 partitions, one for each month of the period covered by the data:
CREATE TABLE … PRIMARY INDEX (order_number) PARTITION BY RANGE_N (order_date BETWEEN DATE '2001-01-01' AND DATE '2007-12-31' EACH INTERVAL '1' MONTH);
Each partition contains roughly the same number of rows, assuming that order volume stayed roughly constant across the seven-year interval. Neither a NO RANGE nor an UNKNOWN partition is defined because this partitioning expression is intended for data whose dates always fall within the specified ranges.
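The monthly partition numbering can be sketched in Python (illustrative only, not Teradata SQL; the function name is an assumption): each month from 2001-01 through 2007-12 maps to one of 84 consecutive partition numbers.

```python
# Sketch of the monthly partition numbering (illustrative, not Teradata
# SQL): the partition number counts months elapsed since the starting
# boundary, plus one.

import datetime

START = datetime.date(2001, 1, 1)

def month_partition(order_date):
    return (order_date.year - START.year) * 12 + (order_date.month - START.month) + 1
```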
It is frequently desirable to have each partition contain roughly the same number of rows, but it is not required. The next example puts the older orders into partitions with coarser granularity, and the newer orders into partitions with finer granularity:
CREATE TABLE … PRIMARY INDEX (order_number) PARTITION BY RANGE_N (order_date BETWEEN DATE '1994-01-01' AND DATE '1997-12-31' EACH INTERVAL '2' YEAR, DATE '1998-01-01' AND DATE '2000-12-31' EACH INTERVAL '1' YEAR, DATE '2001-01-01' AND DATE '2003-12-31' EACH INTERVAL '6' MONTH, DATE '2004-01-01' AND DATE '2007-12-31' EACH INTERVAL '1' MONTH, NO RANGE, UNKNOWN);
In this example, the more recent data is partitioned more finely than the older data. This can be a good strategy if you know that the older data is rarely accessed except as part of a full-table scan, because defining a smaller number of partitions reduces some potential disadvantages. However, maintaining this structure over long periods of time is not as simple as maintaining a structure in which each interval covers the same duration. In this example, the years 2004 through 2007 are partitioned by month. As time passes, and the 2004 data becomes older and less frequently referenced, it becomes necessary to repartition the table if the pattern of defining longer time intervals for older data is to be maintained.
It is both easy and fast to add and drop partitions from the ends of a partitioning definition, but repartitioning intervals in the middle partitions requires much more work, and it is usually faster to reload the data. In addition, when a range partition is dropped, any rows in that partition are moved from the dropped range partition to the NO RANGE partition or to an added range partition. Also, when a range is added, rows might need to be moved from the NO RANGE partition to the new range partition.
Some expansion room is allowed for future dates by specifying the final partition as extending to the end of 2007. It is easy to add and drop ranges at the end of a partitioning expression using the ALTER TABLE statement. While you could have specified an ending date far in the future, such as 2099-12-31, it is generally not desirable to define hundreds of partitions that will not be used for decades.
The example shows intervals of years and months. It is also possible to partition by day (EACH INTERVAL '1' DAY) or by week (EACH INTERVAL '7' DAY). A seven-day interval can start on any day of the week, so if you want to start the weekly intervals on Sunday, for example, choose a beginning date that falls on a Sunday. Also, the last range might be shorter than the EACH clause specifies. For example, with seven-day intervals, if the first date falls on a Sunday and the last date also falls on a Sunday, the last range spans only one day.
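The boundary behavior for seven-day intervals can be sketched in Python (illustrative only, not Teradata SQL; the function name is an assumption):

```python
# Sketch of the EACH INTERVAL '7' DAY boundary behavior (illustrative, not
# Teradata SQL): the last range is shorter than seven days whenever the
# inclusive day span is not a multiple of seven.

import math
import datetime

def seven_day_ranges(first, last):
    days = (last - first).days + 1          # inclusive span in days
    count = math.ceil(days / 7)
    last_range_days = days - (count - 1) * 7
    return count, last_range_days
```

For instance, 2007-01-07 and 2007-01-28 both fall on a Sunday, giving a 22-day span: four ranges, the last only one day long.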
The RANGE_N NO RANGE clause is comparable to the CASE_N NO CASE clause, and the UNKNOWN clause has the same meaning for both functions.
As with the CASE_N partitioning expression, the Optimizer knows how many partitions are defined when RANGE_N is used as the partitioning expression.
The following example revisits the division_number example, this time using a RANGE_N function:
CREATE TABLE … PRIMARY INDEX (invoice_number) PARTITION BY RANGE_N (division_number BETWEEN 1 AND 4 EACH 1);
This alternate partitioning expression allows you to specify only four partitions instead of 65,535. The Optimizer might be able to choose certain join plans and more accurately cost those plans when the maximum number of partitions is both known and small, making this a better choice than using the column directly. This RANGE_N partitioning expression also prevents rows with non-valid division numbers from being inserted into the table.
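The behavior of this four-partition expression can be sketched in Python (illustrative only, not Teradata SQL; the function name is an assumption):

```python
# Sketch of RANGE_N(division_number BETWEEN 1 AND 4 EACH 1) (illustrative,
# not Teradata SQL): with four single-value ranges the partition number
# equals the division number, and because no NO RANGE or UNKNOWN partition
# is defined, any other value (or NULL, modeled as None) is rejected.

def range_n_partition(division_number):
    if division_number is None or not (1 <= division_number <= 4):
        raise ValueError("row rejected: no matching range")
    return division_number
```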
Considerations for Basing the Partitioning Expression on a RANGE_N Function
Using the RANGE_N function to build a partitioning expression offers the following advantages:
- It defines an efficient mapping of ranges from an integer (BYTEINT, SMALLINT, INTEGER, BIGINT), character (CHARACTER, GRAPHIC, VARCHAR, VARCHAR(n) CHARACTER SET GRAPHIC), DATE, or TIMESTAMP test value to INTEGER partition numbers.
- It provides more opportunities than other expressions to optimize queries.
- The Optimizer knows the number of defined partitions when you specify RANGE_N to define the partitioning expression.
For other partitioning expressions, the Optimizer generally assumes a total of 65,535 partitions when statistics have not been collected, which could easily be far more than the number of populated partitions. However, collecting statistics on PARTITION can provide information about which partitions are empty.
- It permits faster partitioning changes than any other expression using the following ALTER TABLE options.
- ADD RANGE
- DROP RANGE
You can optimize the effects of using RANGE_N in your partitioning expression by observing the following guidelines:
- Reference only a single integer (BYTEINT, SMALLINT, INTEGER, BIGINT), character (CHARACTER, GRAPHIC, VARCHAR, VARCHAR(n) CHARACTER SET GRAPHIC), DATE, or TIMESTAMP column, not expressions.
For example, specifying a simple expression such as x/10 in place of a column name in the RANGE_N specification, even if the expression references only a single column, can hinder, or even prevent, row partition elimination.
- For equal-sized ranges, always specify an EACH clause.
Note the following collateral facts about equal- and unequal-sized partitions:
- The performance of unequal-sized partitions varies depending on which partitions are accessed.
- Unequal size ranges can prevent fast partitioning changes from being made using the ALTER TABLE statement (see SQL Data Definition Language).
- Using the NO RANGE, NO RANGE OR UNKNOWN, or UNKNOWN specifications for a range can negatively affect later ALTER TABLE partitioning strategies.
Do not use these clauses unless you have specific reasons for doing so (see SQL Functions, Operators, Expressions, and Predicates for details).
Reasons not to use the NO RANGE, NO RANGE OR UNKNOWN, and UNKNOWN clauses include the following.
- The maintenance and use of partitioned tables is simpler if you do not use these options, and avoiding their use also prevents bad data from being inserted into the table.
- If these partitions are not eliminated by row partition elimination, they can cause negative performance impacts if they contain a large number of rows.
- If a partitioning column is NOT NULL such that test values can never be null, do not specify UNKNOWN or NO RANGE OR UNKNOWN.
- If the specified ranges cover all possible values, do not specify NO RANGE or NO RANGE OR UNKNOWN.
Basing the Partitioning Expression on a RANGE_N Character Column
The RANGE_N function provides a simplified method for mapping a character value into one of a list of specified ranges, and then returning the number of that range. A range is defined by its starting and ending boundaries, inclusively.
If you do not specify an ending boundary, the range is defined by its starting boundary, inclusively, up to, but not including, the starting boundary for the next range.
The EACH clause is not supported for character test values, and Teradata Database returns an error if you specify an EACH clause in a RANGE_N function with a character test value. Each range specified for a character test value maps to exactly one partition. Therefore, the number of partitions that can be specified is somewhat limited because of table header limitations and because of the increase in CPU usage needed to handle a large number of ranges.
You can specify an asterisk for the first starting boundary to indicate the lowest possible value, and you can specify an asterisk for the last ending boundary to indicate the highest possible value.
As with numeric RANGE_N expressions, options are provided to handle cases when the value does not map into one of the specified ranges or evaluates to UNKNOWN because of a null result, making it impossible to determine into which range the value would map.
The following rules apply to the use of the RANGE_N function for character data in addition to the rules that exist for numeric data:
- You can specify only one test value, and it must result in an integer (BYTEINT, INTEGER, SMALLINT, BIGINT), character (CHARACTER, GRAPHIC, VARCHAR, VARCHAR(n) CHARACTER SET GRAPHIC), DATE, or TIMESTAMP data type.
- A RANGE_N partitioning expression can specify the UPPERCASE qualifier and the following functions.
- VARCHAR(n) CHARACTER SET GRAPHIC
- Teradata Database aborts the request and returns an error if any of the specified ranges are defined with null boundaries, are not increasing, or overlap. Increasing order is determined using the session collation and the case sensitivity specification for the test value at the time the table is created.
Using CASE_N and RANGE_N in SELECT Requests
You can also use the CASE_N and RANGE_N functions in a SELECT request, for example, to help determine the distribution of rows among partitions for a proposed partitioning expression. Before deciding to partition on division_number, you might want to check the resulting distribution with a request like the following.
SELECT RANGE_N(division_number BETWEEN 1 AND 4 EACH 1) AS p, COUNT(*) AS c FROM sales GROUP BY p ORDER BY p;
Another use of RANGE_N is to determine the number of ranges defined, for example, as follows.
SELECT RANGE_N(DATE '2007-12-31' BETWEEN DATE '2001-01-01' AND DATE '2007-12-31' EACH INTERVAL '30' DAY);
This query returns the value 86 because the last range is less than 30 days.
The following query over the same data returns the value 84 because the number of days per range varies between 28 and 31, depending on the month and year.
SELECT RANGE_N(DATE '2007-12-31' BETWEEN DATE '2001-01-01' AND DATE '2007-12-31' EACH INTERVAL '1' MONTH);
The final example returns the value 13 because the last range is only one day in length.
SELECT RANGE_N(DATE '2002-01-01' BETWEEN DATE '2001-01-01' AND DATE '2002-01-01' EACH INTERVAL '1' MONTH);
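The three range counts above can be reproduced with a short Python sketch (illustrative only, not Teradata SQL; the function names are assumptions): for day-based intervals the count is the inclusive day span divided by the interval length, rounded up; for month-based intervals it is the number of range start dates that fall on or before the ending boundary.

```python
# Sketch reproducing the range counts returned by the queries above
# (illustrative, not Teradata SQL).

import math
import datetime

def day_ranges(first, last, each_days):
    return math.ceil(((last - first).days + 1) / each_days)

def month_ranges(first, last):
    # assumes first falls on the first day of a month, as in the examples
    count, year, month = 0, first.year, first.month
    while datetime.date(year, month, 1) <= last:
        count += 1
        month += 1
        if month > 12:
            month, year = 1, year + 1
    return count
```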
Workload Characteristics, Queries, and Row Partition Elimination
Row partition elimination is most effective in the following situations. You should always verify (using the EXPLAIN request modifier) that you are getting the desired results for any plans.
- Row partition elimination is most effective with constant conditions on the partitioning columns.
- When a row partitioning expression is written using something other than the RANGE_N function or a single column, row partition elimination is most effective when you specify constant equality conditions.
- Row partition elimination can also be effective with equality conditions on USING variables if the conditions specify a single partition.
- Row partition elimination occurs for CURRENT_DATE and DATE built-in functions for inequality conditions. This does not prevent the request from being cached.
- Row partition elimination might occur for other built-in functions and USING variables in inequality conditions, and if it does, the action prevents the system from caching the request.
- Multiple ORed equality conditions on the same row partitioning column do not invoke partition elimination.
As an alternative, you should try either to use the UNION operator on two SELECT requests or to substitute constants for the USING variables in any inequality conditions.
- Use simple comparison of a partitioning column to a constant, built-in function, or USING variable expression for your query conditions.
- d = 10
- d >= 10 AND d <= 12
- d BETWEEN 10 AND 12
- d = 10+1
- d IN (20, 22, 24)
- d = 20 OR d = 21
- d = :u
- d BETWEEN CURRENT_DATE-7 AND CURRENT_DATE-1
- Avoid specifying query conditions with expressions or functions constructed on the row partitioning column. For example, use the form in Example 2 rather than the form in Example 1, and the form in Example 4 rather than the form in Example 3.
Example 1: The predicate in this query is based on an expression constructed on the row partitioning column x:
CREATE TABLE ... PARTITION BY x; SELECT ... WHERE x+1 IN (2,3);
Example 2: The predicate in this query is based only on the value of the row partitioning column x:
CREATE TABLE ... PARTITION BY RANGE_N(x BETWEEN 1 AND 65533 EACH 1);
SELECT ... WHERE x IN (1,2);
Example 3: The predicate in this query is based only on the value of row partitioning column x even though the table is partitioned by the expression x + 1:
CREATE TABLE ... PARTITION BY x+1; SELECT ... WHERE x IN (1,2);
Example 4: The predicate in this query is based only on the value of row partitioning column x:
CREATE TABLE ... PARTITION BY RANGE_N(x BETWEEN 0 AND 65532 EACH 1);
SELECT ... WHERE x IN (1,2);
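The difference between the simple-predicate forms and the expression form can be sketched in Python (illustrative only, not Teradata SQL; the function name is an assumption). For RANGE_N(x BETWEEN 0 AND 65532 EACH 1), a value v falls in partition v - 0 + 1, so each constant in "x IN (1, 2)" maps directly to one partition; a predicate such as "x + 1 IN (2, 3)" would first require inverting the expression, which the Optimizer does not do, so all partitions must be scanned.

```python
# Sketch of mapping IN-list constants to partition numbers under
# RANGE_N(x BETWEEN lo AND hi EACH 1) (illustrative, not Teradata SQL):
# value v falls in partition v - lo + 1; constants outside the overall
# range map to no partition at all.

def partitions_for_in_list(constants, lo=0, hi=65532):
    return sorted({v - lo + 1 for v in constants if lo <= v <= hi})
```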
Workload Characteristics and Row Partitioning
Consider using row partitioning when your workloads have any of the following characteristics:
- Many of the queries in workloads that access the table have a range constraint, particularly a date constraint, on some column of the table.
- Queries have an equality constraint on some column of the table, and that column is either not the only primary index column or it is not a primary index column at all.
- There is a primary index that is used only, or principally, to achieve an even distribution of rows, but not usually for accessing or joining rows, and access is frequently made on a column that is suitable for partitioning.
- There is a primary index that is used to achieve an even distribution of rows as well as for accessing or joining rows, and columns suitable for partitioning are included in the primary index definition.
- There is a primary index that is used to achieve an even distribution of rows as well as for accessing or joining rows, but columns suitable for partitioning are not included in the primary index definition.
This might be a good candidate for partitioning, but you need to pay particular attention to weighing the performance tradeoffs that often result in this situation.
- Use the RANGE_N function for a partitioning expression, preferably on a column with a DATE or TIMESTAMP data type, because it generally provides more opportunities for row partition elimination, and the Optimizer knows the exact number of defined row partitions.
For instance, instead of the following partitioning expression:
PARTITION BY column
use a RANGE_N-based expression:
PARTITION BY RANGE_N(column BETWEEN m AND n EACH s)
Dates and timestamps are often used in query conditions and therefore make good candidates for a partitioning expression.
When partitioning on a DATE column, use RANGE_N with a single overall range divided into ranges of equal size as follows.
PARTITION BY RANGE_N(date_column BETWEEN DATE '...' AND DATE '...' EACH INTERVAL 's' t)
- Use DATE constants or TIMESTAMP constants such as DATE '2011-08-06' or TIMESTAMP '2011-08-25 10:14:59' to specify the ranges in a partitioning expression. This is not only easier to read, making it clear that the expression is defined on a date, but it also removes the dependence on the FORMAT used in the implicit conversion to a date. You can also specify TIMESTAMP(n) WITH TIME ZONE constants for RANGE_N partitioning.
For example, you can specify a RANGE_N-based partitioning expression like the following.
RANGE_N(ts BETWEEN TIMESTAMP '2003-01-01 00:00:00+13:00' AND TIMESTAMP '2009-12-31 23:59:59-12:59' EACH INTERVAL '1' MONTH)
Use an INTERVAL constant in the EACH clause where the variable t is DAY, MONTH, YEAR, or YEAR TO MONTH.
Do not use INTEGER values or CHARACTER constants for dates in your partitioning expressions, because they can easily be specified incorrectly such that they do not convert to the date you expected.
For example, it might seem intuitive to simply partition by the name of a DATE column as follows:
PARTITION BY sales_date
This form does not produce a syntax error, and it follows the rule of implicit conversion to get an INTEGER partition number, but the result is a valid partition number only for dates in the early 1900s, so such a table is not generally useful.
For this case, you should instead use RANGE_N with a granularity of EACH INTERVAL '1' DAY.
It might also seem intuitive to specify something like the following to indicate that a date column in the primary index is to be partitioned by week:
PARTITION BY 7
However, this form does produce a syntax error.
For this case, use RANGE_N with a granularity of EACH INTERVAL '7' DAY.
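The sales_date pitfall above can be checked numerically. As a minimal Python sketch (illustrative; it assumes Teradata's internal DATE encoding of (year - 1900) * 10000 + month * 100 + day), only dates from 1900 through 1906 produce a value in the valid partition-number range 1 through 65,535:

```python
# Sketch of why PARTITION BY sales_date "works" only for early-1900s dates
# (illustrative; assumes Teradata's internal DATE encoding of
# (year - 1900) * 10000 + month * 100 + day).

def date_as_partition(year, month, day):
    return (year - 1900) * 10000 + month * 100 + day
```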
- Consider specifying only as many date ranges as are currently needed plus a few additional ranges for the future.
By limiting ranges to those that are currently needed, you help the Optimizer to better cost plans and also allow for more efficient primary index access, joins, and aggregations when the partitioning column is not included in the primary index.
This is not as important if you collect current PARTITION statistics.
A good guideline is to define 10% or fewer of the partitions to be empty to handle future dates. You should also define enough future ranges to minimize the frequency of the ALTER TABLE statements needed to drop and add ranges. However, if you make changes too infrequently, you might forget to alter the table at all.
Altering the table only once a year is probably not a good idea for the following reasons:
- Because you must create too many empty partitions.
- Because it is too easy to forget to alter the partitioning ranges if you do not do it fairly regularly.
- Because you fail to follow the procedure often enough to prevent problems from occurring when you finally get around to following it.
You must balance these concerns of having enough future partitions, but not too many.
- RANGE_N permits a faster partitioning change using ALTER TABLE … DROP RANGE or … ADD RANGE.
- Reference a single INTEGER or DATE column in the RANGE_N function.
Do not use an expression such as x/10 in place of a simple column reference in a partitioning expression constructed from a RANGE_N function even if the expression only references a single column.
- Define ranges of equal size using EACH to specify the granularity of the partition.
Multiple ranges, with or without an EACH granularity, require more CPU time to evaluate, and different-sized ranges can prevent fast partitioning changes.
- Consider not specifying the NO RANGE, NO RANGE OR UNKNOWN, or UNKNOWN partitions in the RANGE_N function.
Using these partitions can degrade query performance because queries can be forced to scan data in these partitions unnecessarily. If you specify the NO RANGE, NO RANGE OR UNKNOWN, or UNKNOWN partitions, you can also negatively affect the performance of an ALTER TABLE partitioning change because data might need to be moved among the partitions. An UNKNOWN partition is not needed if the partitioning cannot produce rows with unknown partition values (this is often the case when a partitioning column is specified to be NOT NULL).
- Deciding not to use a RANGE_N function for a partitioning expression can be a good choice in the following instances:
- You do not partition the table or join index on a DATE column.
- You use constant or USING variable equality conditions on the partitioning columns in a majority of the queries in your workloads to specify a single partition.
If the Optimizer's default assumption that the table (not based on a CASE_N partitioning expression) has 65,535 partitions still provides good plans in these cases, you do not need to alter the partitioning, and the system produces the plans and data maintenance performance your site requires.
Workload Characteristics and Partitioning
Consider the following points when you define the partitioning for a table or join index:
- At minimum, you must choose a primary index (or specify no primary index) to achieve an even distribution of rows to the AMPs.
- When appropriate, the primary index column set ought to be constructed from columns that are often constrained by equality conditions in queries in order to provide fast access, joins, and aggregations on those columns.
If the partitioning column set is included in the primary index column set, some of these concerns do not apply. However, adding the partitioning columns to the primary index solely to avoid these concerns is usually not worthwhile, because there is little, if any, benefit in doing so.
- If the partitioning columns are included in the set of primary index columns, then you can specify the primary index to be UNIQUE.
Adding a partitioning column as a primary index column just to make the primary index unique is not effective, however, because it is the original set of primary index columns that needs to be unique and possibly used for access, joins, and aggregations.
- If the partitioning columns are also included in the set of primary index columns, it might be a good choice to define many combined partitions for primary index access and joins. Be aware that plan costing can be affected if there are too many empty combined partitions if PARTITION statistics are not collected. Other factors can also lead to having fewer combined partitions.
- If not all of the partitioning columns are included in the primary index column set, the primary index cannot be defined as a UPI.
If the primary index columns must be unique, define a USI on them.
Because MultiLoad and FastLoad do not support USIs, you must use another load strategy such as any of the following:
- Loading the rows using Teradata Parallel Data Pump (see Teradata Parallel Data Pump Reference).
- Loading the rows into a staging table followed by an INSERT … SELECT or MERGE into the target table using error logging (see “CREATE ERROR TABLE” in SQL Data Definition Language and “INSERT … SELECT” and “MERGE” in SQL Data Manipulation Language).
- Dropping the USI, loading the rows using MultiLoad or FastLoad, and then recreating the USI (see Teradata FastLoad Reference and Teradata MultiLoad Reference).
- Including the partitioning columns in the PI, noting the previously documented problems with this approach.
You must evaluate the tradeoffs among these choices carefully.
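As an illustration of these points, the following sketch defines a hypothetical sales table whose partitioning column, sale_date, is not part of the primary index, so the primary index cannot be a UPI. A USI enforces uniqueness instead, and rows are loaded through a staging table with error logging. All object names, columns, and date ranges here are illustrative, not prescribed:

CREATE TABLE sales (
    invoice_number INTEGER NOT NULL,
    sale_date      DATE    NOT NULL,
    amount         DECIMAL(10,2))
PRIMARY INDEX (invoice_number)
PARTITION BY RANGE_N(sale_date BETWEEN DATE '2010-01-01'
    AND DATE '2012-12-31' EACH INTERVAL '1' MONTH);

-- Enforce uniqueness with a USI because the PI cannot be a UPI.
CREATE UNIQUE INDEX (invoice_number) ON sales;

-- Staging strategy: FastLoad into sales_staging (same definition,
-- no USI), then move rows into the target with error logging.
CREATE ERROR TABLE FOR sales;
INSERT INTO sales
SELECT * FROM sales_staging
LOGGING ERRORS;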
- If the primary index columns need to be an efficient access path and there are many combined partitions, consider one of these options:
- Defining a USI on the primary index columns to improve access time.
- Defining a NUSI on the primary index columns to improve access time.
- Creating a join index to cover queries made against the table.
- Creating a hash index to cover queries made against the table.
- Consider defining fewer combined partitions when a table is also accessed or joined on the primary index.
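As a sketch of the join index option above, a single-table join index whose primary index is the original primary index column set can restore efficient direct access to a partitioned table. The table, column, and index names here are hypothetical:

CREATE JOIN INDEX sales_ji AS
SELECT invoice_number, sale_date, amount
FROM sales
PRIMARY INDEX (invoice_number);

A join index of this form can cover queries that access the table only on invoice_number, at the cost of the additional maintenance the index requires.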
Workload Characteristics and Joins
Consider the following points when you write join queries against a partitioned table:
- Specify equijoins on the primary index and partitioning column sets, if possible, to encourage the Optimizer to choose efficient rowkey-based joins (see SQL Request and Transaction Processing).
Consider including the partitioning column of the partitioned table in the primary index of the nonpartitioned table so that you can join on the partitioning column. This means that, depending on the situation, you might want to denormalize the physical schema to enhance the performance of joins between partitioned and nonpartitioned tables.
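For example, assuming two tables that share the primary index column invoice_number and the partitioning column sale_date (hypothetical names), the following query specifies equijoins on both, which makes a rowkey-based join possible:

SELECT s.invoice_number, r.return_reason
FROM sales s, sales_returns r
WHERE s.invoice_number = r.invoice_number
AND   s.sale_date      = r.sale_date;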
- If you specify an equijoin on the primary index column set, but not on the partitioning column set, the fewer combined partitions that exist after any row partition elimination, the better.
Otherwise, the table might need to be spooled and sorted.
The Optimizer can specify sliding-window joins when there are a small number of participating combined row partitions (see SQL Request and Transaction Processing).
- Use RANGE_N to define fewer partitions and specify conditions on the row partitioning columns to reduce the number of combined row partitions involved in the join by invoking partition elimination.
To ensure that the Optimizer creates good query plans for your partitioned tables, you should always collect PARTITION statistics and keep them current.
If you have not collected PARTITION statistics, the Optimizer does not know whether a combined row partition is empty, so it must assume, for the plan it generates, that all defined combined row partitions might contain rows; however, it might choose among several such plans based on the estimated number of populated combined row partitions.
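Collecting PARTITION statistics is inexpensive, because the system scans only the cylinder indexes rather than the data rows. For example, the following request (the table name is illustrative) collects statistics on the system-derived PARTITION column:

COLLECT STATISTICS COLUMN (PARTITION) ON sales;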
- Dynamic row partition elimination for a product join improves performance when a partitioned table and another table are equijoined on the row partitioning column of the partitioned table (see SQL Request and Transaction Processing).
Remember to collect statistics (see “COLLECT STATISTICS (Optimizer Form)” in SQL Data Definition Language) on all of the following.
- The primary indexes of both tables.
- The partitioning columns of the partitioned table.
- The column in the nonpartitioned table that is equated to the partitioning column of the partitioned table.
- The system-derived PARTITION column of all partitioned tables.
The recommended practice for recollecting statistics is to set appropriate thresholds for recollection using the THRESHOLD options of the COLLECT STATISTICS statement. See "COLLECT STATISTICS" in SQL Data Definition Language for details on how to do this.
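Putting these recommendations together for a hypothetical partitioned sales table joined to a nonpartitioned sales_calendar table, the requests might look like the following. The object names and the threshold value are illustrative only:

-- Primary index and partitioning column of the partitioned table.
COLLECT STATISTICS COLUMN (invoice_number) ON sales;
COLLECT STATISTICS COLUMN (sale_date) ON sales;

-- Column in the nonpartitioned table equated to the partitioning column.
COLLECT STATISTICS COLUMN (calendar_date) ON sales_calendar;

-- Recollect only after at least 10 percent of the rows have changed.
COLLECT STATISTICS USING THRESHOLD 10 PERCENT
    COLUMN (sale_date) ON sales;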
General Recommendations for Using Row-Partitioned Tables and Join Indexes
To take optimal advantage of row partitioning as it is used in Teradata Database for row-partitioned tables and join indexes, you must have a thorough understanding of partitioning expressions, the general notion of partitioning, and the specific attributes that partitioning brings to a table.
The placement of data on the AMPs and the use of row partition elimination by the system can significantly improve the performance of some queries, while at the same time degrading the performance of other queries. You must consider the impact of partitioning on data maintenance.
You must also be aware that partitioning increases the size of each row header in a partitioned table or index by either 2 or 8 bytes (partitioned table rows are 4 bytes wider if multivalue compression is specified for the table), and that partitioning increases the size of each secondary index row by 2 or 8 bytes for each referencing ROWID in the index.
Partitioning is a physical database design consideration, and like any other physical database design issue, it is more likely to work well if you have done a good logical database design first.
You should not focus on any one aspect of partitioning while creating the physical design for your databases. All of the following attributes must work well together to ensure the success of your partitioned tables and join indexes.
- The partitioning expression
- The queries made against the table
This includes both those queries designed specifically to access the partitioned table and the general class of all queries that are likely to access it.
- Performance, access methods, join strategies, row partition elimination
- Ease of altering the partitioning expression
- Effects of the row partitioning on data maintenance for the table
- Backup and restore operations on the table
A successful partitioning expression is one that takes advantage of row partition elimination, supports ease of partition altering with ALTER TABLE, and has no significant negative impact on data maintenance.
You should always experiment with your intended uses of partitioned tables, considering and analyzing performance tradeoffs between using partitioning or not, various partitioning strategies, and using partitioning along with, or instead of, other indexing such as secondary, hash, and join indexes.
Analyze the maintenance process choices and their performance. Note that additional maintenance is required if you define a USI to enforce uniqueness on a column set of a partitioned table.
Finally, you must always ensure that you are getting the results that you expect. Be sure to review EXPLAIN reports, looking for row partition elimination and rowkey-based joins. Defining a sophisticated partitioning expression is helpful only if the queries in your workloads are able to invoke row partition elimination. Be sure to measure performance for the query workload and for critical queries both before and after creating the partitioned table or join index. Never assume that partitioning will improve the performance of your maintenance workloads; verify it. And always weigh the costs against the benefits.