Row Partition Elimination
Row partition elimination is a method for enhancing query performance against row-partitioned tables by skipping row partitions that do not contain rows that meet the search conditions of a query. Row partition elimination is an automatic optimization in which the Optimizer (or, in the case of dynamic row partition elimination,the AMP software) determines, based on query conditions and a row partitioning expression, that some partitions for that partitioning expression cannot contain qualifying rows; therefore, those row partitions can be skipped during a file scan.
Partitions that are skipped for a particular query are called eliminated row partitions.When there are multiple partitioning expressions, Vantage combines row partition elimination at each of the levels to further reduce the number of data subsets that need to be scanned. For most applications,the greatest benefit of row partitioning is obtained from row partition elimination.
Basing Partitioning on Modulo Partitioning of a Numeric Column
This form uses a single column with a wide range of values, and maximizes the number of partitions. As an example, assume that a telephone company has a table with detailed information about each outgoing telephone call. One of the columns is the originating phone number, defined with the DECIMAL data type. A possible, and perhaps useful, partitioning expression can be defined as follows:
CREATE TABLE … PRIMARY INDEX (phone_number, call_start) PARTITION BY phone_number mod 65535 + 1;
This partitioning expression, assuming millions of subscribers, populates each of the 65,535 partitions. Some partitions might have more rows than others, because some customers make more phone calls than others, but the distribution among the partitions should be somewhat even. This partitioning expression can improve the performance of a query that examines all phone calls made from a particular number by orders of magnitude by scanning only one partition out of 65,535 instead of the entire table.
Some disadvantages of this form are that the partitioning cannot be altered unless the table is empty, a maximum on 65535 partitions can be defined, and row partition elimination for queries is usually limited to constant or USING value equality conditions on the partitioning column.
Basing the Partitioning Expression on Two or More Numeric Columns
This form uses arithmetic operations, typically multiplication and addition, on two or more numeric columns with suitably small value ranges. Assume a table with a three-digit product code and a two-digit store number. The store numbers count consecutively from 0, and there are fewer than 65 stores. This table can be partitioned as follows:
CREATE TABLE … PRIMARY INDEX (store_number, product_code, sales_date) PARTITION BY store_number * 1000 + product_code;
If many queries specify both store_number and product_code, this might be a useful partitioning expression. One downside is that it fails if the number of products grows to the point that a four-digit number is required, or if the number of stores expands beyond 64.
Note that the Optimizer assumes 65,535 partitions even though some might be empty. The table is not any larger because of the empty partitions, though the Optimizer default assumption that there are 65,535 partitions based on the specification might sometimes mislead it into making suboptimal plan choices.
Some disadvantages of this form are that the partitioning cannot be altered unless the table is empty and row partition elimination for queries is usually limited to constant or USING value equality conditions on both of the partitioning columns.
An alternative is to use multilevel partitioning, as demonstrated by the following CREATE TABLE request:
CREATE TABLE … PRIMARY INDEX (store_number, product_code, sales_date) PARTITION BY (RANGE_N(store_number BETWEEN 0 AND 64 EACH 1), RANGE_N(product_code BETWEEN 1 AND 999 EACH 1));
Basing the Partitioning Expression on a CASE_N Function
You can use the CASE_N function to concisely define a partitioning expression for which each partition contains data based on an associated condition. When you specify CASE_N for single-level partitioning expression, two partition numbers, NO CASE and UNKNOWN, are automatically reserved for specific uses.
The CASE_N function is patterned after the SQL CASE expression (see Teradata Vantage™ - SQL Functions, Expressions, and Predicates, B035-1145). The function returns an INTEGER value numbered from 1, indicating which CASE_N condition first evaluated to TRUE for the particular value. The returned value can map directly to a partition number or be further modified to calculate the partition number.
Assume a table has a total_revenue column, defined as DECIMAL. The table can be partitioned on that column, so that low revenue products are separated from high revenue products. The partitioning expression can be written as follows:
CREATE TABLE … PRIMARY INDEX (store_id, product_id, sales_date) PARTITION BY CASE_N (total_revenue < 10000, total_revenue < 100000, total_revenue < 1000000, NO CASE, UNKNOWN);
This request defines 5 partitions, conceptually numbered from 1 to 5 in the order they are specified in the partitioning expression.
|This partition number …||Represents …|
|1||products with total_revenue less than 10,000.|
|2||products with total_revenue of at least 10,000, but less than 100,000.|
|3||products with total_revenue of at least 100,000 but less than 1,000,000.|
|4 (NO CASE)||any value that does not evaluate to TRUE or UNKNOWN for any previous CASE_N condition, which in this case is total_revenue equal to or greater than 1,000,000.|
|5 (UNKNOWN)||values for which it is not possible to determine the
truth value of a previous CASE_N expression.
For this partitioning condition, a row with a null for total_revenue is assigned to the UNKNOWN partition, because by definition, it is not possible to evaluate whether a null is less than 10,000.
The rows in the NO CASE and UNKNOWN partitions are valid, and the system accesses those partitions to process queries unless the query conditions exclude them. By defining NO CASE and UNKNOWN partitions, you can ensure that any possible value maps to a partition. In the absence of those partitions, some values result in errors and so are not be inserted into the table. In practice, it is probably better to have more than three revenue ranges unless the queries against this table rarely specify narrower revenue ranges.
This example demonstrates that CASE_N can be used to define complicated partitioning expressions tailored to a specific table and specific query workloads:
CREATE TABLE … PRIMARY INDEX (col1, col2) PARTITION BY CASE_N (col3 < 6, col3 >= 8 AND col3 < 10 AND col4 <> 12, col5 <> 10 OR col3 = 20, NO CASE OR UNKNOWN);
Without knowing the meaning and data demographics of the columns, there is no way of knowing whether this partitioning expression is useful.
Note that the NO CASE and UNKNOWN partitions are combined into a single partition in this example.
Unlike the case for the previous partitioning expression examples, the Optimizer knows how many partitions are defined when CASE_N is used as the partitioning expression and does not have to assume a default number of 65,535 partitions.
- The partitioning of the table cannot be altered unless it is empty.
- Row partition elimination for queries is often limited to constant or USING value equality conditions on the partitioning columns.
- The Optimizer might not eliminate some row partitions that it possibly could if the partitioning were better conceived.
- As the number of conditions increases, evaluating a CASE_N function can be costly in terms of CPU cycles, and a CASE_N might also contribute to causing the table header to exceed its maximum size limit. Therefore, you may need to limit the number of conditions you define in the CASE_N function to a relatively small number.
Considerations for Basing a Partitioning Expression on a CASE_N Function
- The partitioning expression defines a mapping between conditions and INTEGER numbers.
- There are limited number of conditions (too many conditions can lead to excessive CPU usage or the table header exceeding its maximum size limit).
- Your query workloads against the table use equality conditions on the partitioning columns to specify a single partition.
- You have no need to alter the partitioning of the table.
- You get the plans and data maintenance you need.
Basing a Partitioning Expression on a RANGE_N Function
The RANGE_N function is provided to simplify the specification of common partitioning expressions where each partition contains a range of numeric data, and is especially useful when the column contains a date or timestamp. RANGE_N returns an INTEGER determined by the first range that includes the column value, numbered from 1, that can be mapped directly to a partition number or can be further modified to calculate the partition number. RANGE_N is commonly used to define partitioning expressions. When you use a RANGE_N function to define a single-level partitioning expression, two partition numbers, NO RANGE and UNKNOWN, are reserved for specific uses.
Assume a table with 7 years of order data, ranging from 2001 through 2007. The next partitioning expression creates 84 partitions, one for each month of the period covered by the data:
CREATE TABLE … PRIMARY INDEX (order_number) PARTITION BY RANGE_N (order_date BETWEEN DATE '2001-01-01' AND DATE '2007-12-31' EACH INTERVAL '1' MONTH);
Each partition contains roughly the same number of rows, assuming that order volume has stayed roughly constant across the seven year interval. Neither a NO RANGE nor an UNKNOWN partition is defined because this partitioning expression definition is for data that only has dates within the ranges specified in the partitioning expression.
It is frequently desirable to have each partition contain roughly the same number of rows, but it is not required. The next example puts the older orders into partitions with coarser granularity, and the newer orders into partitions with finer granularity:
CREATE TABLE … PRIMARY INDEX (order_number) PARTITION BY RANGE_N (order_date BETWEEN DATE '1994-01-01' AND DATE '1997-12-31' EACH INTERVAL '2' YEAR, DATE '1998-01-01' AND DATE '2000-12-31' EACH INTERVAL '1' YEAR, DATE '2001-01-01' AND DATE '2003-12-31' EACH INTERVAL '6' MONTH, DATE '2004-01-01' AND DATE '2007-12-31' EACH INTERVAL '1' MONTH, NO RANGE, UNKNOWN);
In this example, the more recent data is partitioned more finely than the older data. This can be a good strategy if you know that the older data is rarely accessed except as part of a full-table scan, because it reduces some potential disadvantages by defining a smaller number of partitions. However, maintaining this structure over extended epochs of time is not as simple as maintaining a structure in which each interval covers the same time duration. In this example, the years 2004 through 2007 are partitioned by month. As time passes, and 2004 data becomes older and less frequently referenced, it shall become necessary to repartition the table if the pattern of defining longer time intervals for older data is to be maintained.
It is both easy and fast to add and drop partitions from the ends of a partitioning definition, but repartitioning intervals in the middle partitions requires much more work, and it is usually faster to reload the data. In addition, when a range partition is dropped, any rows in that partition are moved from the dropped range partition to the NO RANGE partition or to an added range partition. Also, when a range is added, rows might need to be moved from the NO RANGE partition to the new range partition.
Some expansion room is allowed for future dates by specifying the final partition as extending to the end of 2007. It is easy to add and drop ranges to the end of a partitioning expression using the ALTER TABLE statement. While you could have specified an ending date far in the future, such as 2099-12-31, it is generally not desirable to define hundreds of partitions that shall not be used for decades.
The example shows intervals of years and months. It is also possible to partition by day (EACH INTERVAL '1' DAY) or by week (EACH INTERVAL '7' DAY). A seven-day interval can start on any day of the week, so if you want to start the weekly intervals on Sunday, for example, the beginning date should be chosen so that it falls on a Sunday. Also, the last range might be less than specified by the EACH clause. For example, suppose that with seven-day intervals the first date falls on Sunday, but the last date also falls on Sunday, in which case the last range spans only one day.
The RANGE_N NO RANGE clause is comparable to the CASE_N NO CASE clause, and the UNKNOWN clause has the same meaning for both functions.
As with the CASE_N partitioning expression, the Optimizer knows how many partitions are defined when RANGE_N is used as the partitioning expression.
Considerations for Basing the Partitioning Expression on a RANGE_N Function
- Defining an efficient mapping or ranges between integer (BYTEINT, SMALLINT, INTEGER, BIGINT), character (CHARACTER, GRAPHIC, VARCHAR, VARCHAR(n) CHARACTER SET GRAPHIC), DATE, or TIMESTAMP type and INTEGER numbers.
- Provides more opportunities than other expressions to optimize queries.
- The Optimizer knows the number of defined partitions when you specify RANGE_N to define the partitioning expression.
For partitioning expressions other than RANGE_N or CASE_N, the Optimizer generally assumes a total of 65,535 partitions when statistics have not been collected, which could easily be far more than the number of populated partitions. However, collecting statistics on PARTITION can provide information about which partitions are empty.
- Faster partitioning changes than any other expression using the following
ALTER TABLE options.
- ADD RANGE
- DROP RANGE
- Reference only a single integer (BYTEINT, SMALLINT, INTEGER, BIGINT),
character (CHARACTER, GRAPHIC, VARCHAR, VARCHAR(n) CHARACTER SET GRAPHIC), DATE, or TIMESTAMP
column, not expressions.
For example, specifying a simple expression such as x/10 in place of a column name in the RANGE_N specification, even if the expression references only a single column, can hinder, or even prevent, row partition elimination.
- For equal-sized ranges, always specify an EACH clause.
Note the following collateral facts about equal- and unequal-sized partitions:
- The performance of unequal-sized partitions varies depending on which partitions are accessed.
- Unequal size ranges can prevent fast partitioning changes from being made using the ALTER TABLE statement (see Teradata Vantage™ - SQL Data Definition Language Syntax and Examples, B035-1144).
- Using the NO RANGE, NO RANGE OR UNKNOWN, or UNKNOWN specifications for a
range can negatively affect later ALTER TABLE partitioning strategies.
Do not use these clauses unless you have specific reasons for doing so (see Teradata Vantage™ - SQL Functions, Expressions, and Predicates, B035-1145 for details).Reasons not to use the NO RANGE, NO RANGE OR UNKNOWN, and UNKNOWN clauses include the following.
- The maintenance and use of partitioned tables is simpler if you do not use these options, and avoiding their use also prevents bad data from being inserted into the table.
- If these partitions are not eliminated by row partition elimination, they can cause negative performance impacts if they contain a large of number of rows.
- If a partitioning column is NOT NULL such that test values can never be null, do not specify UNKNOWN or NO RANGE OR UNKNOWN.
- If the specified ranges cover all possible values, do not specify NO RANGE or NO RANGE OR UNKNOWN.
Basing the Partitioning Expression on a RANGE_N Character Column
The RANGE_N function provides a simplified method for mapping a character value into one of a list of specified ranges, and then returning the number of that range. A range is defined by its starting and ending boundaries, inclusively.
If you do not specify an ending boundary, the range is defined by its starting boundary, inclusively, up to, but not including, the starting boundary for the next range.
The EACH clause is not supported for character test values, and the system returns an error if you specify an EACH clause in a RANGE_N function with a character test value. Each range specified for expressions character data types maps to exactly an integer range of only one value. Therefore, the number of partitions that can be specified is somewhat limited because of table header limitations and due to the increase in CPU usage to handle a large number of ranges.
You can specify an asterisk for the first starting boundary to indicate the lowest possible value, and you can specify an asterisk for the last ending boundary to indicate the highest possible value.
As with numeric RANGE_N expressions, options are provided to handle cases when the value does not map into one of the specified ranges or evaluates to UNKNOWN because of a null result, making it impossible to determine into which range the value would map.
- You can specify only one test value, and it must result in an integer (BYTEINT, INTEGER, SMALLINT, BIGINT), character (CHARACTER, GRAPHIC, VARCHAR, VARCHAR(n) CHARACTER SET GRAPHIC), DATE, or TIMESTAMP data type.
- A RANGE_N partitioning expression can specify the UPPERCASE qualifier and
the following functions.
- VARCHAR(n) CHARACTER SET GRAPHIC
- The database aborts the request and returns an error if any of the specified ranges are defined with null boundaries, are not increasing, or overlap. Increasing order is determined using the session collation and the case sensitivity specification for the test value at the time the table is created.
Using CASE_N and RANGE_N in SELECT Requests
You can also use the CASE_N and RANGE_N functions in a SELECT request. For example, you might use them to help determine the distribution of rows among partitions for a proposed partitioning expression. For example, before deciding to partition on division_number, you might want to check the resulting distribution with a request like the following.
SELECT RANGE_N(division_number BETWEEN 1 AND 4 EACH 1) AS p, COUNT(*) AS c, FROM sales GROUP BY p ORDER BY p;
Another use of RANGE_N is to determine the number of ranges defined, for example, as follows.
SELECT RANGE_N(DATE '2007-12-31' BETWEEN DATE '2001-01-01' AND DATE '2007-12-31' EACH INTERVAL '30' DAY);
This query returns the value 86 because the last range is less than 30 days.
The following query over the same data returns the value 84 because the number of days per range varies between 28 and 31, depending on the month and year.
SELECT RANGE_N(DATE '2007-12-31' BETWEEN DATE '2001-01-01' AND DATE '2007-12-31' EACH INTERVAL '1' MONTH);
The final example returns the value 13 because the last range is only one day in length.
SELECT RANGE_N(DATE '2002-01-01' BETWEEN DATE '2001-01-01' AND DATE '2002-01-01' EACH INTERVAL '1' MONTH);
Workload Characteristics, Queries, and Row Partition Elimination
- Row partition elimination is most effective with constant conditions on the partitioning columns.
- When a row partitioning expression is written using something other than the RANGE_N function or a single column, row partition elimination is most effective when you specify constant equality conditions.
- Row partition elimination can also be effective with equality conditions on USING variables if the conditions specify a single partition.
- Row partition elimination occurs for CURRENT_DATE and DATE built-in functions for inequality conditions. This does not prevent the request from being cached.
- Row partition elimination might occur for other built-in functions and USING variables in inequality conditions, and if it does, the action prevents the system from caching the request.
- Multiple ORed equality conditions on the same row partitioning column do
not invoke partition elimination.
As an alternative, you should try either to use the UNION operator on two SELECT requests or to substitute constants for the USING variables in any inequality conditions.
- Use simple comparison of a partitioning column to a constant, built-in
function, or USING variable expression for your query conditions. For example:
- d=10d >= 10 AND d <= 12
- d BETWEEN 10 AND 12d = 10+1
- d IN (20, 22, 24)d = 20 OR d=21
- d = :udd
- d BETWEEN CURRENT_DATE-7 and CURRENT_DATE-1
- Avoid specifying query conditions with expressions or functions constructed on the row partitioning column. For example, use the form in Example 2 rather than the form in Example 1, and the form in Example 4 rather than the form in Example 3.
The predicate in this query is based on an expression constructed on the row partitioning column x:
CREATE TABLE ... PARTITION BY x; SELECT ... WHERE x+1 IN (2,3);
The predicate in this query is based only on the value of the row partitioning column x:
CREATE TABLE ... PARTITION BY RANGE_N(x BETWEEN 1 AND 65533 EACH 1);
SELECT ... WHERE x IN (1,2);
The predicate in this query is based only on the value of row partitioning column x even though the table is partitioned by the expression x + 1:
CREATE TABLE ... PARTITION BY x+1; SELECT ... WHERE x IN (1,2);
The predicate in this query is based only on the value of row partitioning column x:
CREATE TABLE ... PARTITION BY RANGE_N(x BETWEEN 0 AND 65532 EACH 1);
SELECT ... WHERE x IN (1,2);
Workload Characteristics and Row Partitioning
- The number of queries in workloads that access the table has a range constraint, particularly a date constraint on some column of the table.
- Queries have an equality constraint on some column of the table, and that column is either not the only primary index column or it is not a primary index column at all.
- If there is a primary index that is used only, or principally, to achieve an even distribution of rows, but not usually for accessing or joining rows, and access is frequently made on a column that is suitable for partitioning.
- If there is a primary index that is used to achieve an even distribution of rows as well as for accessing or joining rows, and columns suitable for partitioning are included in the primary index definition.
- If there is a primary index that is used to achieve an even distribution
of rows as well as for accessing or joining rows, but columns suitable for partitioning are
not included in the primary index definition.
This might be a good candidate for partitioning, but you need to pay particular attention to weighing the performance tradeoffs that often result in this situation.
- Use the RANGE_N function for a partitioning expression, preferably on a
column with a DATE or TIMESTAMP data type, because it generally provides more opportunities
for row partition elimination, and the Optimizer knows the exact number of defined row
For instance, instead of the following row partitioning expression.
PARTITION BY column
PARTITION BY RANGE_N(column BETWEEN m AND n EACH s)
Dates and timestamps are often used in query conditions and therefore makes good candidates for a partitioning expression.
When partitioning on a DATE column, use RANGE_N with a single overall range divided into ranges of equal size as follows.
PARTITION BY RANGE_N(date_column BETWEEN DATE '...' AND DATE '...' EACH INTERVAL 's' t)
- Use DATE constants or TIMESTAMP constants such as DATE '2011-08-06' or TIMESTAMP '2011-08-25
10:14:59' to specify the ranges in a partitioning expression. This is not only
easier to read, making it clear that the expression is defined on a date, but it also removes
the dependence on the FORMAT used in the implicit conversion to a date. You can also specify
TIMESTAMP(n) WITH TIME ZONE constants for RANGE_N
For example, you can specify a RANGE_N-based partitioning expression like the following.
RANGE_N(ts BETWEEN TIMESTAMP '2003-01-01 00:00:00+13:00' AND TIMESTAMP '2009-12-31 23:59:59-12:59' EACH INTERVAL '1' MONTH)
Use an INTERVAL constant in the EACH clause where the variable t is DAY, MONTH, YEAR, or YEAR TO MONTH.
Do not use INTEGER values or CHARACTER constants for dates in your partitioning expressions, since they can easily be incorrectly specified such that they do convert to the date you expected.
For example, it might seem intuitive to simply partition by the name of a DATE column as follows:
PARTITION BY sales_date
This form does not produce a syntax error. In fact, it works correctly only for dates in the early 1900s and follows the rule of implicit conversion to get an INTEGER partition number, but such a table is not generally useful.
For this case, you should instead use RANGE_N with a granularity of EACH INTERVAL '1' DAY.
It might also seem intuitive to specify something like the following to indicate that a date column in the primary index is to be partitioned by week:
PARTITION BY 7
However, this form does produce a syntax error.
For this case, use RANGE_N with a granularity of EACH INTERVAL '7' DAY.
- Consider specifying only as many date ranges as are currently needed plus
a few additional ranges for the future.
By limiting ranges to those that are currently needed, you help the Optimizer to better cost plans and also allow for more efficient primary index access, joins, and aggregations when the partitioning column is not included in the primary index.
This is not as important if you collect current PARTITION statistics.
A good guideline is to define 10% or fewer of the partitions to be empty to be able to handle future dates. You should also define enough future ranges to minimize the frequency of ALTER TABLE statements needed to drop and add ranges. However, if you make changes too infrequently, you might forget to alter the table entirely.Altering the table only once a year is probably not a good idea for the following reasons:
- Because you must create too many empty partitions.
- Because it is too easy to forget to alter the partitioning ranges if you do not do it fairly regularly.
- Because you fail to follow the procedure often enough to prevent
problems from occurring when you finally get around to following it.
You must balance these concerns of having enough future partitions, but not too many.
- RANGE_N permits a faster partitioning change using ALTER TABLE … DROP RANGES or ALTER TABLE ... ADD RANGES.
- Reference a single INTEGER or DATE column in the RANGE_N function.
Do not use an expression such as x/10 in place of a simple column reference in a partitioning expression constructed from a RANGE_N function even if the expression only references a single column.
- Define ranges of equal size using EACH to specify the granularity of the
Multiple ranges, with or without specifying an EACH granularity, require more CPU time to execute and different sized ranges can prevent fast partitioning changes.
- Consider not specifying the NO RANGE, NO RANGE OR UNKNOWN, or UNKNOWN
partitions in the RANGE_N function.
Using these partitions can degrade query performance because queries can be forced to scan data in these partitions unnecessarily. If you specify the NO RANGE, NO RANGE OR UNKNOWN, or UNKNOWN partitions, you can also affect the performance of ALTER TABLE partitioning change negatively because data might need to be moved among the partitions. An UNKNOWN partition is not needed if the partitioning cannot produce rows with unknown partition values (this is often the case when a partitioning column is specified to be NOT NULL).
- Deciding not to use a RANGE_N function for a partitioning expression can
be a good choice in the following instances:
- You do not partition the table or join index on a DATE column.
- You use constant or USING variable equality conditions on the partitioning columns in a majority of the queries in your workloads to specify a single partition.
If the assumption made by the Optimizer that the table (not based on a CASE_N partitioning expression) has 65,535 partitions provides good plans in these cases, you do not need to alter partitioning, and the system produces the plans and data maintenance performance your site requires.
Workload Characteristics and Partitioning
- You must choose a primary index, primary AMP index, or no primary index to achieve an even distribution of rows to the AMPs.
- When appropriate, the primary index or primary AMP index column set ought to be constructed from columns that are often constrained by equality conditions in queries in order to provide fast access, joins, and aggregations on those columns.
If the partitioning column set is included in the primary index column set, some concerns are not an issue. However, you would not add the partitioning columns to the primary index to avoid these concerns anyway, because there is usually little, if any, benefit in doing so.
- If the partitioning columns are included in the set of primary index
columns, then you can specify the primary index to be UNIQUE.
Adding a partitioning column as a primary index column just to make the primary index unique is not effective, however, because it is the original set of primary index columns that needs to be unique and possibly used for access, joins, and aggregations.
- If the partitioning columns are also included in the set of primary index columns, it might be a good choice to define many combined partitions for primary index access and joins. Be aware that plan costing can be affected if there are too many empty combined partitions if PARTITION statistics are not collected. Other factors can also lead to having fewer combined partitions.
- If all of the partitioning columns are not included in the primary index column set, the primary index cannot be defined as a UPI.
If the primary index or primary AMP index columns must be unique, define a USI on them.Because MultiLoad and FastLoad do not support USIs, you must use another load strategy such as any of the following:
- Loading the rows using Teradata Parallel Data Pump (see Teradata® Parallel Data Pump Reference, B035-3021).
- Loading the rows into a staging table followed by an INSERT ... SELECT or MERGE into the target table using error logging (see the information about CREATE ERROR TABLE in Teradata Vantage™ - SQL Data Definition Language Syntax and Examples, B035-1144 and the information about INSERT ... SELECT and MERGE in Teradata Vantage™ - SQL Data Manipulation Language, B035-1146).
- Dropping the USI, loading the rows using MultiLoad or FastLoad, and then recreating the USI (see Teradata® FastLoad Reference, B035-2411 and Teradata® MultiLoad Reference, B035-2409).
- Including the partitioning columns in the PI, noting the previously documented problems with this approach.
You must evaluate the tradeoffs among these choices carefully.
- If the primary index or primary AMP index columns need to be an efficient access path and there are many combined partitions, consider one of these options:
- Defining a USI on the primary index columns to improve access time.
- Defining a NUSI on the primary index columns to improve access time.
- Creating a join index to cover queries made against the table.
- Creating a hash index to cover queries made against the table.
- Consider defining fewer combined partitions when a table is also accessed or joined on the primary index.
Workload Characteristics and Joins
- Specify equijoins on the primary index and partitioning column sets, if
possible, in order to prejudice the Optimizer to use efficient RowKey-based joins (see Teradata Vantage™ - SQL Request and Transaction Processing, B035-1142).
Consider including the partitioning column in the nonpartitioned primary index table so you can join on the partition column. This means that, depending on the situation, you might want to consider denormalizing the physical schema to enhance the performance of partitioned table-to-nonpartitioned table joins.
- If you specify an equijoin on the primary index column set, but not on the
partitioning column set, the fewer combined partitions that exist after any row partition
elimination, the better.
Otherwise, the table might need to be spooled and sorted.
The Optimizer can specify sliding-window joins when there are a small number of participating combined row partitions (see Teradata Vantage™ - SQL Request and Transaction Processing, B035-1142).
- Use RANGE_N to define fewer partitions and specify conditions on the row
partitioning columns to reduce the number of combined row partitions involved in the join by
evoking partition elimination.
To ensure that the Optimizer creates good query plans for your partitioned tables, you should always collect PARTITION statistics and keep them current.
If you have not collected PARTITION statistics, the Optimizer does not know whether a combined row partition is empty or not, so it has to assume all defined combined row partitions might have rows with respect to the plan it generates; however, it might choose among several such plans based on the estimated number of populated combined row partitions.
- Dynamic row partition elimination for a product join improves performance when a partitioned table and another table are equijoined on the row partitioning column of the partitioned table (see Teradata Vantage™ - SQL Request and Transaction Processing, B035-1142).Remember to collect statistics (see the information about COLLECT STATISTICS (Optimizer Form) in Teradata Vantage™ - SQL Data Definition Language Syntax and Examples, B035-1144) on all of the following.
- The primary indexes of both tables.
- The partitioning columns of the partitioned table.
- The column in the nonpartitioned table that is equated to the partitioning column of the partitioned table.
- The system-derived PARTITION column of all partitioned tables.
The recommended practice for recollecting statistics is to set appropriate thresholds for recollection using the THRESHOLD options of the COLLECT STATISTICS statement. For details, see Teradata Vantage™ - SQL Data Definition Language Syntax and Examples, B035-1144.
General Recommendations for Using Row-Partitioned Tables and Join Indexes
To take optimal advantage of row-partitioning as it is used in Vantage for row-partitioned tables and join indexes, you must have a thorough understanding of partitioning expressions, the general notion of partitioning, and the specific attributes that partitioning brings to a table. The placement of data on the AMPs and the use of row partition elimination by the system can significantly improve the performance of some queries, while at the same time degrading the performance of other queries. You must consider the impact of partitioning on data maintenance. You must be aware that partitioning increases the size of each row header in a partitioned table or index by either 2 or 8 bytes (partitioned table rows are 4 bytes wider if multivalue compression is specified for the table) and that partitioning also increases the size of each secondary index row by 2 or 8 bytes for each referencing ROWID in the index.
Partitioning is a physical database design consideration, and like any other physical database design issue, it is more likely to work well if you have done a good logical database design first.
- The partitioning expression
This includes both those queries designed specifically to access the partitioned table and the general class of all queries that are likely to access it.
- Performance, access methods, join strategies, row partition elimination
- Ease of altering the partitioning expression
- Effects of the row partitioning on data maintenance for the table
- Backup and restore operations on the table
A successful partitioning expression is one that takes advantage of row partition elimination, supports ease of partition altering with ALTER TABLE, and has no significant negative impact on data maintenance.
You should always experiment with your intended uses of partitioned tables, considering and analyzing performance tradeoffs between using partitioning or not, various partitioning strategies, and using partitioning along with, or instead of, other indexing such as secondary, hash, and join indexes.
Analyze the maintenance process choices and their performance. Note that additional maintenance is required if you define a USI to enforce uniqueness on a column set of a partitioned table.
Finally, you must always ensure that you are getting the results that you expect. Be sure to review EXPLAIN reports, looking for row partition elimination and rowkey-based joins. Defining a sophisticated partitioning expression is helpful only if the queries in your workloads are able to invoke row partition elimination. Be sure to measure performance for the query workload and for critical queries both before and after creating the partitioned table or join index. Never assume that partitioning will improve the performance of your maintenance workloads, verify it. And always weigh the costs against the benefits.