15.00 - Effect of Skewed Data on Column-Partitioned Tables - Teradata Database

Teradata Database Design

prodname
Teradata Database
vrm_release
15.00
category
User Guide
featnum
B035-1094-015K

Effect of Skewed Data on Column-Partitioned Tables

For a primary‑indexed table, Teradata Database hashes each row based on the value of its primary index. The hash value determines to which AMP the row is dispatched and stored. However, for a NoPI table or column‑partitioned table, Teradata Database sends rows to the AMPs randomly, using single-row INSERT requests, array INSERT requests, and Teradata Parallel Data Pump requests.

This randomization typically is not a problem for data skewing, and the data nearly always balances its distribution of rows among the AMPs. However, there are several operations on NoPI tables and column‑partitioned tables that can lead to data skewing.

  • INSERT … SELECT requests into a NoPI target table or column‑partitioned target table
  • When the target table of an INSERT … SELECT is a column‑partitioned/NoPI table, data coming from the source table, whether it is directly from the source table or from an intermediate spool, is locally inserted into the target column‑partitioned/NoPI table. Performance wise, this is very efficient since it avoids a redistribution and sort. But if the source table or the resulting spool is skewed, the target column‑partitioned/NoPI table is also skewed; in this case, a HASH BY clause can be used to redistribute the data from the source before the local copy.

    For the expressions to hash on, consider expressions that provide good distribution and, if appropriate, improve the effectiveness of autocompression for the insertion of rows back into the column‑partitioned table. Alternatively, use a HASH BY RANDOM clause for a good distribution if there is not a clear choice for the expressions to hash on and for more efficient redistribution.

    When inserting into a column‑partitioned table, also consider use of an LOCAL ORDER BY clause to improve the effectiveness of autocompression.

  • Down AMP Recovery Processing
  • When an AMP in a cluster goes down, Teradata Database reroutes the data dispatched to the down AMP for insertion to a fallback AMP instead.

    When the down AMP comes back online, recovery begins and copies the data from the fallback AMPs for the down AMP that it missed while it was down.The process synchronizes data on all of the AMPs before you can open the table for more work.

    If a cluster has more than 2 AMPs, there are multiple fallback AMPs for each AMP in the cluster.

    For a NoPI or column‑partitioned table, Teradata Database sends the data randomly to the AMPs for insertion with a randomly‑generated hash value. This hash value is not the hash value that is stored in the row. When an AMP is down in a cluster, multiple fallback AMPs can receive data on behalf of the down AMP. Because there is normally one hash bucket on an AMP that is used for all of the rows in a NoPI or column‑partitioned table, when the down AMP comes back online, the rows from the fallback AMPs have the same hash bucket and would create a skewing problem if they were sent to the previously down AMP as part of the recovery.

    Instead, there is only one fallback AMP for each AMP in a cluster for NoPI and column‑partitioned tables.

    Multiple AMPs can still receive rows on behalf of a down AMP, but only one AMP that puts the data of the down AMP into a fallback data subtable, and that is the data goes to the down AMP when it comes back up.

    Other AMPs put the data in their primary data subtable. This can be a problem when there are more than 2 AMPs in a cluster and there is a down AMP while a lot of data is inserted into a nonpartitioned or column‑partitioned table, and the data in that NoPI or column‑partitioned table can be skewed when the down AMP comes back online. 2‑AMP clusters do not have this problem.

    If a column‑partitioned or NoPI table becomes skewed and that table is populated from other source tables, one option to handle the problem is to delete all the rows in the table and then repopulate the table from its source tables using an INSERT … SELECT request with a HASH BY clause as needed to provide good distribution.

    If this option is not available to you, you can use an INSERT … SELECT request with a HASH BY clause to move the data from the skewed column‑partitioned or table into another column‑partitioned or NoPI table and then drop the original column‑partitioned or NoPI table.

    When you insert rows into a column‑partitioned table, you should always consider specifying a LOCAL ORDER BY clause to improve the effectiveness of autocompression.

  • Reconfiguration Server Utility
  • The Reconfiguration utility has a similar issue to that of the Restore and Copy client utilities when data is moved to a different configuration, particularly a configuration with more AMPs, which is normally the case when you expand your system.

    Although rows in a column‑partitioned or NoPI table are not hashed based on their primary index values, and the AMPs on which the rows reside are determined arbitrarily, each row does have a rowID with a hash bucket that is owned by the AMP storing that row.

    Teradata Database redistributes the rows in a column‑partitioned or NoPI table using the Reconfiguration utility by sending each row to the AMP that owns the hash bucket in that row based on the new configuration map. As is true for the Restore and Copy utilities, Reconfiguration can skew the distribution of rows for a column‑partitioned or nonpartitioned NoPI table by relocating them on a different configuration. And the more AMPs there are in a configuration, the fewer hash buckets there are to go around for the table.

    It is also possible that multiple hash buckets can be redistributed to the same AMP. With fewer AMPs, it is guaranteed that some AMPs have multiple hash buckets. Because of this, significantly more space can be required for a column‑partitioned or NoPI table than is required for a a comparable primary‑indexed table.

    To avoid this problem, you can use an INSERT … SELECT request to move the column‑partitioned or NoPI table rows into a primary‑indexed table before the Reconfiguration and delete all the rows from the column‑partitioned or NoPI table.

    After the Reconfiguration completes, you can use an INSERT … SELECT request to move the rows from the primary‑indexed table back into the column‑partitioned or NoPI table. For the primary index of the primary‑indexed table, consider using one that provides good distribution and, if appropriate, improves the effectiveness of autocompression for the insert back into the column‑partitioned or NoPI table.

    If the column‑partitioned or NoPI table is populated from other source tables, you can also delete all the rows from the column‑partitioned or NoPI table and then after the Reconfiguration completes, repopulate the column‑partitioned or NoPI table from the source tables using an INSERT … SELECT request with a HASH BY clause as needed to provide good distribution.

    If you do neither, and the row distribution for the column‑partitioned or NoPI table is skewed after the Reconfiguration, use an INSERT … SELECT request with a HASH BY clause from the skewed column‑partitioned or NoPI table into another column‑partitioned or NoPI table and then drop the original column‑partitioned or NoPI table.

    When inserting rows into a column‑partitioned table, always consider using a LOCAL ORDER BY clause to improve the effectiveness of autocompression.

  • Restore and Copy Client Utilities
  • When the AMP configuration, the hash function of the source system, and the target system are part of the same restore or copy operation, there is no data redistribution involved. As a result, the rows that were originally stored on an AMP are restored or copied to that same AMP.

    When this happens, the data demographics are identical between the source and target systems. The data demographics for a primary‑indexed table can change when the AMP configuration or the hash function of the source system and the target system is different. This is because Teradata Database rehashes the data in this case, and you restore or copy it to a different AMP on the target system from that on the source system.

    For a column‑partitioned or NoPI table, there is no rehash even if the hash function is different between the target and source systems because there is no primary index on which to do a rehash. The hash bucket in a row coming from the source system does not change when it is stored on the target system, so as long as there is no change in AMP configuration, rows in a column‑partitioned or NoPI table are restored or copied to the same AMP as they were on for the source system.

    A problem can occur when there is a change in AMP configuration. Because there is normally one hash bucket used on an AMP for a column‑partitioned or NoPI table, when the target system has more AMPs than the source system there are not enough hash buckets in the table to go to all of the AMPs, so some AMPs do not have any data after a restore or copy operation. This skews the column‑partitioned or NoPI table rows and can affect performance negatively. Furthermore, multiple hash buckets go to the same AMPs as the result of a restore or copy operation when the target system has fewer AMPs than the source system. If the restore or copy operation is to a target system with fewer AMPs than the source system, this might or might not happen. The result is that much more space could be required for a column‑partitioned or NoPI table than a primary‑indexed table.

    To avoid this problem, you can use an INSERT … SELECT request to move the column‑partitioned or NoPI table rows into a primary‑indexed table for archive, and do not archive the column‑partitioned or NoPI table. Then after restoring the rows, use an INSERT … SELECT request to move the rows from the primary‑indexed table into a column‑partitioned or NoPI table.

    Consider using a primary index that provides good row distribution for the primary‑indexed table and consider a primary index that provides good distribution and, if appropriate, improves the effectiveness of autocompression for the insertion back into the column‑partitioned or NoPI table.

    If the column‑partitioned or NoPI table is populated from other source tables, you can also archive the source tables (not the column‑partitioned or NoPI table) and then after restoring it, populate the column‑partitioned or NoPI table from the source tables using an INSERT … SELECT request with a HASH BY clause as needed to provide good distribution.

    If the column‑partitioned or NoPI table is part of the archive, for example, if the archive were made prior to knowing system configuration or hash would be different, do not restore the column‑partitioned or NoPI table, and after restoring the source tables, populate the column‑partitioned or NoPI table.

    If you do neither is done and the column‑partitioned or NoPI table is restored in a skewed state, use an INSERT … SELECT request with a HASH BY clause from the skewed column‑partitioned or NoPI table into another column‑partitioned or NoPI table and then drop the original column‑partitioned or NoPI table.

    When you insert rows into a column‑partitioned table, you should always consider using a LOCAL ORDER BY clause to improve the effectiveness of autocompression.