15.00 - Bulk Loading a Column-Partitioned Table - Teradata Database

Teradata Database Design

prodname
Teradata Database
vrm_release
15.00
category
User Guide
featnum
B035-1094-015K

Bulk Loading a Column‑Partitioned Table

The expected method of populating a column‑partitioned table is an INSERT … SELECT request from one or more source tables.

If the data is from an external source, you can use FastLoad to load the data to a staging table and then populate the column‑partitioned table with an INSERT … SELECT request. You can also populate a column‑partitioned table from an external source using a Teradata Parallel Data Pump array INSERT to insert data into the column‑partitioned table. However, Teradata Parallel Data Pump is not expected to be as efficient as a FastLoad and INSERT … SELECT request.

You should rarely update column‑partitioned tables using single-row INSERT requests because such requests can cause a large degradation in performance by needing to append a column partition value to each of the column partitions.

Inserting data into a column‑partitioned table is expected to gain some efficiency over a primary indexed‑table because the data is just appended to the end of a table or row partition, and the rows are not required to be in any particular order. However, this can be negatively offset for a column‑partitioned table by the need to transform rows into columns.

Bulk inserts using an INSERT … SELECT request or Teradata Parallel Data Pump array INSERT can minimize this impact for column‑partitioned tables because they can apply the transformation of multiple rows to columns as a set instead of individually. Teradata Parallel Data Pump array inserts can group as many data rows as are allowed in an AMP step because the rows are not required to be sent to a specific AMP.

The transformation from rows to columns performs better if the number of column partitions (not including the delete column partition) does not exceed the number of available column partition contexts; otherwise, Teradata Database must make additional scans of the source data.

For INSERT … SELECT (without a HASH BY clause) requests into a NoPI or column‑partitioned target table, Teradata Database does not redistribute data from the source table, but instead locally appends it to the target NoPI table. Keep in mind that this results in skew in the target table if the source is skewed. The source can be skewed either after a series of joins or after applying single-table predicates. You can use the HASH BY option to redistribute the rows to a NoPI table or column‑partitioned table to avoid skewing by specifying RANDOM or by choosing good expressions to hash on.

Fallback and index maintenance for a column‑partitioned table are done the same way as fallback and index maintenance for a primary‑indexed table.

The following list of considerations for inserting data into a NoPI table apply equally well to a column‑partitioned table.

  • Because they do not have a primary index, Teradata Database does not hash rows based on any column in a NoPI table. However, the system still generates a rowID for each row in a NoPI table. The process is that Teradata Database selects a hash bucket from the NoPI table hash map that an AMP owns. It then uses that hash bucket to generate a rowID. This strategy helps make fallback and index maintenance comparable to the maintenance on a primary‑indexed table.
  • For single‑statement INSERT requests, multistatement INSERT requests, and array INSERT operations into a NoPI table, Teradata Database sends the rows to the AMPs through a random generator. This random generator is designed in such a way that for a new request, data is generally sent to a different AMP from the one to which the previous request sent data. The concept is to balance the data among the AMPs as much as possible without the use of a primary index.
  • For INSERT … SELECT requests into a column‑partitioned or nonpartitioned NoPI target table, the SELECT component of the request can be either a simple SELECT (retrieving all data from the source table) or a complex SELECT (retrieving data from one or more source tables). Spooling the source table before inserting the new rows can be avoided for a simple SELECT in some cases.
  • Teradata Database can send data to any AMP in complete blocks prior to inserting the rows into a target column‑partitioned table. No data redistribution is required for each individual row. This is particularly beneficial for Teradata Parallel Data Pump array INSERT.
  • The row hash for each row in a column‑partitioned table is internally controlled and generated, so rows are always appended at the end of the table (if the table does not also have row partitioning) or combined partition (if the table also has row partitioning), and never inserted in a middle of a row hash. The result is that the rows need not be sorted if the column‑partitioned table does not have row partitioning.
  • There are several performance advantages for bulk INSERT operations.

  • Inserting data into a column‑partitioned table is somewhat more efficient than inserting into a primary‑indexed table because the data is appended to the end of a table or row partition and the rows are not required to be in any particular order.
  • This can be negatively offset for a column‑partitioned table by the need to transform from rows to columns.

    Bulk insert operations using an INSERT … SELECT request or Teradata Parallel Data Pump array INSERT operation can minimize this negative offset for column‑partitioned tables because they can apply the transformation of multiple rows to columns as a set instead of individually. Teradata Parallel Data Pump array INSERT operations can group as many rows as allowed in an AMP step because the rows are not required to be sent to a specific AMP.

    The transformation from rows to columns performs better if the number of column partitions (not including the DELETE column partition) does not exceed the number of available column partition contexts; otherwise, Teradata Database must perform additional scans of the source data.

  • If there is no row partitioning for the table, there is no need to sort the data.
  • Although Teradata Parallel Data Pump array INSERT operations do not have an explicit sort phase of rows in the array that is being inserted into a primary‑indexed table, the file system does an implicit memory‑resident sort of the data. This sort work is not done when data is appended to the end of the table for a nonpartitioned NoPI or column‑partitioned table that does not have row partitioning.

  • For INSERT … SELECT operations that do not specify a HASH BY clause into NoPI and column‑partitioned target tables, Teradata Database does not redistribute the data from the source table. Instead, Teradata Database appends the data locally into the target table.
  • This causes skew in the target table if the source is skewed, which can happen either after a series of joins or after applying single-table predicates. You can specify the HASH BY option to redistribute the rows to a NoPI or column‑partitioned table to avoid skewing by specifying RANDOM or by specifying good hashing expressions.

  • Data can be sent to and stored on any AMP.
  • Another performance advantage is that there is no requirement that a NoPI or column‑partitioned table row must be stored on any particular AMP. This is very useful for Teradata Parallel Data Pump array INSERT operations because data sent in a buffer from Teradata Parallel Data Pump can all be combined into the same step going to the same AMP for insertion.

    In contrast to this, Teradata Database generally splits the into multiple steps targeting the destination AMP for primary‑indexed table data. The performance impact increases as the number of AMPs in the system increases. With fewer steps to process, the CPU and I/O burdens on the system are both reduced.

    For single-row INSERT operations, the performance for a NoPI table compared to a primary‑indexed table is not significantly different. Appending a row to a NoPI table is somewhat more efficient than inserting a row into a primary‑indexed table in the middle of a hash value, but the end of transaction processing, including the flushing of the WAL log, remains the same. Column‑partitioned tables impose the additional performance burden of transforming the row into the column partitions.

    There is a potential disadvantage if the distribution of inserted rows to the AMPs is skewed. In most cases, the method of inserting data into both NoPI tables and column‑partitioned tables leads to a fairly even distribution of the rows. If the distribution of rows to the AMPs is skewed, you can populate the table using the HASH BY option with an INSERT … SELECT request (see SQL Data Manipulation Language for information about the HASH BY option).

    Teradata Database handles an INSERT … SELECT request in one of the following ways depending on the source table, the target table, and the number of available column partition contexts.

     

    IF there are …

    THEN Teradata Database reads the source table rows …

    sufficient column partition contexts available for the target column‑partitioned table

    a block‑at‑a‑time after they are spooled (if that is required), and then it builds, for each row in the block and then for each column partition, a column partition value and appends it to the last container (if there is no remaining container, Teradata Database starts a new one) of the corresponding internal partition or, if the column partition has ROW format, the subrow is written out.

    insufficient available column partition contexts available for the target column‑partitioned table

    for one set of column partitions and then reads again for another column partition until all the column partition values are appended.

    This eliminates multiple writes of the last container of internal partitions at the cost of multiple reads of the source, but reading is much less expensive than writing.