Bulk Loading a Column‑Partitioned Table
The expected method of populating a column‑partitioned table is an INSERT … SELECT request from one or more source tables.
If the data is from an external source, you can use FastLoad to load the data into a staging table and then populate the column‑partitioned table with an INSERT … SELECT request. You can also populate a column‑partitioned table from an external source using a Teradata Parallel Data Pump array INSERT. However, Teradata Parallel Data Pump is not expected to be as efficient as FastLoad followed by an INSERT … SELECT request.
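For example, the following is a minimal sketch of the staging approach. All table and column names are hypothetical, and the sketch assumes FastLoad has already loaded the external data into the NoPI staging table.

    /* Hypothetical NoPI staging table. FastLoad requires an empty target
       table, and a NoPI staging table avoids hash redistribution
       during the load. */
    CREATE TABLE sales_stg (
        store_id  INTEGER,
        sale_date DATE,
        amount    DECIMAL(10,2)
    ) NO PRIMARY INDEX;

    /* Hypothetical column-partitioned target table. */
    CREATE TABLE sales_cp (
        store_id  INTEGER,
        sale_date DATE,
        amount    DECIMAL(10,2)
    ) NO PRIMARY INDEX
    PARTITION BY COLUMN;

    /* After FastLoad populates sales_stg, copy the rows into the
       column-partitioned table with a single bulk request. */
    INSERT INTO sales_cp
    SELECT store_id, sale_date, amount
    FROM sales_stg;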
You should rarely populate column‑partitioned tables using single-row INSERT requests because such requests can cause a large degradation in performance: each inserted row requires appending a column partition value to each of the column partitions.
Inserting data into a column‑partitioned NoPI or PA table is expected to gain some efficiency over a primary-indexed table because the data is simply appended to the end of the table or row partition, and the rows are not required to be in any particular order. However, this gain can be offset for a column‑partitioned table by the need to transform rows into columns.
Bulk inserts using an INSERT … SELECT request or a Teradata Parallel Data Pump array INSERT can minimize this impact for column‑partitioned tables because they can apply the transformation of multiple rows to columns as a set instead of individually. Teradata Parallel Data Pump array INSERT operations can group as many data rows as are allowed in an AMP step because the rows are not required to be sent to a specific AMP.
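The following Teradata Parallel Data Pump script is a hedged sketch of an array INSERT into the hypothetical sales_cp table; the PACK value controls how many data rows are grouped into one request. All object and file names are illustrative, so consult the Teradata Parallel Data Pump reference for exact syntax.

    .LOGTABLE tpump_restartlog;
    .LOGON tdpid/user,password;
    .BEGIN LOAD SESSIONS 4 PACK 600 ERRORTABLE sales_err;
    .LAYOUT sales_layout;
    .FIELD store_id  * INTEGER;
    .FIELD sale_date * CHAR(10);
    .FIELD amount    * DECIMAL(10,2);
    .DML LABEL ins_sales;
    /* The array INSERT groups up to PACK rows into one request. */
    INSERT INTO sales_cp (store_id, sale_date, amount)
        VALUES (:store_id, :sale_date, :amount);
    .IMPORT INFILE sales.dat LAYOUT sales_layout APPLY ins_sales;
    .END LOAD;
    .LOGOFF;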
The transformation from rows to columns performs better if the number of column partitions (not including the delete column partition) does not exceed the number of available column partition contexts; otherwise, Teradata Database must make additional scans of the source data.
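If a table would otherwise have more column partitions than there are available column partition contexts, one option is to group columns into fewer column partitions when you define the table. The following is a hedged sketch with hypothetical names, assuming the COLUMN(...) grouping form of the CREATE TABLE partitioning syntax:

    /* Four columns in two multicolumn partitions (plus the internal
       delete column partition) instead of four single-column partitions. */
    CREATE TABLE wide_cp (
        c1 INTEGER,
        c2 INTEGER,
        c3 INTEGER,
        c4 INTEGER
    ) NO PRIMARY INDEX
    PARTITION BY COLUMN (COLUMN(c1, c2), COLUMN(c3, c4));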
For INSERT … SELECT requests (without a HASH BY clause) into a NoPI or column‑partitioned NoPI target table, Teradata Database does not redistribute data from the source table, but instead appends it locally to the target table. Keep in mind that this results in skew in the target table if the source is skewed; the source can be skewed either after a series of joins or after applying single-table predicates. To avoid this skew, you can use the HASH BY option to redistribute the rows to the NoPI or column‑partitioned NoPI table, either by specifying RANDOM or by choosing good expressions to hash on.
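For example, in this hedged sketch with hypothetical names, HASH BY RANDOM redistributes the selected rows evenly across the AMPs regardless of any skew introduced by the join or the predicate:

    INSERT INTO sales_cp
    SELECT s.store_id, s.sale_date, s.amount
    FROM skewed_source s
    WHERE s.sale_date >= DATE '2024-01-01'
    HASH BY RANDOM;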
The following considerations for inserting data into a NoPI table apply equally to a column‑partitioned NoPI table.
There are several performance advantages for bulk INSERT operations. These advantages can be partly offset for a column‑partitioned NoPI or PA table by the need to transform rows into columns.
Bulk insert operations using an INSERT … SELECT request or a Teradata Parallel Data Pump array INSERT operation can minimize this offset for column‑partitioned tables because they can apply the transformation of multiple rows to columns as a set instead of individually. Teradata Parallel Data Pump array INSERT operations into a NoPI table can group as many rows as are allowed in an AMP step because the rows are not required to be sent to a specific AMP.
The transformation from rows to columns performs better if the number of column partitions (not including the delete column partition) does not exceed the number of available column partition contexts; otherwise, Teradata Database must perform additional scans of the source data.
Although Teradata Parallel Data Pump array INSERT operations do not have an explicit sort phase of rows in the array that is being inserted into a primary‑indexed table, the file system does an implicit memory‑resident sort of the data. This sort work is not done when data is appended to the end of the table for a NoPI or PA table that does not have row partitioning.
Because the data is appended locally rather than redistributed, skew in the source causes skew in the target table; the source can be skewed either after a series of joins or after applying single-table predicates. You can specify the HASH BY option to redistribute the rows to a NoPI table to avoid skewing, either by specifying RANDOM or by specifying good hashing expressions.
Another performance advantage is that there is no requirement that a NoPI table row must be stored on any particular AMP. This is very useful for Teradata Parallel Data Pump array INSERT operations because data sent in a buffer from Teradata Parallel Data Pump can all be combined into the same step going to the same AMP for insertion.
In contrast, Teradata Database generally must split the buffer into multiple steps, each targeting the destination AMP, for PI or PA table data. The performance impact increases as the number of AMPs in the system increases. With fewer steps to process, the CPU and I/O burdens on the system are both reduced.
For single-row INSERT operations, the performance of a NoPI table compared to a primary-indexed table is not significantly different. Appending a row to a NoPI table is somewhat more efficient than inserting a row into its hash-determined position in a primary-indexed table, but end-of-transaction processing, including the flushing of the WAL log, remains the same. Column‑partitioned tables impose the additional performance burden of transforming the row into the column partitions.
There is a potential disadvantage if the distribution of inserted rows to the AMPs is skewed. In most cases, the method of inserting data into NoPI tables leads to a fairly even distribution of the rows. If the distribution of rows to the AMPs is skewed, you can populate the NoPI table using the HASH BY option with an INSERT … SELECT request (see SQL Data Manipulation Language for information about the HASH BY option).
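As a variation on the HASH BY RANDOM example earlier, this hedged sketch with hypothetical names hashes on chosen expressions; expressions with many distinct, evenly spread values keep the redistribution balanced:

    INSERT INTO sales_cp
    SELECT store_id, sale_date, amount
    FROM sales_stg
    HASH BY store_id, sale_date;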
Teradata Database handles an INSERT … SELECT request in one of the following ways depending on the source table, the target table, and the number of available column partition contexts.
IF there are sufficient column partition contexts available for the target column‑partitioned table, THEN Teradata Database reads the source table rows a block at a time after they are spooled (if that is required). For each row in the block and then for each column partition, it builds a column partition value and appends it to the last container of the corresponding internal partition (if there is no remaining container, Teradata Database starts a new one) or, if the column partition has ROW format, writes out the subrow.

IF there are insufficient column partition contexts available for the target column‑partitioned table, THEN Teradata Database reads the source table rows for one set of column partitions and then reads them again for another set of column partitions until all the column partition values are appended. This eliminates multiple writes of the last container of internal partitions at the cost of multiple reads of the source, but reading is much less expensive than writing.