How it Works - Parallel Data Pump

Teradata® Parallel Data Pump Reference

Product: Parallel Data Pump
Release Number: 16.20
Published: September 2019
Language: English (United States)
Last Update: 2019-10-11

Teradata TPump is a Teradata utility with functions similar to the MultiLoad utility. Like MultiLoad, Teradata TPump edits Teradata tables by processing inserts, updates, and deletes. This section provides insight into the important differences between MultiLoad and Teradata TPump. All of the information in this section is discussed in further detail later in this document, either explicitly or by implication.

Methods of Operation

MultiLoad performs Teradata Database updates in phases. During the first phase of operation, MultiLoad uses a special database and CLIv2 protocol for efficiently sending large (64 KB) data messages to the database. The data is stored in a temporary table. During the second phase of operation, the temporary table is sorted, then changes from it are applied to various target tables. In this phase, processing is entirely in the database while MultiLoad on the client waits to see if the job completes successfully.

Teradata TPump performs Teradata Database updates asynchronously. Changes are sent in conventional CLIv2 parcels and applied immediately to target tables. To improve its efficiency, Teradata TPump builds multiple statement requests and provides the serialize option to help reduce locking overhead.
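
To make the flow concrete, the following is a minimal sketch of a Teradata TPump job script; the object names (logdb.joblog, mydb.cust, the custdata input file) are hypothetical, and the individual commands and options are described later in this document.

    .LOGTABLE logdb.joblog;                /* restart log table                 */
    .LOGON tdpid/user,password;
    .BEGIN LOAD SESSIONS 4 PACK 20;        /* pack up to 20 statements/request  */
    .LAYOUT custlayout;
    .FIELD CustNo * INTEGER;
    .FIELD Name   * CHAR(30);
    .DML LABEL inscust;
    INSERT INTO mydb.cust (CustNo, Name) VALUES (:CustNo, :Name);
    .IMPORT INFILE custdata LAYOUT custlayout APPLY inscust;
    .END LOAD;
    .LOGOFF;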

Economy of Scale and Performance

MultiLoad performance improves as change volume increases because, in phase two of MultiLoad, changes are applied to target tables in a single pass. All changes for any physical data block are effected using one read and one write of the block. Furthermore, the temporary table and the sorting process used by MultiLoad are additional overheads that must be “amortized” through the volume of changes.

Teradata TPump, on the other hand, performs better for relatively low change volume because there is no temporary table overhead. Teradata TPump becomes expensive for large volumes of data because multiple updates to a physical data block will most likely result in multiple reads and writes of the block.

Loading No Primary Index (NoPI) Tables

A NoPI table has no primary index. These tables can be used as staging tables where data is always appended to the table, making population of the table generally faster than that of a traditional table containing a primary index.

NoPI tables could increase performance for Teradata TPump Array INSERT.
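
For illustration, a NoPI staging table is created in SQL with the NO PRIMARY INDEX clause; the table and column names below are hypothetical:

    CREATE TABLE mydb.stage_sales (
        sale_id  INTEGER,
        amount   DECIMAL(10,2)
    ) NO PRIMARY INDEX;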

Multiple Statement Requests

The most important technique used by Teradata TPump to improve performance over MultiLoad is the multiple statement request. Placing more statements in a single request is beneficial for two reasons. First, it reduces network overhead because large messages are more efficient than small ones. Second, in ROBUST mode, it reduces Teradata TPump recovery overhead, which amounts to one extra database row written for each request. Teradata TPump automatically packs multiple statements into a request based upon the PACK specification in the BEGIN LOAD command.
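
For example, a BEGIN LOAD command along the following lines (the session count, pack factor, and error table name are hypothetical values) packs up to 40 statements into each request:

    .BEGIN LOAD
        SESSIONS 8
        PACK 40
        ERRORTABLE mydb.et_sales;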

The Teradata TPump PACK factor tells the client to send that many data records in the same request to the server for processing. When an error is detected on the server, the entire request is aborted and the error is sent back to the client. Teradata TPump then removes the data record that encountered the error from the request, puts it into the Teradata TPump error table, and resubmits the request to the server. Because errors are handled one at a time, a request that contains multiple errors causes multiple rollbacks and resubmissions, and performance can be poor.

Starting with Teradata TPump 14.00, a new client/server protocol resolves this error-handling performance issue. This feature handles multiple errors in the same request. If any data-related errors are detected on the server, only the associated statements are rolled back. Other statements that completed successfully in the same request are not rolled back. The server then sends one error parcel back to the client for each statement that encountered an error, and one success parcel back to the client for each statement that completed successfully.

Because the data records that did not cause an error are successfully processed by the server, Teradata TPump does not need to resubmit the request after removing the records that encountered errors; it simply continues with the next set of data records to be processed.

For more usage or functionality constraints, refer to SQL Data Manipulation Language (B035-1146).

Macro Creation

Rather than executing the actual DML statements directly, Teradata TPump modifies tables by executing equivalent macros. Converting the statements into equivalent macros before the job begins greatly improves performance.

Specifically, the benefits of using macros are:

  • The size of network (and channel) messages sent to the database by Teradata TPump is reduced.
  • Teradata Database parsing engine overhead is reduced because the execution plans (or steps) for macros are cached and re-used. This eliminates normal parser handling, where each request sent by Teradata TPump is planned and optimized.

Because the space required by macros is negligible, the only issue regarding macros is where they are placed in the database. Macros are put into the database that contains the restart log table or the database specified using the MACRODB keyword in the BEGIN LOAD command.
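
As a rough illustration of the technique (the actual macros Teradata TPump generates are internal, and the names and parameters below are hypothetical), an UPDATE statement in the script becomes a parameterized macro that is created once and then executed for each data record:

    CREATE MACRO logdb.TPumpMacro_0001 (
        P_CustNo INTEGER,
        P_Name   CHAR(30)
    ) AS (
        UPDATE mydb.cust
        SET   Name = :P_Name
        WHERE CustNo = :P_CustNo;
    );

    /* Each packed statement is then an EXEC of the cached macro */
    EXEC logdb.TPumpMacro_0001 (1001, 'Smith');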

Locking and Transactional Logic

In contrast to MultiLoad, Teradata TPump uses conventional row hash locking, which allows some amount of concurrent read and write access to its target tables. At any point, Teradata TPump can be stopped while retaining full accessibility to the target tables. Note, however, that if Teradata TPump is stopped, depending on the nature of the update process, the relational integrity of the data might be impaired.
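
For instance, because only row hash locks are held, a query that accepts an access lock (a possibly dirty read) can run against a target table while a Teradata TPump job is updating it; the table name here is hypothetical:

    LOCKING ROW FOR ACCESS
    SELECT COUNT(*) FROM mydb.cust;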

Teradata TPump's locking behavior differs from that of MultiLoad, which operates as a single logical update to one or more target tables. Once MultiLoad goes into phase two of its logic, the job is essentially irreversible and all target tables are locked for write access until the job completes.

If Teradata TPump operates on rows that have associated triggers, the triggers are invoked as necessary.

Recovery Logic and Overhead

In Teradata TPump’s ROBUST mode, one database row is written to the restart log table for every request that it issues. This collection of rows in the restart log table can be referred to as the request log. Because a request is guaranteed by the database to either completely finish or completely roll back, the request log always accurately reflects the completion status of a Teradata TPump import. Thus, the request log overhead for restart logic decreases as the number of statements packed per request increases.

Teradata TPump also allows a checkpoint interval to be specified. During the checkpoint process, Teradata TPump flushes all pending changes from the import file to the database and also cleans out the request log. The larger the checkpoint interval, the larger the request log (and its table) grows. Upon an unexpected restart, Teradata TPump scans the import data source along with the request log and re-executes the statements not found in the request log.

In Teradata TPump’s SIMPLE (non-ROBUST) mode, basic checkpoints are created. If a restart occurs between checkpoints, then some requests will likely be reprocessed. This is adequate protection under some circumstances.
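
Both behaviors are selected in the BEGIN LOAD command; a sketch with hypothetical values:

    .BEGIN LOAD
        SESSIONS 4
        ROBUST ON          /* write one request-log row per request */
        CHECKPOINT 15;     /* checkpoint interval                   */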

In contrast, phase one of MultiLoad uses checkpoints so restarts do not force a job to always restart from the beginning. During phase two, MultiLoad uses its temporary table as a repository of all changes to be applied. The database process of applying the changes guarantees that no changes are missed or applied more than once.

Serialization of Changes

In certain uses of Teradata TPump or MultiLoad, it is possible to have multiple changes to one row in a single job. For instance, a row might be inserted, then updated during the batch job, or it might be updated, then deleted. In any case, the correct ordering of these operations is obviously very important. MultiLoad automatically guarantees that this ordering of operations is maintained correctly. By using the serialization feature, Teradata TPump can accomplish the same ordering of operations, but to make it happen in Teradata TPump, a small amount of scripting work and utility overhead are required.

The use of the serialize option on the BEGIN LOAD command guarantees that Teradata TPump will send each change for a data record of a given key in order. The KEY modifier to the FIELD command is how a script specifies that a given field is to be part of the serialization key. The intent of this feature is to allow specification of the key corresponding to the primary index of the target table. In fact, the TABLE command automatically qualifies the generated fields with the KEY modifier when the fields are part of the primary index of the table. If the DML statements in the Teradata TPump script specify more than one target table then it is up to the script author to make sure that primary indices of all the tables match when using the serialization feature.
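
A sketch of the relevant script fragments (layout and field names hypothetical): the serialize option is set in the BEGIN LOAD command, and the KEY modifier marks the fields that form the serialization key:

    .BEGIN LOAD
        SESSIONS 8
        SERIALIZE ON;
    .LAYOUT custlayout;
    .FIELD CustNo * INTEGER KEY;   /* part of the serialization key */
    .FIELD Name   * CHAR(30);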

The serialization feature works by hashing each data record based upon its key to determine which session transmits the record to the database. Thus the extra overhead in the application is derived from the mathematical operation of hashing and from the extra amount of buffering necessary to save data rows when a request is already pending on the session chosen for transmission.

The serialization feature greatly reduces the potential frequency of database deadlock. Deadlocks can occur when requests for the application happen to affect row(s) that use the same hash code within the database. Although deadlocks are handled by the database and by Teradata TPump correctly, the resolution process is time-consuming and adds additional overhead to the application because it must re-execute requests that roll back due to deadlock.

In addition to the load-level serialize option in the BEGIN LOAD command, the SERIALIZEON keyword can be specified in the DML command. This allows serialization to be turned on for the specified fields. For more information on the DML-based serialization feature, refer to DML.
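
A sketch of the DML-level form, with hypothetical names, where serialization is keyed on the CustNo field for just this DML group:

    .DML LABEL updcust SERIALIZEON (CustNo);
    UPDATE mydb.cust SET Name = :Name WHERE CustNo = :CustNo;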

Dual Database Strategy

The serialization feature is intended to support a variety of other potential customer applications that fall under the general heading of dual database. These are applications that in some way take a live feed of inserts, updates, and deletes from another database and apply them, without any preprocessing, to Teradata Database.

Both Teradata TPump and MultiLoad are potential parts of the dual database strategy. A dual database application will generate a DML stream which will be routed to Teradata TPump or MultiLoad through a paramod/inmod specific to the application. The choice between Teradata TPump or MultiLoad will depend on such things as the volume of data (with higher volumes favoring MultiLoad) and the concurrent access requirements (with greater access requirements favoring Teradata TPump).

Resource Usage and Limitations

A feature unique to Teradata TPump is the ability to constrain runtime resource usage through the statement rate feature. Teradata TPump provides control over the rate per minute at which statements are sent to the database, and the statement rate correlates directly to resource usage both on the client and in the database. The statement rate can be controlled in two ways: dynamically while the job is running, or scripted into the job with the RATE keyword on the BEGIN LOAD command. Dynamic control over the statement rate is provided by updates to a table on the database.
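
For the scripted form, a sketch with a hypothetical rate:

    .BEGIN LOAD
        SESSIONS 4
        RATE 1200;   /* send at most 1200 statements per minute */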

In contrast with Teradata TPump, MultiLoad always uses CPU and memory very efficiently. During phase one (assuming that the database is not a bottleneck), MultiLoad will probably bottleneck on the client, consuming significant network or channel resources. During phase two, MultiLoad uses very significant database disk, CPU, and memory resources. In fact, the database limits the number of concurrent MultiLoad, FastLoad, and FastExport jobs for the very reason that they are so resource-intensive. Teradata TPump has no such database-imposed limitation.

Although Teradata Database imposes no limitation on the number of concurrent Teradata TPump jobs that are possible, an excessive number of small jobs causes contention on the Teradata Database system catalog. The limit will vary from one installation to another, and each installation should determine its own capacity for running a multiplicity of Teradata TPump jobs to avoid potential deadlocks.