Variables in the TPump Script - Parallel Data Pump

Teradata® Parallel Data Pump Reference

Product
Parallel Data Pump
Release Number
17.10
Published
June 2021
Language
English (United States)
Last Update
2021-07-01
dita:mapPath
oqw1608578437373.ditamap
dita:ditavalPath
ovd1619195504008.ditaval
dita:id
B035-3021
lifecycle
previous
Product Category
Teradata Tools and Utilities

For the most part, you may choose how to set TPump variables in the script. The key variables are PACK factor, Number of SESSIONS, ARRAYSUPPORT (ON or OFF) and SERIALIZE (ON or OFF).

PACK Factor
The PACK factor is one of the key performance enablers in TPump, and represents the number of rows that each session’s buffer will hold. The buffer is approximately 64KB in size. The default value of PACK is 20. In general, a higher pack factor usually yields better performance than a smaller pack factor. Try PACKMAXIMUM in a test environment to have TPump determine maximum Pack. The optimal PACK factor value is in a UTY6652 message.
SESSIONS
This option of the BEGIN LOAD command tells TPump the maximum (and optionally the minimum) number of sessions to transfer data to the Teradata RDBMS.
ARRAYSUPPORT

This feature offers a new data-driven iteration capability that provides an improved way for TPump and other SQL clients to iterate a parameterized DML statement for multiple sets of parameter values within a single request. ARRAYSUPPORT increases the amount of data that can be sent per request from a total of approximately 64KB to a total of approximately 1MB, with a limit of 64KB per input data row.

SERIALIZE

This option of the BEGIN LOAD command can be used to ensure the order of application of records and/or to reduce row hash lock contention between multiple SESSIONS. TPump will calculate a numeric value based on the user’s selection for the USE option. The resulting numeric value is then reduced modulo the number of sessions to determine which session gets the data record. SERIALIZE ensures that all input data records that specify the same primary index value are submitted to the Teradata RDBMS on the same session. In the case of TPump, the user must properly identify the columns of the primary index to TPump via the KEY option on the FIELD command.

These variables are correlated. Here are some recommendations:

  • A general TPump throughput recommendation is “Pack up to maximum and Sessions Up until trouble”. If experiencing AWT, CPU, or hash collision problems, reduce sessions or use SERIALIZE ON. When a throughput reduction is observed at a higher number of SESSIONS for a given PACK, reduce the number of SESSIONS by 8 or 16 from current session count.
  • Keep SESSIONS the same to maintain throughput with SERIALIZE ON but a lower PACK might be required. Starving sessions with lower pack factor slows the row rates but also reduces the amount and duration of hash contention. Reduce packs by 15% for tables with medium-to-large row size; reduce packs by 25% for tables with small row size. This is often a consideration when consecutive input data records have the same primary index value. A large number of hash synonyms on the same session with a high PACK value reduces throughput.
  • Records with same primary index hash to the same session with SERIALIZE ON. Data “clumps” with fewer sessions can slow throughput. Data “clumps” often arise if the input data file is sorted and/or consecutive input data records have the same primary index value.
  • Increasing PACK with the same session footprint will increase rows-per-second throughput. Packing improves network/channel efficiency by reducing the number of sends and receives between the application and the database. However, TPump will automatically reduce the PACK factor if certain internal limits are reached. The user may consider lowering PACK when experiencing hash collisions with SERIALIZE, data errors, or unacceptable latency due to filling a large PACK buffer.
  • With SERIALIZE ON, session count needs to be kept as an odd number. This avoids an unused session due to the modulo arithmetic used to determine the session to which the input data record is assigned.
  • Using ARRAYSUPPORT could give better throughput. Specifically, TPump dynamically determines the optimal PACK factor for input data with variable-length fields in ARRAYSUPPORT mode. The user sets the PACKMAXIMUM option, and TPump will then fill up to that or until the buffer is full on a request by request basis. Doing so will not cause problems in statement cache; it is the PA (Parameter Array) that receives the most performance benefit from higher PACK factor. Similarly, for NOPI, TPump will benefit from higher PACK factor. The optimal PACK factor is established with the restriction of the total bytes not exceeding 1MB.
  • Try to use the highest practical PACK with ARRAYSUPPORT and its 1MB buffer. If ARRAYSUPPORT cannot be used, the buffer size is 64K. The goal is to pack the buffer with a reasonably high number of rows. One approach is to calculate the approximate row size of the table, determine how many rows fit in that buffer, and then reduce the number rows by 20% to account for row size variability and communication overhead.