15.10 - Writing Job Scripts for Scalable Performance - Parallel Transporter

Teradata Parallel Transporter User Guide

Writing Job Scripts for Scalable Performance

Using Multiple Operator Instances for Scalability

As a multi-process application built on a parallel and scalable framework, Teradata PT makes it possible to use additional CPU processing power, shorten the load process, and reduce overall job execution time. You can specify the number of operator instances in your job script, which gives you control over the scalability and performance of the data loading process.

In addition, Teradata PT allows data extraction and data loading to run completely asynchronously from each other. This supports broader parallelism, which further improves performance.

Traditional Teradata standalone utilities, such as FastLoad, MultiLoad, and TPump, rely on a single system process to perform both data extraction and loading. A single process can reach a threshold beyond which CPU speed cannot increase, and that threshold becomes a critical limiting factor for the whole load.
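The idea of independent extraction and loading instances can be illustrated with a minimal Python sketch. This is a conceptual analogy only, not Teradata PT code: `run_job` and its parameters are hypothetical names, and threads are used for brevity where Teradata PT uses separate processes. Producer instances extract rows into a shared buffer while consumer instances load them, so neither side waits for the other to finish.

```python
import queue
import threading

def run_job(rows, producer_instances=2, consumer_instances=3):
    """Hypothetical helper: extract `rows` with several producer threads and
    load them with several consumer threads through one shared buffer."""
    buffer = queue.Queue(maxsize=100)  # decouples extraction from loading
    loaded = []                        # stands in for the target table
    lock = threading.Lock()
    done = object()                    # sentinel that tells a consumer to stop

    def produce(instance):
        # Each producer instance extracts its own slice of the input.
        for row in rows[instance::producer_instances]:
            buffer.put(row)

    def consume():
        while True:
            row = buffer.get()
            if row is done:
                break
            with lock:
                loaded.append(row)

    producers = [threading.Thread(target=produce, args=(i,))
                 for i in range(producer_instances)]
    consumers = [threading.Thread(target=consume)
                 for _ in range(consumer_instances)]
    for t in producers + consumers:
        t.start()
    for t in producers:
        t.join()
    for _ in consumers:
        buffer.put(done)               # one sentinel per consumer instance
    for t in consumers:
        t.join()
    return loaded

result = run_job(list(range(1000)), producer_instances=4, consumer_instances=3)
```

The instance counts are plain arguments here, mirroring the way a job script lets you choose the number of operator instances; the order in which rows arrive in `loaded` is not deterministic.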

Using Directory Scan for Loading Files in Parallel

Teradata PT provides a feature called Directory Scan that enables data files in a directory to be processed in a parallel and scalable manner as part of the loading process. In addition, if multiple directories are stored across multiple disks, a Teradata PT feature called UNION ALL can be used to process these directories of files in parallel, achieving higher throughput across disks. See “Combining Multiple Sources using UNION ALL” below.

Directory Scan also lets users select files for processing based on file names, which can include wildcard specifications. The DataConnector operator provides scalable and parallel access to multiple files in a load-balancing manner. Load balancing here means that the files are distributed among the operator instances as evenly as possible, based on file sizes.
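Wildcard selection followed by size-based balancing can be sketched in Python as follows. The source does not describe the algorithm Teradata PT actually uses, so this sketch assumes a common greedy heuristic (largest file first, onto the least-loaded instance); `assign_files` and the sample file names are hypothetical.

```python
import fnmatch
import heapq

def assign_files(files, pattern, instances):
    """files: mapping of file name -> size in bytes (hypothetical input).
    Returns one list of file names per operator instance."""
    # Keep only the files whose names match the wildcard pattern.
    selected = {name: size for name, size in files.items()
                if fnmatch.fnmatchcase(name, pattern)}
    # Min-heap of (bytes assigned so far, instance index).
    heap = [(0, i) for i in range(instances)]
    assignment = [[] for _ in range(instances)]
    # Largest files first; each goes to the currently least-loaded instance.
    for name, size in sorted(selected.items(), key=lambda kv: (-kv[1], kv[0])):
        load, idx = heapq.heappop(heap)
        assignment[idx].append(name)
        heapq.heappush(heap, (load + size, idx))
    return assignment

files = {
    "sales_01.dat": 900,
    "sales_02.dat": 500,
    "sales_03.dat": 400,
    "notes.txt": 50,
}
plan = assign_files(files, "sales_*.dat", instances=2)
# plan == [["sales_01.dat"], ["sales_02.dat", "sales_03.dat"]]
```

In this example both instances receive 900 bytes of input, and `notes.txt` is skipped because it does not match the pattern.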

By contrast, Teradata standalone utilities, such as FastLoad, MultiLoad, and TPump, allow only one file to be processed at a time.

Note: The Directory Scan functionality is supported when using the HDFS API interface to process Hadoop files, but is not supported when using the TDHC-TPT interface to process Hadoop files and tables. For more information, see “Processing Hadoop Files and Tables” in Chapter 3 of the Teradata Parallel Transporter Reference.

Combining Multiple Sources using UNION ALL

Similar to the SQL UNION ALL operation, which allows multiple UNION-compatible tables to be combined, the Teradata PT UNION ALL feature allows similar or dissimilar data sources to be combined into a single source that can be processed in a parallel and scalable manner. This operation also eliminates the need to merge multiple data sources manually before loading.
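As a rough conceptual illustration, again in Python rather than Teradata PT script syntax, combining dissimilar sources amounts to reading each one in parallel, mapping its rows to a common record layout, and feeding everything into one stream without removing duplicates. The function `union_all`, the `(id, amount)` layout, and the sample rows are all hypothetical.

```python
import queue
import threading

def union_all(*sources):
    """Each source is a (rows, to_record) pair, where to_record maps a
    native row to the common (id, amount) layout. Duplicates are kept."""
    merged = queue.Queue()  # the single combined stream

    def read(rows, to_record):
        for row in rows:
            merged.put(to_record(row))

    # One reader thread per source, so the sources are read concurrently.
    threads = [threading.Thread(target=read, args=src) for src in sources]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    combined = []
    while not merged.empty():
        combined.append(merged.get())
    return combined

def parse_line(line):
    # Source 1 holds delimited text lines such as "1,10.5".
    ident, amount = line.split(",")
    return (int(ident), float(amount))

def parse_dict(row):
    # Source 2 holds dictionaries with the same two fields.
    return (row["id"], row["amount"])

text_rows = ["1,10.5", "3,2.0"]
dict_rows = [{"id": 1, "amount": 10.5}]
combined = union_all((text_rows, parse_line), (dict_rows, parse_dict))
```

The record `(1, 10.5)` appears twice in `combined`, once from each source, which matches the ALL semantics of the SQL operation; the relative order of records from different sources is not guaranteed.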

As shown in Figure 49, multiple copies of access modules can be launched by multiple instances of the DataConnector operator for reading transactional data from the same or different message queues. This parallel arrangement, which enables data parallelism, can significantly improve the performance of data extraction.

Figure 49: Parallel Reading of MQ via UNION ALL