
Teradata Parallel Transporter Reference

Product: Parallel Transporter
Release Number: 15.00
Language: English (United States)
Last Update: 2018-09-27
dita:id: B035-2436
lifecycle: previous
Product Category: Teradata Tools and Utilities

Processing Hadoop Files and Tables

In addition to reading from and writing to flat files and access modules, the DataConnector operator can read and write Hadoop files and tables. Based on the set of attributes submitted with the DataConnector operator, one of two Hadoop interfaces is used.

HDFS API Interface

  • When the HadoopHost attribute is specified with the DataConnector operator, the operator uses the HDFS API interface to process Hadoop files.
  • The TPT HDFS interface can be invoked on the cluster or on a remote Hadoop client. The HadoopHost attribute identifies the cluster on which the HDFS operation is to be performed, and its presence activates HDFS processing. Once activated, all standard file functions and features of the DataConnector operator are performed on the HDFS file system (see the sketch after this list).
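
The following is a minimal sketch of such a producer definition, assuming a schema named SOURCE_SCHEMA is defined elsewhere in the job; the host name, user, and file path are hypothetical. Only HadoopHost and HadoopUser are Hadoop-related here; FileName, Format, TextDelimiter, and OpenMode are standard DataConnector attributes, applied to the HDFS file system.

    DEFINE OPERATOR HDFS_READER
    DESCRIPTION 'DataConnector producer reading a delimited HDFS file'
    TYPE DATACONNECTOR PRODUCER
    SCHEMA SOURCE_SCHEMA
    ATTRIBUTES
    (
        VARCHAR HadoopHost    = 'namenode.example.com', /* activates the HDFS API interface */
        VARCHAR HadoopUser    = 'tptuser',              /* hypothetical HDFS user name */
        VARCHAR FileName      = '/user/tptuser/source.txt',
        VARCHAR Format        = 'Delimited',
        VARCHAR TextDelimiter = '|',
        VARCHAR OpenMode      = 'Read'
    );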

TDCH-TPT Interface

When any Hadoop attributes besides HadoopHost and HadoopUser are submitted with the DataConnector operator, the operator uses the TDCH-TPT interface to process Hadoop files and tables. Throughout the rest of this section, these Hadoop attributes are referred to as TDCH-specific Hadoop attributes.

Teradata Connector for Hadoop

The Teradata Connector for Hadoop (TDCH) is a set of APIs and tools that support high-performance, parallel, bidirectional data movement between Teradata systems and products in the Hadoop ecosystem. TDCH is built atop the MapReduce framework and utilizes its distributed nature to offer scalability and performance when transferring data between Teradata systems and Hadoop. For more information about TDCH, see the Teradata Connector for Hadoop tutorial.

Utilizing TDCH in TPT Scripts via the TDCH-TPT Interface

The TDCH-TPT interface is a bridge between TPT and TDCH. It extends TDCH to support Hadoop file and table transfers to TPT, and vice versa. This interface gives TPT users the ability to use all of the pre-existing TDCH functionality within a TPT script, and gives TDCH users the ability to use TPT-specific functionality alongside TDCH.

When a TPT job script includes the DataConnector operator with any of the TDCH-specific Hadoop attributes, the DataConnector operator launches a TDCH job using the TDCH-specific Hadoop attributes supplied in the TPT script. After TDCH validates the attribute values and fills in defaults for any missing attributes, it submits the job to the MapReduce framework. Once the map tasks have been initialized on the nodes in the Hadoop cluster, they connect to the DataConnector operator and begin transferring data. A script sketch follows.
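
The following is a minimal sketch of a producer definition that activates the TDCH-TPT interface, again assuming SOURCE_SCHEMA is defined elsewhere in the job; the host, user, source path, and the 'textfile' format value are hypothetical. Because HadoopSourcePaths and HadoopFileFormat are TDCH-specific Hadoop attributes, their presence causes the operator to launch a TDCH job rather than use the HDFS API interface.

    DEFINE OPERATOR TDCH_READER
    DESCRIPTION 'DataConnector producer that launches a TDCH job'
    TYPE DATACONNECTOR PRODUCER
    SCHEMA SOURCE_SCHEMA
    ATTRIBUTES
    (
        VARCHAR HadoopHost        = 'namenode.example.com',
        VARCHAR HadoopUser        = 'tptuser',
        VARCHAR HadoopSourcePaths = '/user/tptuser/source_dir', /* supersedes FileName */
        VARCHAR HadoopFileFormat  = 'textfile'                  /* supersedes Format */
    );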

Limitations of the TDCH-TPT Interface

  • To use the TDCH-TPT interface, the node on which TPT is running must have the Hadoop client jars installed, because the DataConnector operator must be able to launch a MapReduce job through a call to the Hadoop CLI.
  • The TDCH-TPT interface is supported only on the Linux platform.
  • Because the DataConnector operator relies on TDCH to read from and write to Hadoop files and tables, many of the traditional DataConnector operator attributes are not supported with the TDCH-TPT interface. For example, when using the DataConnector producer, the FileName attribute is superseded by the TDCH-specific HadoopSourcePaths attribute. Similarly, the Format attribute is superseded by the TDCH-specific HadoopFileFormat attribute. If an unsupported attribute is submitted alongside TDCH-specific Hadoop attributes, the TPT job fails.
  • Because of TDCH's batch-processing nature, many of the DataConnector operator's active data warehousing features are not supported when using the TDCH-TPT interface. For example, because the MapReduce job processes data out of order, the DataConnector checkpoint/restart feature is unavailable when using the TDCH-TPT interface. Similarly, because the TDCH job requires a single file or table name as an argument, the DataConnector directory scan feature is unavailable when using the TDCH-TPT interface. If an unsupported feature is used with the TDCH-TPT interface, the TPT job fails.
  • When using the TDCH-TPT interface to process Hadoop files and tables, multiple instances of the DataConnector operator are not supported. If multiple instances of the DataConnector operator are defined, the TPT job fails (see the sketch after this list).
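
As a sketch of how the single-instance restriction surfaces in a script, the APPLY statement below reads through exactly one instance of the hypothetical TDCH_READER operator shown earlier; requesting additional instances (for example, TDCH_READER[2]) would cause the job to fail. The consumer operator LOAD_OPERATOR and the target table are also hypothetical.

    APPLY
        ('INSERT INTO target_table (col1, col2) VALUES (:col1, :col2);')
    TO OPERATOR (LOAD_OPERATOR[1])           /* consumer definition not shown */
    SELECT * FROM OPERATOR (TDCH_READER[1]); /* exactly one DataConnector instance */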

Debugging the TDCH-TPT Interface

If a failure occurs during TDCH job setup or during data transfer between TDCH and TPT, the TDCH log is available in the TPT logs directory. The name of the TDCH log is "TDCH-TPT_log_<job-id>.txt", where <job-id> is the job's process ID. For more detailed information, see the MapReduce logs, which are available through the JobTracker's web interface or in the userlogs directory of the Hadoop installation.