Processing Hadoop Files and Tables
In addition to reading from and writing to flat files and access modules, the DataConnector
operator can also read and write Hadoop files and tables. The set of attributes submitted
with the DataConnector operator determines which of two Hadoop interfaces is used.
HDFS API Interface
When the attribute HadoopHost is specified with the DataConnector operator, the operator
will use the HDFS API interface to process Hadoop files.
The TPT HDFS interface can be invoked on the cluster or on a remote Hadoop client.
The HadoopHost attribute identifies the cluster where the HDFS operation is to be performed,
and its presence activates the HDFS operation. Once the HDFS operation is activated, all
standard file functions and features of the DataConnector operator are performed on the
HDFS file system.
TDCH-TPT Interface
When any Hadoop attributes besides HadoopHost and HadoopUser are submitted with the
DataConnector operator, the operator will use the TDCH-TPT interface to process Hadoop
files and tables. Throughout the rest of this section, these Hadoop attributes will
be referred to as TDCH-specific Hadoop attributes.
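The selection rule described above can be sketched in Python as follows. This is an
illustration only, not part of TPT: the function name select_hadoop_interface is hypothetical,
and the sketch assumes that a Hadoop attribute is any attribute whose name begins with
"Hadoop" and that TDCH-specific attributes take precedence when HadoopHost is also present.

```python
# Attributes that, per the text above, do not by themselves select TDCH-TPT.
NON_TDCH_HADOOP_ATTRIBUTES = {"HadoopHost", "HadoopUser"}


def select_hadoop_interface(attribute_names):
    """Return "TDCH-TPT", "HDFS API", or None for a collection of attribute names."""
    names = set(attribute_names)
    # Assumption: a Hadoop attribute is one whose name starts with "Hadoop".
    hadoop_names = {name for name in names if name.startswith("Hadoop")}
    tdch_specific = hadoop_names - NON_TDCH_HADOOP_ATTRIBUTES
    # Assumption: TDCH-specific attributes win even if HadoopHost is also given.
    if tdch_specific:
        return "TDCH-TPT"
    if "HadoopHost" in names:
        return "HDFS API"
    # No Hadoop interface is selected; ordinary file processing applies.
    return None
```

For example, a job that supplies only HadoopHost and FileName would map to the HDFS API
interface under these assumptions, while one that also supplies HadoopSourcePaths would
map to the TDCH-TPT interface.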
Teradata Connector for Hadoop
The Teradata Connector for Hadoop, or TDCH, is a set of APIs and tools that support
high-performance parallel bi-directional data movement between Teradata systems and
products in the Hadoop ecosystem. TDCH is built atop the MapReduce framework, and
utilizes its distributed nature to offer extreme scalability and performance when
transferring data between Teradata systems and Hadoop. For more information about
TDCH, see the Teradata Connector for Hadoop tutorial.
Utilizing TDCH in TPT Scripts via the TDCH-TPT Interface
The TDCH-TPT interface is a bridge between TPT and TDCH. It extends TDCH to support
transfers of Hadoop files and tables to TPT, and vice versa. This interface lets TPT
users use all existing TDCH functionality within a TPT script, and lets TDCH users use
TPT-specific functionality alongside TDCH.
When a TPT job script includes the DataConnector operator alongside any of the TDCH-specific
Hadoop attributes, the DataConnector operator will launch a TDCH job using those TDCH-specific
Hadoop attributes supplied in the TPT script. Once TDCH has validated the attribute
values and filled in defaults for any missing attributes, TDCH will submit the job
to the MapReduce framework. Once the map tasks have been initialized on the nodes
in the Hadoop cluster, they will connect to the DataConnector operator and begin transferring
data.
Limitations of the TDCH-TPT Interface
To utilize the TDCH-TPT interface, the node on which TPT is running must have the
Hadoop client jars installed, as the DataConnector operator must be able to launch
a MapReduce job via a call to the Hadoop CLI.
The TDCH-TPT interface is only supported on the Linux platform.
Because the DataConnector operator relies on TDCH to read from and write to Hadoop
files and tables, many of the traditional DataConnector operator attributes are not
supported alongside the TDCH-TPT interface. For example, when using the DataConnector
producer, the FileName attribute is superseded by the TDCH-specific HadoopSourcePaths
attribute. Similarly, the Format attribute is superseded by the TDCH-specific HadoopFileFormat
attribute. If an unsupported attribute is submitted alongside TDCH-specific Hadoop
attributes, the TPT job will fail.
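As an illustration of this limitation, the following Python sketch checks a set of attribute
names before a job is submitted. It is not part of TPT: the function name
find_superseded_attributes is hypothetical, the mapping contains only the two producer
examples named above (the full list of unsupported attributes is not given in this section),
and the sketch assumes that a TDCH-specific attribute is any attribute whose name begins
with "Hadoop" other than HadoopHost and HadoopUser.

```python
# Traditional attributes named in the text above, mapped to the TDCH-specific
# attributes that supersede them for the DataConnector producer.
# This is not a complete list of unsupported attributes.
SUPERSEDED_BY_TDCH = {
    "FileName": "HadoopSourcePaths",
    "Format": "HadoopFileFormat",
}

NON_TDCH_HADOOP_ATTRIBUTES = {"HadoopHost", "HadoopUser"}


def find_superseded_attributes(attribute_names):
    """Return sorted (traditional, replacement) pairs that would conflict with TDCH."""
    names = set(attribute_names)
    # Assumption: TDCH-specific attributes start with "Hadoop" and are not
    # HadoopHost or HadoopUser.
    uses_tdch = any(
        name.startswith("Hadoop") and name not in NON_TDCH_HADOOP_ATTRIBUTES
        for name in names
    )
    if not uses_tdch:
        return []
    return sorted(
        (old, new) for old, new in SUPERSEDED_BY_TDCH.items() if old in names
    )
```

A non-empty result indicates attributes that should be replaced by their TDCH-specific
counterparts before the job is run.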
Because TDCH is a batch-processing tool, many of the DataConnector operator's active
data warehousing features are not supported when using the TDCH-TPT interface. For
example, because the MapReduce job processes data out of order, the DataConnector operator's
checkpoint/restart feature is unavailable when using the TDCH-TPT interface. Similarly,
because the TDCH job requires a single file or table name as an argument, the DataConnector
operator's directory scan feature is unavailable when using the TDCH-TPT interface.
If an unsupported feature is used alongside the TDCH-TPT interface, the TPT job
will fail.
When using the TDCH-TPT interface to process Hadoop files and tables, multiple instances
of the DataConnector operator are not supported. If multiple instances of the DataConnector
operator are defined, the TPT job will fail.
Debugging the TDCH-TPT Interface
If a failure occurs during TDCH job setup or during data transfer between TDCH and TPT,
the TDCH log is available in the TPT logs directory. The TDCH log is named
"TDCH-TPT_log_<job-id>.txt", where "<job-id>" is the job's process ID. For more information,
view the MapReduce logs through the JobTracker's web interface or in the Hadoop
installation's userlogs directory.
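When scripting log collection, the TDCH log path can be built from the naming pattern
given above. The Python sketch below is illustrative only: the function name tdch_log_path
is hypothetical, and the location of the TPT logs directory is not stated in this section,
so it is passed in as a parameter.

```python
from pathlib import Path


def tdch_log_path(tpt_logs_dir, job_id):
    """Build the expected TDCH log path for a job's process ID."""
    # File name pattern taken from the text above: TDCH-TPT_log_<job-id>.txt
    # tpt_logs_dir is an assumption supplied by the caller.
    return Path(tpt_logs_dir) / f"TDCH-TPT_log_{job_id}.txt"
```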