In addition to writing flat files and interfacing with access modules, the DataConnector operator can also write to Hadoop files and tables. The following table briefly describes and compares the two interfaces the DataConnector operator can use to move data from the data stream to Hadoop files and tables.
|Interface|Description|
|---|---|
|HDFS API|Provides access to Hadoop files using the Hadoop Distributed File System Application Programming Interface (HDFS API). HDFS is a POSIX-compatible file system with some minor restrictions: it does not support updating files in place, and it supports writing files only in truncate mode or append mode. Hadoop is written in Java, and the HDFS API is a JNI-based interface that exposes the expected standard POSIX file system operations so that C/C++ programs can read and write HDFS files directly. The DataConnector Producer and Consumer operators have been updated to access the HDFS file system directly through the HDFS API. All standard DataConnector file system features are supported.|
|TDCH|Provides access to Hadoop files and tables using the Teradata Connector for Hadoop (TDCH). TDCH uses the distributed nature of the MapReduce framework to transfer large amounts of data in parallel between Hadoop files and tables and the DataConnector operator. The TDCH-TPT interface lets TPT users read and write HDFS files, Hive tables, and HCatalog tables in various Hadoop-specific formats. Because this interface relies on TDCH to read and write data, many of the traditional DataConnector attributes are not supported.|
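As a sketch of the HDFS API interface, a DataConnector Consumer that writes a delimited file into HDFS can be defined with the standard DataConnector file attributes plus `HadoopHost`. The operator name, directory path, and file name below are illustrative assumptions, not values from this section; see the Reference (B035-2436) for the authoritative attribute list.

```
DEFINE OPERATOR HDFS_FILE_WRITER           /* illustrative operator name */
TYPE DATACONNECTOR CONSUMER
SCHEMA *
ATTRIBUTES
(
  VARCHAR HadoopHost    = 'default',       /* route file I/O to HDFS via the HDFS API */
  VARCHAR DirectoryPath = '/user/tpt/out', /* illustrative HDFS directory */
  VARCHAR FileName      = 'orders.dat',    /* illustrative file name */
  VARCHAR Format        = 'Delimited',
  VARCHAR TextDelimiter = '|',
  VARCHAR OpenMode      = 'Write'          /* HDFS permits only truncate or append writes */
);
```

Because the HDFS API interface supports all standard DataConnector file system features, the definition differs from an ordinary flat-file consumer only in the `HadoopHost` attribute.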
For information, see the "Processing Hadoop Files and Tables" section in Teradata® Parallel Transporter Reference, B035-2436.
- GZIP and ZIP files are not supported with Hadoop/HDFS.
- HDFS processing can be activated simply by adding the following attribute to a DataConnector Consumer or Producer: HadoopHost = 'default'
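For example, the same attribute activates HDFS processing on the producer side. A minimal sketch of a DataConnector Producer reading a delimited HDFS file follows; the operator name, schema name, and file name are illustrative assumptions.

```
DEFINE OPERATOR HDFS_FILE_READER      /* illustrative operator name */
TYPE DATACONNECTOR PRODUCER
SCHEMA INPUT_SCHEMA                   /* illustrative schema name */
ATTRIBUTES
(
  VARCHAR FileName   = 'input.dat',   /* illustrative file name */
  VARCHAR Format     = 'Delimited',
  VARCHAR OpenMode   = 'Read',
  VARCHAR HadoopHost = 'default'      /* activates HDFS processing */
);
```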