Hadoop Support

Teradata Data Mover User Guide

Product: Teradata Data Mover
Release Number: 16.00
Published: December 2016
Language: English (United States)
Last Update: 2018-03-29
Product Category: Analytical Ecosystem

Data Mover supports copying tables between Teradata Database and Hadoop.

Hadoop Components

Data Mover requires the following Hadoop services:

Hive
Service for accessing data using a SQL-like language. Apache Hive provides data summarization and ad hoc querying. Data Mover supports moving tables from Teradata to Hive and from Hive to Teradata.
HttpFS and WebHDFS
REST services that provide an HTTP gateway supporting all HDFS file system operations.
WebHCAT
REST API that provides data integration services for HCatalog; it must be enabled. Data Mover uses it to gather Hive table DDL.
Oozie
Workflow scheduler system used to manage, submit, and monitor Hadoop jobs. Data Mover uses this service to execute TDCH jobs.

Utilities

Data Mover can use either TDCH or QueryGrid T2H (known as SQL-H prior to Teradata Database 15.00) to move data between Teradata and Hadoop. If both TDCH and T2H are available, Data Mover uses T2H.

TDCH
Command-line connector that provides APIs and tools supporting high-performance, parallel, bi-directional data movement between Teradata systems and the Hadoop ecosystem of products; a representative invocation appears after this list. For more information, see Teradata Connector for Hadoop on Teradata Developer Exchange.
T2H
QueryGrid T2H is Teradata software that enables import and export of data between Teradata and Hadoop. T2H is a separate add-on package, available as T2H for Teradata Database 15.00 and later and as SQL-H for Teradata Database 14.10.
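
For illustration only, a standalone TDCH invocation resembles the following sketch. Data Mover itself submits TDCH jobs through Oozie rather than from a shell, and the jar path, tool class, and options vary by TDCH version; the host, database, and table names here are hypothetical:

    # Import a Teradata table into an existing Hive table (illustrative values)
    hadoop jar /usr/lib/tdch/lib/teradata-connector.jar \
        com.teradata.connector.common.tool.ConnectorImportTool \
        -url jdbc:teradata://tdpid/database=sales \
        -username dbc -password dbc \
        -jobtype hive -fileformat textfile \
        -sourcetable orders -targettable orders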

For Teradata Database 15.00 or later, you must specify a foreign server in the job definition to use T2H to move data between Teradata and Hadoop. If you do not specify a foreign server, Data Mover uses TDCH.
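
As a sketch, a foreign server for Hadoop is defined on the Teradata system with SQL along the following lines. The name-value pairs and the table operator names (shown here generically as SYSLIB.load_from_hcatalog and SYSLIB.load_to_hcatalog) are assumptions that depend on the QueryGrid release, and the host names are hypothetical; see Teradata QueryGrid: Teradata Database-to-Hadoop User Guide, B035-1185, for the exact syntax.

    CREATE FOREIGN SERVER hdp_server
    USING
      hosttype('hadoop')
      server('hadoop-master.example.com')     /* Hive metastore host */
      port('9083')                            /* Hive metastore port */
      hiveserver('hadoop-master.example.com') /* HiveServer2 host */
      hiveport('10000')                       /* HiveServer2 port */
    DO IMPORT WITH SYSLIB.load_from_hcatalog,
    DO EXPORT WITH SYSLIB.load_to_hcatalog;

A Data Mover job definition then references this server by name, which directs Data Mover to use T2H rather than TDCH.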

SQL-H support for CDH and HDP is for Teradata Database 14.10 only.

JSON Support

If the target table in Teradata Database has a JSON column type, you can copy data from Hadoop to Teradata Database only if the following conditions are met:
  • T2H is used to copy data from Hadoop to Teradata.
  • A foreign server has been specified in the job definition.

Data Mover supports moving tables with JSON columns that use the BSON or UBJSON storage format when both systems are running Teradata Database 15.10 or later. When moving from Teradata Database 15.00 to Teradata Database 15.10 or later, the default text storage format is used for JSON columns.
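
For example, a Teradata Database 15.10 target table with a BSON-format JSON column might be defined as follows; the database, table, and column names are illustrative:

    CREATE TABLE sales.json_events (
      event_id INTEGER,
      payload JSON(1000000) STORAGE FORMAT BSON  -- or UBJSON; omit the clause for the default text format
    );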

Restrictions
  • You cannot use TDCH to copy data to a JSON column type.
  • Data Mover does not support copying a JSON column type from Teradata Database to Hadoop.

Security

You can copy tables between Teradata and Hadoop when the following security requirements are met:
  • The Hive user specified in the Data Mover job must have read, write, and execute permissions on the Hive table's underlying files and folders in HDFS.
  • Oozie must be able to impersonate the Hive user to execute the TDCH job (see the sketch after this list).
  • QueryGrid T2H must be able to impersonate the Hive user to execute the T2H job. For more information, see Teradata QueryGrid: Teradata Database-to-Hadoop User Guide, B035-1185.
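
As an illustration, Oozie impersonation on a Hadoop cluster is typically granted through proxy-user settings in core-site.xml; the wildcard values below are a sketch and should be narrowed to your Oozie server hosts and the groups of permitted Hive users:

    <!-- core-site.xml: allow the oozie service user to impersonate end users -->
    <property>
      <name>hadoop.proxyuser.oozie.hosts</name>
      <value>*</value>
    </property>
    <property>
      <name>hadoop.proxyuser.oozie.groups</name>
      <value>*</value>
    </property>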

The Teradata Connector for Hadoop also supports data movement between Teradata Database and supported Hadoop systems secured with Kerberos. For more information, contact Teradata Customer Services.

Object Support Restrictions

  • Data Mover can copy only tables between Teradata and Hive. You cannot copy other objects, such as views, indexes, or an entire database, between Teradata and Hive.
  • If you specify column information for a table, such as name or type, and the target table already exists, Data Mover uses the column information to compare the DDL of the source and target tables. If the target table does not exist, Data Mover uses the column information to create the target table. If you do not specify column information, the source table metadata is used for column names and types. Data Mover does not drop a target table for Hadoop support. For more information, see Objects Supported During Moves Between Databases.
  • For more information about data types that are supported or restricted, see Data Type Mappings.

Throttling

Data Mover allows you to limit the number of Hadoop connector and T2H tasks that run concurrently. Each table moved using the Hadoop connector or T2H is encapsulated as a single task.

To limit the number of Hadoop connector tasks that run concurrently, set a value for the hadoop.connector.max.task.slot property using the save_configuration command. To limit the number of T2H tasks that run concurrently, set a value for the sqlh.max.task.slot property using the same command.

The default value for hadoop.connector.max.task.slot and sqlh.max.task.slot is 2.
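
As a sketch, raising both limits to four might look like the following; the XML element layout and the -f flag are assumptions, so check the Data Mover command reference for the exact configuration file format and command usage:

    <!-- configuration.xml excerpt (layout illustrative) -->
    <property>
      <key>hadoop.connector.max.task.slot</key>
      <value>4</value>
    </property>
    <property>
      <key>sqlh.max.task.slot</key>
      <value>4</value>
    </property>

    # apply the edited configuration (flag illustrative)
    datamove save_configuration -f configuration.xml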

Known Issues

  • If Data Mover cannot retrieve metadata from WebHCAT and error 143 appears in dmDaemon.log, increase the timeout value for the WebHCAT parameter templeton.exec.timeout (see the sketch after this list).
  • If the number of Oozie jobs executing concurrently is higher than the number of mappers in the Hadoop cluster, the Oozie jobs can deadlock. The workaround is to keep Oozie jobs from starting concurrently by adding a wait time between them. The hadoop.task.throttle property, located in agent.properties, adds a wait time (in seconds) between two jobs; use it if you encounter Oozie-related deadlocks (see the sketch after this list).
  • Tune T2H on Teradata Database to ensure proper handling of concurrent T2H queries. If the number of concurrent T2H queries exceeds what is configured in Teradata Database, T2H might become unstable and cause errors such as a PermGen exception or a hanging T2H query, which would require a restart of the UDF server process. You can throttle the maximum number of T2H queries that Data Mover runs concurrently using the sqlh.max.task.slot property in the Data Mover configuration.
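
The following sketches show where the two workaround settings described above live; the values are examples only:

    <!-- webhcat-site.xml: raise the WebHCAT execution timeout (milliseconds) -->
    <property>
      <name>templeton.exec.timeout</name>
      <value>60000</value>
    </property>

    # agent.properties: wait 30 seconds between consecutive Oozie job submissions
    hadoop.task.throttle=30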