Hadoop Support | Teradata Data Mover

Teradata® Data Mover User Guide

Product: Teradata Data Mover
Release Number: 16.20
Published: November 2021
Language: English (United States)
Last Update: 2021-11-04
dita:mapPath: wph1512683331242.ditamap
dita:ditavalPath: 4101_UG_dm_1620.ditaval
dita:id: B035-4101
lifecycle: previous
Product Category: Analytical Ecosystem

Data Mover supports copying tables and data returned by views between Teradata Database and Hadoop. You can copy view data to Hadoop tables, but copying view definitions to Hadoop is not supported.

Hadoop Components

Data Mover requires the following Hadoop services:
  • Hive: Service for accessing data using a SQL-like language. Apache Hive provides data summarization and ad hoc querying. Data Mover supports movement of tables from Teradata to Hive and from Hive to Teradata, using HiveServer2 to run queries. Only the binary value of the hive.server2.transport.mode property is supported.
  • HttpFS and WebHDFS: REST services that provide an HTTP gateway supporting all HDFS file system operations.
  • WebHCAT: REST API data integration service that must be enabled for HCatalog. Data Mover uses it to gather Hive table DDL.
  • Oozie: Workflow scheduler system to manage, submit, and monitor Hadoop jobs. Data Mover uses this service to run TDCH jobs.
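For reference, HiveServer2's transport mode is set in hive-site.xml. A minimal fragment pinning the supported value might look like the following (illustrative only; consult your Hadoop distribution's documentation before changing cluster configuration):

```xml
<!-- hive-site.xml (illustrative): Data Mover requires the binary
     transport mode for HiveServer2; "http" mode is not supported. -->
<property>
  <name>hive.server2.transport.mode</name>
  <value>binary</value>
</property>
```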

Utilities

Data Mover can use either TDCH or QueryGrid to move data between Teradata and Hadoop.
  • TDCH: Command-line connector providing a set of APIs and tools that support high-performance, parallel, bidirectional data movement between Teradata systems and the Hadoop ecosystem of products. For more information, see Teradata Connector for Hadoop on Teradata Developer Exchange.
  • QueryGrid: Teradata software that enables importing and exporting data between Teradata and Hadoop. In Data Mover, QueryGrid T2H must be specified as a foreign server to be considered available for use. T2H supports QueryGrid 2.x Teradata Hadoop connectors.

If both TDCH and QueryGrid are available, Data Mover uses QueryGrid. If no foreign server is specified, Data Mover uses TDCH.

When copying view data to Hadoop tables, Data Mover uses QueryGrid. TDCH does not support copying view data.
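The selection rules above can be sketched as a small decision function. This is a hypothetical illustration of the documented behavior, not Data Mover's actual internal logic; the function and parameter names are invented:

```python
def choose_utility(foreign_server_specified: bool, copying_view_data: bool) -> str:
    """Sketch of Data Mover's documented utility-selection rules.

    Rules, per the documentation:
      - View data is always copied with QueryGrid (TDCH cannot copy view data).
      - If a foreign server is specified, QueryGrid is preferred.
      - Otherwise, Data Mover falls back to TDCH.
    """
    if copying_view_data:
        return "QueryGrid"   # TDCH does not support copying view data
    if foreign_server_specified:
        return "QueryGrid"   # QueryGrid is used when both utilities are available
    return "TDCH"            # default when no foreign server is specified
```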

JSON Support

If the target table in Teradata has a JSON column type, you can copy data from Hadoop to Teradata Database only if the following conditions are met:
  • T2H is used to copy data from Hadoop to Teradata.
  • A foreign server has been specified in the job definition.
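The two conditions above can be expressed as a simple pre-check. This is a hypothetical helper for illustration only; the names are invented and no such API exists in Data Mover:

```python
def can_copy_json_to_teradata(uses_t2h: bool, foreign_server_specified: bool) -> bool:
    """Both documented conditions must hold to copy Hadoop data into a
    Teradata table that has a JSON column type."""
    return uses_t2h and foreign_server_specified
```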

Data Mover supports moving tables with JSON columns using the BSON or UBJSON storage format when moving between Teradata Database systems running version 15.10 or later. When moving from Teradata Database 15.0 to Teradata Database 15.10 or later, the default text storage format is used for JSON columns.

Restrictions
  • You cannot use TDCH to copy data to a JSON column type.
  • Data Mover does not support copying a JSON column type from Teradata Database to Hadoop.

Security

You can copy tables between Teradata and Hadoop when the following requirements are met:
  • The Hive user specified in the Data Mover job must have read/write/execute permissions on the Hive table's underlying files and folder in HDFS.
  • Oozie must be able to impersonate the Hive user to run the TDCH job.
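Oozie impersonation is typically enabled through Hadoop's proxy-user settings in core-site.xml. The fragment below is illustrative; the wildcard values are permissive examples and should be restricted to specific hosts and groups in production:

```xml
<!-- core-site.xml (illustrative): allow the Oozie service user to
     impersonate other users, such as the Hive user named in the
     Data Mover job definition. -->
<property>
  <name>hadoop.proxyuser.oozie.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.oozie.groups</name>
  <value>*</value>
</property>
```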

The Teradata Connector for Hadoop also provides for data movement between Teradata Database and supported Hadoop systems secured with Kerberos. For more information, consult with Teradata Customer Services.

Object Support Restrictions

  • Data Mover allows only tables and data returned by views to be copied between Teradata and Hadoop. You cannot copy other objects, such as views, indexes, or an entire database, between Teradata and Hadoop.
  • If a user specifies column information for a table, such as name or type, and the target table already exists, Data Mover uses that information to compare the DDL of the source and target tables. If the target table does not exist, Data Mover uses the column information to create the target table. If you do not specify column information, the source table metadata supplies the column names and types. Data Mover does not drop a target table for Hadoop support. For more information, see Objects Supported During Moves Between Databases.
  • For more information about data types that are supported or restricted, see Data Type Mappings.

Throttling

Data Mover allows you to limit the number of Hadoop Connector and T2H tasks that run concurrently. Each table moved using the Hadoop Connector or T2H is encapsulated as a single task of the corresponding type.

You can limit the number of Hadoop Connector tasks that run concurrently by setting the hadoop.connector.max.task.slot property, and the number of T2H tasks by setting the sqlh.max.task.slot property. Both properties are set using the save_configuration command.

The default value for hadoop.connector.max.task.slot and sqlh.max.task.slot is 2.
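Conceptually, a task-slot limit behaves like a counting semaphore: with the default of 2 slots, at most two tasks run at once and any additional tasks wait for a free slot. The sketch below illustrates that behavior in Python; it is not Data Mover code, and the names are invented:

```python
import threading
import time

MAX_TASK_SLOTS = 2  # default for hadoop.connector.max.task.slot / sqlh.max.task.slot
slots = threading.BoundedSemaphore(MAX_TASK_SLOTS)
lock = threading.Lock()
running = 0   # tasks currently holding a slot
peak = 0      # highest observed concurrency

def run_task(name: str) -> None:
    global running, peak
    with slots:                  # blocks when all slots are taken
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(0.05)         # stand-in for moving one table
        with lock:
            running -= 1

threads = [threading.Thread(target=run_task, args=(f"table-{i}",)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# peak never exceeds MAX_TASK_SLOTS, no matter how many tasks are queued
```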

Known Issues

  • If you cannot retrieve metadata from WebHCAT and see error 143 in dmDaemon.log, increase the timeout value of the WebHCAT templeton.exec.timeout parameter.
  • If the number of Oozie jobs started concurrently is higher than the number of mappers in the Hadoop cluster, the Oozie jobs deadlock. The workaround is to prevent Oozie jobs from starting concurrently by adding a wait time between Oozie jobs. The hadoop.task.throttle property, located in agent.properties, adds a wait time (in seconds) between two jobs. Use this property if you encounter deadlock issues related to Oozie.
  • Tune T2H on Teradata Database to ensure proper handling of concurrent T2H queries. If the number of concurrent T2H queries exceeds the number configured in Teradata Database, T2H might become unstable and cause errors such as a PermGen exception or a stopped T2H query, which requires a restart of the UDF server process. You can throttle the maximum number of T2H queries that Data Mover runs concurrently using the sqlh.max.task.slot property in the Data Mover configuration.
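The WebHCAT timeout mentioned above is set in webhcat-site.xml. The fragment below is illustrative; the value shown is an example, so verify the unit and default for templeton.exec.timeout against your Hadoop distribution's documentation before applying it:

```xml
<!-- webhcat-site.xml (illustrative): raise the WebHCAT exec timeout if
     metadata retrieval fails with error 143 in dmDaemon.log. -->
<property>
  <name>templeton.exec.timeout</name>
  <value>60000</value>
</property>
```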