Data Mover supports copying tables between Teradata Database and Hadoop.
Data Mover requires the following Hadoop services:
- Hive: service for accessing data using a SQL-like language. Apache Hive provides data summarization and ad hoc querying. Data Mover supports movement of tables from Teradata to Hive and from Hive to Teradata.
- HttpFS and WebHDFS: REST services that provide an HTTP gateway supporting all HDFS file system operations.
- WebHCAT: REST API data integration service that must be enabled for HCatalog. Data Mover uses it to gather Hive table DDL.
- Oozie: workflow scheduler system that manages, submits, and monitors Hadoop jobs. Data Mover uses it to execute TDCH jobs.
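The HttpFS and WebHDFS services listed above expose HDFS operations over HTTP. As a rough illustration of how those REST endpoints are addressed, the following sketch builds WebHDFS v1 URLs; the host name and warehouse path are hypothetical placeholders, and the port shown is the conventional HttpFS default rather than anything specific to Data Mover:

```python
# Sketch: constructing WebHDFS v1 REST URLs for HDFS operations.
# Host, port, and paths below are illustrative placeholders.

def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS v1 REST URL for the given HDFS path and operation."""
    query = "&".join([f"op={op}"] + [f"{k}={v}" for k, v in params.items()])
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

# For example, listing the files backing a (hypothetical) Hive table directory:
url = webhdfs_url("hadoop-master", 14000, "/user/hive/warehouse/sales", "LISTSTATUS")
print(url)
```

A client such as Data Mover issues HTTP requests against URLs of this shape to read and write the files that back Hive tables.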
Data Mover can use either TDCH or QueryGrid to move data between Teradata and Hadoop. If TDCH and QueryGrid are both available, Data Mover uses QueryGrid.
- TDCH (Teradata Connector for Hadoop): command-line connector that provides a set of APIs and tools supporting high-performance, parallel, bi-directional data movement between Teradata systems and the Hadoop ecosystem of products. For more information, see Teradata Connector for Hadoop on Teradata Developer Exchange.
- QueryGrid: Teradata software that enables importing and exporting data between Teradata and Hadoop. In Data Mover, QueryGrid T2H is available for Teradata Database 15.00 or later, and T2H supports both the QueryGrid 1.x and 2.x Teradata-to-Hadoop connectors.
For Teradata Database 15.0 or later, you must specify a foreign server to use T2H to move data between Teradata and Hadoop. If you do not specify a foreign server, Data Mover uses TDCH. Data Mover uses T2H when the following conditions are true:
- Data is being copied from Hadoop to Teradata.
- A foreign server has been specified in the job definition.
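The connector-selection rule described above can be summarized as a small decision function. This is an illustrative simplification of the documented behavior, not Data Mover's actual implementation; the function name and parameters are invented for the sketch:

```python
# Sketch of the documented connector-selection rule: QueryGrid T2H needs
# Teradata Database 15.0+ and a foreign server; otherwise TDCH is used.
# Illustrative only -- not Data Mover's actual code.

def choose_connector(td_version, foreign_server=None, querygrid_available=True):
    """Return which connector Data Mover would use for a Teradata/Hadoop copy."""
    t2h_supported = td_version >= (15, 0) and querygrid_available
    if t2h_supported and foreign_server is not None:
        return "QueryGrid T2H"
    # Without QueryGrid or a foreign server, Data Mover falls back to TDCH.
    return "TDCH"

print(choose_connector((15, 10), foreign_server="hadoop_fs"))  # QueryGrid T2H
print(choose_connector((15, 10)))                              # TDCH
```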
Data Mover supports moving tables with JSON columns that use the BSON or UBJSON storage formats when both systems are running Teradata Database 15.10 or later. When moving from Teradata Database 15.0 to Teradata Database 15.10 or later, the default text storage format is used for JSON columns.
- You cannot use TDCH to copy data to a JSON column type.
- Data Mover does not support copying a JSON column type from Teradata Database to Hadoop.
- The Hive user specified in the Data Mover job must have read/write/execute permissions on the Hive table's underlying files and folders in HDFS.
- Oozie must be able to impersonate the Hive user to execute the TDCH job.
The Teradata Connector for Hadoop also supports data movement between Teradata Database and supported Hadoop systems secured with Kerberos. For more information, consult with Teradata Customer Services.
Object Support Restrictions
- Data Mover only allows tables to be copied between Teradata and Hive. You cannot copy other objects, such as views, indexes, or an entire database, between Teradata and Hive.
- If you specify column information for a table, such as name or type, and the target table already exists, Data Mover uses that information to compare the DDL of the source and target tables. If the target table does not exist, Data Mover uses the column information to create the target table. If you do not specify column information, the source table metadata is used for column names and types. When copying to or from Hadoop, Data Mover does not drop the target table. For more information, see Objects Supported During Moves Between Databases.
- For more information about data types that are supported or restricted, see Data Type Mappings.
Data Mover allows you to limit the number of Hadoop Connector and T2H tasks that run concurrently. Each table moved using the Hadoop Connector or T2H is encapsulated as a single task.
You can limit the number of Hadoop connector tasks that run concurrently by setting a value for the hadoop.connector.max.task.slot property using the save_configuration command. You can also limit the number of T2H tasks that run concurrently by setting a value for the sqlh.max.task.slot property using the save_configuration command.
The default value for hadoop.connector.max.task.slot and sqlh.max.task.slot is 2.
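Conceptually, these task-slot properties act as a concurrency cap: each table copy occupies one slot, and additional tasks wait for a slot to free up. A minimal sketch of that behavior (a conceptual illustration using a semaphore, not Data Mover's actual implementation):

```python
# Conceptual sketch of "max task slots": at most `slots` table-copy tasks
# run at once; the rest queue. Not Data Mover's actual implementation.
import threading

class TaskSlotPool:
    def __init__(self, slots=2):       # default mirrors the documented value of 2
        self._sem = threading.Semaphore(slots)

    def run(self, task):
        with self._sem:                # blocks until a slot is available
            return task()

pool = TaskSlotPool(slots=2)
results = []
threads = [threading.Thread(target=lambda i=i: results.append(pool.run(lambda: i)))
           for i in range(5)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(results))  # [0, 1, 2, 3, 4]
```

Raising the slot count increases parallelism at the cost of more load on the Teradata and Hadoop systems.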
- If you cannot retrieve metadata from WebHCAT and see error 143 in dmDaemon.log, increase the timeout value for the WebHCAT parameter templeton.exec.timeout.
- If the number of Oozie jobs executing concurrently exceeds the number of mappers in the Hadoop cluster, the Oozie jobs can deadlock. The workaround is to prevent Oozie jobs from starting concurrently by adding a wait time between jobs. The hadoop.task.throttle property, located in agent.properties, adds a wait time (in seconds) between two consecutive jobs. Use this property if you encounter deadlock issues related to Oozie.
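The effect of the throttle property above can be pictured as a fixed delay inserted between consecutive job submissions. The following is an illustrative sketch only; the property name comes from the text, while the function and its parameters are invented for the example:

```python
# Sketch: inserting a wait between consecutive job submissions, as the
# hadoop.task.throttle property does for Oozie jobs. Illustrative only.
import time

def submit_with_throttle(jobs, throttle_seconds, submit, sleep=time.sleep):
    """Submit jobs one at a time, sleeping `throttle_seconds` between them."""
    for i, job in enumerate(jobs):
        if i > 0:
            sleep(throttle_seconds)  # stagger starts so jobs don't compete for mappers
        submit(job)

# Demonstration with recording stand-ins for submit and sleep:
submitted, waits = [], []
submit_with_throttle(["job-a", "job-b", "job-c"], 30,
                     submit=submitted.append, sleep=waits.append)
print(submitted, waits)  # ['job-a', 'job-b', 'job-c'] [30, 30]
```

Staggering submissions this way trades slower overall job startup for freedom from the mapper-starvation deadlock described above.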
- Tune T2H on Teradata Database to ensure proper handling of concurrent T2H queries. If the number of concurrent T2H queries exceeds the number configured in Teradata Database, T2H might become unstable and cause errors such as a PermGen exception or a stopped T2H query, which requires a restart of the UDF server process. You can throttle the maximum number of T2H queries that Data Mover runs concurrently using the sqlh.max.task.slot property in the Data Mover configuration.