About Copying Tables with Hadoop

Teradata Data Mover User Guide

brand
Analytical Ecosystem
prodname
Teradata Data Mover
vrm_release
16.10
category
User Guide
featnum
B035-4101-067K

Data Mover supports copying data between Teradata Database and Hadoop using TDCH or QueryGrid. Copying data between Teradata and Hadoop is similar to copying data between Teradata systems, with the exception that if a user is copying from Teradata to Hadoop, overwrite_existing_objects is set to false, and the target table exists, Data Mover appends the partial data to the existing Hive table and does not delete any rows from the table. The DELETE/INSERT, UPDATE, and MERGE functions are not used.

Both QueryGrid and TDCH support copying partial tables between Teradata and Hadoop. QueryGrid supports copying partial tables from Teradata to Hadoop and from Hadoop to Teradata. TDCH supports copying partial tables from Teradata to Hadoop, but not from Hadoop to Teradata. When TDCH is used to copy a partial table from Hadoop to Teradata, Data Mover copies the entire table into a staging table on the Teradata Database and then runs the where clause from the staging table to the target table.

Usage Notes

  • TDCH is automatically selected when a foreign server is not provided.
  • Data Mover only supports Hive on Hadoop.
  • The Hive user must have read and write permissions on the Hadoop objects - incorrect permissions will cause the job to fail.
  • When using QueryGrid, the Hive user specified in the Data Mover job definition and on the foreign server must have the same read and write permissions on the underlying Hadoop objects.

Sample Partial Copy XML Files

The following is a sample Teradata-to-Hadoop job_definition partial table copy using TDCH.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<dmCreate xmlns="http://schemas.teradata.com/dataMover/v2009"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://schemas.teradata.com/dataMover/v2009/DataMover.xsd">
	
	<!-- source td credentials -->
	<source_tdpid>td_database</source_tdpid>
	<source_user>username</source_user>
	<source_password>password</source_password>

	<!-- target hadoop credentials -->
	<target_hadoop_system>
		<webHCat_url>http://webhcat_host:50111</webHCat_url>
		<oozie_url>http://oozie_host:11000</oozie_url>
		<file_system_url>http://webhdfs_or_httpfs_host:50070</file_system_url>
		<hive_user>hive</hive_user>
	</target_hadoop_system>
	<!--  true to truncate table on hive, false to append the data to the table on hive-->
	<overwrite_existing_objects>false</overwrite_existing_objects>
	<!--  target database on hive -->
	<target_database>
		<name>target_database</name>
	</target_database>

	<!-- object selection. in this case, we are moving source table on
		(td) to default target_database on hive -->
	<database selection="unselected">
		<name>source_database</name>
		<table selection="included">
			<name>source_table</name>
			<target_name>target_table_optional</target_name>
			<!--  where clause to select partial table on source -->
			<sql_where_clause> <![CDATA[WHERE columnValue <= num ]]></sql_where_clause>
		</table>
	</database>
</dmCreate>
The following is a sample Hadoop-to-Teradata job_definition partial table copy using QueryGrid.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<dmCreate xmlns="http://schemas.teradata.com/dataMover/v2009"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://schemas.teradata.com/dataMover/v2009/DataMover.xsd">

	<!-- source hadoop credentials -->
	<source_hadoop_system>
		<webHCat_url>http://webhcat_host:50111</webHCat_url>
		<oozie_url>http://oozie_host:11000</oozie_url>
		<file_system_url>http://webhdfs_or_httpfs_host:50070</file_system_url>
		<hive_user>hive</hive_user>
	</source_hadoop_system>

	<!-- target td credentials -->
	<target_tdpid>td_database</target_tdpid>
	<target_user>username</target_user>
	<target_password>password</target_password>

	<!--  target database on hive -->
	<target_database>
		<name>target_database</name>
	</target_database>
	<use_foreign_server>
		<name>querygrid_foreign_server_name</name>
	</use_foreign_server>

	<!-- object selection. in this case, we are moving source table on
		(td) to default target_database on hive -->
	<database selection="unselected">
		<name>source_hive_database</name>
		<table selection="included">
			<name>source_table</name>
			<target_name>target_table_optional</target_name>
			<!--  where clause to select partial table on source -->
			<sql_where_clause> <![CDATA[WHERE columnValue <= num ]]></sql_where_clause>
		</table>
	</database>
</dmCreate>