Active Directory Scan: Continuous Loading of Transactional Data

Teradata Parallel Transporter User Guide

Product

Parallel Transporter

Release Number

16.10

Published

May 2017

Language

English (United States)

Last Update

2018-05-09

dita:id

B035-2445

Product Category

Teradata Tools and Utilities

Transactional data is collected and stored in client directories. The “active directory scan” feature of the Data Connector operator collects data from these directories continuously. You define the time interval between directory scans, and a start time and stop time for the scan job as a whole.

All files present in the source directories that meet the user-specified file name criteria (which include “wildcard” specifications) are processed by the Data Connector operator. Whenever the defined scan interval expires, the Data Connector operator scans the directory and looks for new files that have entered the directory since the last scan. It then reads the rows from each of the files collected and sends them to the consumer operator, which is usually the Stream operator, for purposes of continuous loading. If no new files are found during the directory scan, the Data Connector operator waits for the defined interval to expire before scanning the directory again.
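
The scan cycle described above can be modeled in a few lines of Python. This is only an illustrative sketch, not Teradata PT code: the function names `scan_new_files` and `read_rows`, the file names, and the in-memory set used to remember earlier scans are assumptions made for the example, and the operator's actual bookkeeping may differ.

```python
import fnmatch
import os
import tempfile


def scan_new_files(directory, pattern, seen):
    """Return matching file names that no earlier scan has returned."""
    new_names = sorted(
        name
        for name in os.listdir(directory)
        if name not in seen
        and fnmatch.fnmatchcase(name, pattern)
        and os.path.isfile(os.path.join(directory, name))
    )
    seen.update(new_names)  # remember them so the next scan skips them
    return new_names


def read_rows(directory, names):
    """Yield each line of each file as one row, in file order."""
    for name in names:
        with open(os.path.join(directory, name), encoding="utf-8") as handle:
            for line in handle:
                yield line.rstrip("\n")


def demo():
    """Run three scans over a temporary directory and return what each saw."""
    seen = set()
    with tempfile.TemporaryDirectory() as directory:
        with open(os.path.join(directory, "trans_001.txt"), "w", encoding="utf-8") as handle:
            handle.write("1|10.00\n2|12.50\n")
        first = scan_new_files(directory, "trans_*", seen)
        first_rows = list(read_rows(directory, first))
        second = scan_new_files(directory, "trans_*", seen)  # nothing new yet
        with open(os.path.join(directory, "trans_002.txt"), "w", encoding="utf-8") as handle:
            handle.write("3|7.25\n")
        third = scan_new_files(directory, "trans_*", seen)
    return first, first_rows, second, third
```

In `demo`, the second scan finds nothing because no file has arrived since the first scan, and the third scan returns only the file that arrived in between. In a real job, the wait for the scan interval would sit between those calls.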

Strategy

Consider the following when setting up a job for Active Directory Scan:

Specify the attribute names and values for the standard attributes required for the DataConnector operator: FileName, Format, IndicatorMode (where required), and TextDelimiter (required if Format is “delimited”).
For information on use of these standard attributes, see the chapter on the DataConnector operator in Teradata Parallel Transporter Reference (B035-2436).
Use the wildcard character ( * ) for the FileName attribute according to one of the following strategies:
- Specify “*” to instruct the DataConnector operator to scan and extract data from all files in the directory.
- Specify “abc.*” to instruct the DataConnector operator to scan for all files in the directory whose names begin with the specified character string (here, “abc.”).
Specify the directory to be scanned using the DirectoryPath attribute, in the form:
```
DirectoryPath=<PathName>
```
Use the ArchiveDirectoryPath attribute to specify the path for the archive directory. After the files in the scanned directory have been read and their data has been extracted, they are moved from the directory identified in the DirectoryPath attribute to the directory specified in the ArchiveDirectoryPath attribute, which keeps them from being scanned again.
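
To see how the two wildcard strategies and the archive directory interact, the following Python sketch models them. It is not TPT syntax. It assumes shell-style wildcard matching (Python's `fnmatch`), which may not agree with the operator's matching rules in every detail, and the helper names and the `incoming` and `archive` directory names are hypothetical.

```python
import fnmatch
import os
import shutil
import tempfile


def matching_names(names, pattern):
    """Return the names that match a shell-style wildcard pattern, sorted."""
    return sorted(name for name in names if fnmatch.fnmatchcase(name, pattern))


def extract_and_archive(source_dir, archive_dir, pattern):
    """Move each matching file to archive_dir; return the names moved."""
    os.makedirs(archive_dir, exist_ok=True)
    moved = []
    for name in matching_names(os.listdir(source_dir), pattern):
        source_path = os.path.join(source_dir, name)
        if not os.path.isfile(source_path):
            continue  # skip subdirectories
        # A real job would read the file's rows here, before the move.
        shutil.move(source_path, os.path.join(archive_dir, name))
        moved.append(name)
    return moved


def demo():
    """Archive the files matching 'abc.*' twice and report what happened."""
    names = ["abc.dat", "abc.txt", "abcd.txt", "xyz.txt"]
    with tempfile.TemporaryDirectory() as root:
        source_dir = os.path.join(root, "incoming")
        archive_dir = os.path.join(root, "archive")
        os.makedirs(source_dir)
        for name in names:
            with open(os.path.join(source_dir, name), "w", encoding="utf-8") as handle:
                handle.write("row\n")
        first_pass = extract_and_archive(source_dir, archive_dir, "abc.*")
        second_pass = extract_and_archive(source_dir, archive_dir, "abc.*")
        left_behind = sorted(os.listdir(source_dir))
    return first_pass, second_pass, left_behind
```

Under these assumptions, the pattern “*” selects all four files, while “abc.*” selects only `abc.dat` and `abc.txt`. Because the first pass moves those two files to the archive directory, the second pass finds nothing to reprocess.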

Use the DataConnector Vigil attributes to set up the time constraints for the directory scan, as follows:

Attribute	Setup Requirements
VigilStartTime	Required. Specifies the start time for the initial directory scan.
VigilStopTime	Specifies the time after which no more scans will begin. Any scan that begins before the stop time will run to completion. This attribute is interchangeable with the VigilElapsedTime attribute. Using one of these two attributes is required.
VigilWaitTime	Specifies the time in seconds between the beginning of one scan and the beginning of the next scan.
VigilElapsedTime	Specifies the total time in minutes the job will scan the directory for new files in intervals defined by VigilWaitTime. Any scan that starts before the end of the specified elapsed time will run to completion.
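
The relationship between the Vigil attributes is easier to check with a small calculation. The Python sketch below lists when scans would begin for a given start time, wait time (in seconds), and either an elapsed time (in minutes) or a stop time. It is a model under stated assumptions, not operator code: the function name `scan_start_times` is hypothetical, each scan is assumed to finish before the next one is due, and a scan that falls exactly on the end boundary is assumed not to start.

```python
from datetime import datetime, timedelta


def scan_start_times(start, wait_seconds, elapsed_minutes=None, stop=None):
    """List the times at which scans begin.

    start           -- datetime of the first scan (VigilStartTime)
    wait_seconds    -- seconds between scan starts (VigilWaitTime)
    elapsed_minutes -- total scanning window in minutes (VigilElapsedTime)
    stop            -- datetime after which no scan begins (VigilStopTime)
    Exactly one of elapsed_minutes and stop must be given.
    """
    if (elapsed_minutes is None) == (stop is None):
        raise ValueError("give exactly one of elapsed_minutes or stop")
    if wait_seconds <= 0:
        raise ValueError("wait_seconds must be positive")
    end = stop if stop is not None else start + timedelta(minutes=elapsed_minutes)
    times = []
    current = start
    while current < end:  # assumption: a scan due exactly at the end does not start
        times.append(current)
        current += timedelta(seconds=wait_seconds)
    return times


start = datetime(2017, 5, 1, 8, 0, 0)
by_elapsed = scan_start_times(start, wait_seconds=30, elapsed_minutes=2)
by_stop = scan_start_times(start, wait_seconds=30, stop=datetime(2017, 5, 1, 8, 2, 0))
```

With a 30-second wait and a 2-minute window, four scans begin (0, 30, 60, and 90 seconds after the start time). Expressing the same window as a stop time two minutes after the start gives the same schedule, which is the sense in which the two attributes are interchangeable.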

For required syntax and detailed descriptions for all DataConnector attributes, see the Teradata Parallel Transporter Reference (B035-2436).

Active Directory Scan Options

The following options are available to further customize an Active Directory Scan.

Use several DataConnectors operating in parallel to monitor multiple data sources.
Use multiple instances of the Stream operator to INSERT data into a Teradata Database table at an optimal rate.
Important optional attributes:
- Specify the VigilSortField attribute and set it to TIME to sort files according to the time they were last modified.
- Specify the VigilNoticeFileName attribute with a file name, so that when the scan file is updated with new data, a notification will be placed in that file.
- Specify VigilMaxFiles to define the maximum number of files that can be scanned in one pass.
Multiple schemas:
When the data from the sources are not all described by UNION-compatible schemas, use column selection, derived columns, or both in the SELECT clauses of the APPLY statement to put UNION-compatible data on the output data streams.
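
The idea behind UNION-compatible output can be illustrated outside of TPT. The Python sketch below is a conceptual model only and does not show APPLY or SELECT syntax. The two sources, their column names, and the helper names `project` and `union_all` are all hypothetical. One source is reshaped by column selection alone, the other also needs derived columns, and both end up producing rows with the same columns in the same order.

```python
TARGET_COLUMNS = ("txn_id", "amount", "channel")


def project(row, mapping):
    """Build one output row in TARGET_COLUMNS order.

    mapping maps each target column to a source column name (column
    selection) or to a function of the row (a derived column).
    """
    values = []
    for column in TARGET_COLUMNS:
        source = mapping[column]
        values.append(source(row) if callable(source) else row[source])
    return tuple(values)


# Source 1 already carries every target column, plus extras that are dropped.
STORE_MAPPING = {"txn_id": "id", "amount": "amount", "channel": "channel"}

# Source 2 lacks two target columns, so they are derived.
WEB_MAPPING = {
    "txn_id": "order_id",
    "amount": lambda row: row["total_cents"] / 100,  # derived from another column
    "channel": lambda row: "web",                    # derived constant
}


def union_all(store_rows, web_rows):
    """Concatenate both sources after projecting them to the same shape."""
    return [project(row, STORE_MAPPING) for row in store_rows] + [
        project(row, WEB_MAPPING) for row in web_rows
    ]
```

Calling `union_all` with one store row and one web row returns two tuples of the same shape, even though the inputs had different columns.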

For a typical application of Active Directory Scan, see Job Example 9: Active Directory Scan.

For the sample script that corresponds to this job, see the following script in the sample/userguide directory:

PTS00015: Active Directory Scan.

The Active Directory Scan functionality is supported when using the HDFS API interface to process Hadoop files, but is not supported when using the TDCH-TPT interface to process Hadoop files and tables. For more information, see “Processing Hadoop Files and Tables” in the “DataConnector Operator” section of the Teradata Parallel Transporter Reference (B035-2436).