The use of the FileName attribute varies depending on operator type, operating system, and whether the file resides in the local filesystem or in Hadoop's distributed file system. The traditional DataConnector attributes, including FileName, are used when interfacing with Hadoop via the HDFS API interface, but are not used when interfacing with Hadoop through TDCH. For more information about the DataConnector's Hadoop interfaces, see Processing Hadoop Files and Tables.
- DataConnector Producer Operator
When using the DataConnector operator as a producer to read data from files in the local file system, the wildcard character (*) is allowed in a FileName attribute if you want to process all matching files or members within a named UNIX OS directory or the z/OS partitioned dataset (PDS or PDSE). Wildcard UNIX-style “egrep” patterns are also supported when using the DataConnector operator as a producer to read Hadoop files via the HDFS API interface.
The following conditions also apply, depending on your operating system:
- On UNIX systems, the FileName attribute is limited to 255 bytes. FileName can be either the complete pathname of the file or the name of a file within a directory. If that directory is not defined in the optional DirectoryPath attribute, filename is expected to be found in the default directory. See the table in this section for examples.
- On z/OS systems, FileName can specify the fully qualified dataset name of the data file (including the member name if the dataset is a PDS or PDSE), the member name alone of a PDS or PDSE library, or 'DD:<ddname>'. If only a member name is specified for FileName, the name of the PDS or PDSE dataset containing the member must be supplied in the DirectoryPath attribute. See the table in this section for examples.
- DataConnector Consumer Operator
When using the DataConnector operator as a consumer, the FileName attribute is the complete file specification and cannot contain the wildcard character (*).
On UNIX systems, unless you specify a pathname, the FileName is expected to be found in the default directory. See the table in this section for examples.
When a file whose FileName value is not fully qualified is written to the Hadoop distributed file system via the HDFS API interface, the file is created in the directory of the user specified by the HadoopUser attribute.
Writing Files with Multiple Instances:
The DataConnector consumer operator can write multiple files by running multiple instances. When more than one instance is specified, the output is spread across multiple files, and the distribution of records among those files is not uniform. If you want each file to contain approximately the same number of records, use the tbuild -C command-line option, which distributes rows to the consumer operator instances in round-robin fashion.
There are two methods by which the output files can be specified:
- Use the FileList attribute and name each file explicitly in the FileList file.
- Allow the DataConnector Operator to generate the file names automatically.
When file names are created automatically, the following rules are used:
UNIX File Names
File names are created by appending a dash and a number, starting at one and incrementing by one for each instance. For example, a file name of "aa" with two instances creates the following two files:
– aa-1
– aa-2
ZIP and GZIP File Names
When ZIP and GZIP files are created, the naming differs slightly from UNIX file names because the file-type extension must be preserved. Using the example above with the file names "aa.gz" and "aa.zip" and two instances, the following two files are created:
– aa-1.gz/aa-1.zip
– aa-2.gz/aa-2.zip
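The naming rules above can be sketched as follows (an illustration of the described behavior only, not the DataConnector's actual implementation; the function name is hypothetical):

```python
import os

def instance_file_names(filename: str, instances: int) -> list[str]:
    """Append a dash and an instance number to the base file name,
    preserving any .zip/.gz extension as described above."""
    root, ext = os.path.splitext(filename)
    if ext.lower() not in (".zip", ".gz"):
        root, ext = filename, ""        # plain files get the suffix at the end
    return [f"{root}-{i}{ext}" for i in range(1, instances + 1)]

print(instance_file_names("aa", 2))     # ['aa-1', 'aa-2']
print(instance_file_names("aa.gz", 2))  # ['aa-1.gz', 'aa-2.gz']
```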
z/OS Datasets
The multiple writer feature is enabled on z/OS by generating sequential "dd-names." This feature also supports using a PDS/PDSE dataset syntax expression to generate sequential PDS member names.
There are two accepted formats:
– DD-NAME FORMAT: { DD:xxx }
– DATASET FORMAT: { //'dataset-name(member-name)' }
The dd-name (xxx) or the (member-name) is used as a template and can be 0 to 8 characters long.
When zero characters are provided, the following names are generated in sequential order:
– "D0000001"
– "D0000002"
– and so on
The generated dd-name or member name always occupies 8 characters, and the sequence digits are always zero-filled. When the template leaves more positions than the sequence requires, the digits are expanded with zero fill to occupy all remaining positions. When the template leaves fewer positions than the sequence requires, the digits expand into and replace trailing template characters; the number of characters replaced equals the number of digits required minus the positions available.
DD-NAME Examples
Zero Characters "DD:" with 2 instances:
– "DD:D0000001"
– "DD:D0000002"
Two Characters "DD:XX" with 10 Instances:
– "DD:XX000001"
– "DD:XX000002"
– "DD:XX000003"
– "DD:XX000004"
– "DD:XX000005"
– "DD:XX000006"
– "DD:XX000007"
– "DD:XX000008"
– "DD:XX000009"
– "DD:XX000010"
Eight Characters "DD:XXXXXXXX" with 5 Instances:
– "DD:XXXXXXX1"
– "DD:XXXXXXX2"
– "DD:XXXXXXX3"
– "DD:XXXXXXX4"
– "DD:XXXXXXX5"
Eight Characters "DD:XXXXXXXX" with 10 Instances:
– "DD:XXXXXX01"
– "DD:XXXXXX02"
– "DD:XXXXXX03"
– "DD:XXXXXX04"
– "DD:XXXXXX05"
– "DD:XXXXXX06"
– "DD:XXXXXX07"
– "DD:XXXXXX08"
– "DD:XXXXXX09"
– "DD:XXXXXX10"
Dataset Member-Name Examples
Zero Characters "//'MYDSN()'" with 2 instances:
– "//'MYDSN(D0000001)'"
– "//'MYDSN(D0000002)'"
Two Characters "//'MYDSN(XX)'" with 10 Instances:
– "//'MYDSN(XX000001)'"
– "//'MYDSN(XX000002)'"
– "//'MYDSN(XX000003)'"
– "//'MYDSN(XX000004)'"
– "//'MYDSN(XX000005)'"
– "//'MYDSN(XX000006)'"
– "//'MYDSN(XX000007)'"
– "//'MYDSN(XX000008)'"
– "//'MYDSN(XX000009)'"
– "//'MYDSN(XX000010)'"
Eight Characters "//'MYDSN(XXXXXXXX)'" with 5 Instances:
– "//'MYDSN(XXXXXXX1)'"
– "//'MYDSN(XXXXXXX2)'"
– "//'MYDSN(XXXXXXX3)'"
– "//'MYDSN(XXXXXXX4)'"
– "//'MYDSN(XXXXXXX5)'"
Eight Characters "//'MYDSN(XXXXXXXX)'" with 10 Instances:
– "//'MYDSN(XXXXXX01)'"
– "//'MYDSN(XXXXXX02)'"
– "//'MYDSN(XXXXXX03)'"
– "//'MYDSN(XXXXXX04)'"
– "//'MYDSN(XXXXXX05)'"
– "//'MYDSN(XXXXXX06)'"
– "//'MYDSN(XXXXXX07)'"
– "//'MYDSN(XXXXXX08)'"
– "//'MYDSN(XXXXXX09)'"
– "//'MYDSN(XXXXXX10)'"
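The naming rules and examples above can be reproduced with a short sketch (an illustration only, not the DataConnector's actual implementation; the function name is hypothetical):

```python
def gen_names(template: str, instances: int) -> list[str]:
    """Generate sequential 8-character dd-names or member names from a
    0-to-8-character template, as described above."""
    if not template:
        template = "D"                  # an empty template defaults to "D"
    # The name always occupies 8 characters: a short template is padded out
    # with zero-filled digits, while an 8-character template has only as many
    # trailing characters replaced as the instance count requires.
    digits = max(8 - len(template), len(str(instances)))
    return [template[:8 - digits] + str(i).zfill(digits)
            for i in range(1, instances + 1)]

print(gen_names("", 2))               # ['D0000001', 'D0000002']
print(gen_names("XX", 10)[0])         # XX000001
print(gen_names("XXXXXXXX", 10)[-1])  # XXXXXX10
```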
- Combining FileName and FileList attributes
The FileList attribute extends the capabilities of the FileName attribute. Setting FileList = 'Y' indicates that the file identified by FileName contains a list of files to be processed as input or used as containers for output. The file names in that list are expected to be full path specifications; if no directory name is included, a file is expected to be located in the current directory. Supplying full paths for output files enables you to write files to multiple directories or disks. You cannot use the DirectoryPath attribute in conjunction with this feature.
When the combination of FileName and FileList attributes are used to control output, the supplied file list must have the same number of files as there are defined consumer instances; a mismatch results in a terminal error. At execution, rows are distributed to the listed files in a round-robin fashion if the tbuild -C option is used. Without the option, rows may not be evenly distributed across the listed files.
The DataConnector operator supports a FileList file encoded in ASCII on workstation-attached platforms and in EBCDIC on mainframe-attached platforms.
You cannot combine this feature with the archiving feature. Any attempt to use the archive feature (for example, by defining the ArchiveDirectoryPath attribute) results in a terminal error.
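As an illustration, a DataConnector consumer definition that takes its output file names from a FileList file might look like the following sketch (the operator name, attribute values, and paths are hypothetical; consult your TPT script for the exact operator definition syntax):

```
DEFINE OPERATOR FILE_WRITER
TYPE DATACONNECTOR CONSUMER
SCHEMA PRODUCT_SOURCE_SCHEMA
ATTRIBUTES
(
  VARCHAR FileName = '/home/user/output_list.txt',  /* one full path per line */
  VARCHAR FileList = 'Y',
  VARCHAR Format   = 'Delimited',
  VARCHAR TextDelimiter = '|'
);
```

Here output_list.txt would name exactly as many files as there are consumer instances; running the job with tbuild -C distributes rows to those files in round-robin fashion.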
- Limiting the size of output files
If the DataConnector consumer operator writes many records to disk, the FileSizeMax attribute can be used to spread the records across several manageably sized files instead of one very large file. FileSizeMax limits each output file to a user-specified size: whenever the limit is reached, the current output file is closed and the next records are written to a new output file.
If the pathname that you specify with the FileName attribute contains any embedded pathname syntax ("/" on a UNIX OS or "\" on Windows), it is accepted as the entire pathname. In that case, if the DirectoryPath attribute is also present, it is ignored and a warning message is issued.
If the file named by the FileList attribute exists in HDFS, it is read from the HDFS file system; otherwise, the DataConnector assumes it is a local file and processes it accordingly.
The following table contains valid syntax examples for the FileName attribute.
| Operating System | Valid Syntax | Explanation |
|---|---|---|
| z/OS | FileName = '//''name.name(member)''' | z/OS PDS DSN: Name.Name(Member). |
| z/OS | FileName = '//''name.name''' | z/OS sequential DSN: Name.Name. |
| z/OS | FileName = 'DD:ddname' | The z/OS DSN is described in the JCL DD statement named "ddname." If no DD statement is specified, the following occurs: |
| z/OS | FileName = 'member' | z/OS PDS member expected to reside in the DSN defined in the DirectoryPath attribute. |
| UNIX | FileName = '/tmp/user/filename' | UNIX pathname. |
| UNIX | FileName = 'filename' | If the DirectoryPath attribute is undefined, filename is located in the default directory. |
| Windows | FileName = '\tmp\user-filename' | Windows pathname. |
| Windows | FileName = 'filename' | Windows file name expected to be found in the directory defined in the DirectoryPath attribute. If DirectoryPath is not defined, filename is located in the default directory. |