Compression and Decompression | Teradata Access Module for S3

Teradata® Tools and Utilities Access Module Reference - 20.00

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, Lake, VMware
Product: Access Module
Release Number: 20.00
Published: October 2023
Last edition: 2024-05-14
Product Category: Teradata Tools and Utilities

GZIP data compression is supported when reading and writing objects in S3.

For export, if the object name ends in .gz, the generated objects are compressed. When S3SinglePartFile=True, a single compressed object is created with the specified name, ending in .gz. When S3SinglePartFile=False, the apparent directory holding the F000000, F000001, ... files has a name ending in .gz, but the individual objects in that apparent directory do not have a .gz suffix. They are compressed anyway, because the object name specified did end in .gz. Unless the DontSplitRows option was selected on write, the objects cannot be uncompressed individually; they must be concatenated first. The access module does this automatically on a load. If the objects are downloaded manually with "aws s3 cp", the retrieved pieces must be concatenated and the resulting file given a name ending in .gz.
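The manual step described above (concatenate the downloaded pieces, then uncompress the result) can be sketched in Python as follows. This is a minimal illustration, not part of the access module: the function name join_and_gunzip and the paths are hypothetical, and it assumes the pieces have already been downloaded into a local directory and that their zero-padded names (F000000, F000001, ...) sort into the order in which they were written.

```python
import gzip
import shutil
from pathlib import Path


def join_and_gunzip(piece_dir: Path, out_path: Path) -> None:
    """Concatenate downloaded pieces in name order, then gunzip the result.

    Hypothetical helper for illustration; not part of the access module.
    """
    # Assumption: zero-padded piece names sort into the order they were written.
    pieces = sorted(p for p in piece_dir.iterdir() if p.is_file())

    # Build the joined file and give it a name ending in .gz.
    joined = out_path.with_name(out_path.name + ".gz")
    with open(joined, "wb") as dst:
        for piece in pieces:
            with open(piece, "rb") as src:
                shutil.copyfileobj(src, dst)

    # Decompress the joined file as a stream, without loading it into memory.
    with gzip.open(joined, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

For example, join_and_gunzip(Path("export.gz"), Path("export.txt")) would read every file in a local export.gz directory and write the uncompressed data to export.txt, leaving the joined export.txt.gz next to it.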

For import, if the object name ends in .gz, the object is decompressed. When S3SinglePartFile=False and the specified object name ends in .gz, all the Fxxxxxx files are concatenated and uncompressed as if they were a single object, even though the Fxxxxxx files themselves do not end in .gz (see the export description above). When S3SinglePartFile=True and a wildcard specification is not used, the object is decompressed as it is read if its name ends in .gz. When S3SinglePartFile=True and a wildcard specification is used, the individual matches are inspected, and each is decompressed or not depending on whether it has a .gz suffix.
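The three import cases can be summarized as a small decision function. The sketch below only restates the rules in this section; the function name decompression_plan, its parameters, and the "whole-stream" / "per-object" labels are hypothetical and do not correspond to any access module API.

```python
from typing import List, Tuple


def decompression_plan(
    spec: str,
    single_part_file: bool,
    wildcard: bool,
    names: List[str],
) -> Tuple[str, List[Tuple[str, bool]]]:
    """Return (unit, [(object name, decompress?)]) for an import.

    spec   -- the object name given by the user
    names  -- the objects actually read (F-files, wildcard matches, or [spec])
    """
    if not single_part_file:
        # The F-files are treated as one stream; only the specified name matters.
        gz = spec.endswith(".gz")
        return "whole-stream", [(n, gz) for n in names]
    if wildcard:
        # Each match is judged by its own suffix.
        return "per-object", [(n, n.endswith(".gz")) for n in names]
    # Single object, no wildcard: decided by the specified name.
    return "per-object", [(spec, spec.endswith(".gz"))]
```

With S3SinglePartFile=False and a specified name of export.gz, every F-file is flagged for decompression as part of one stream; with a wildcard, a match named a.gz is flagged and a match named b.csv is not.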

Each object that needs decompression is decompressed individually. The objects are concatenated after the optional decompress operation and delivered to Teradata Parallel Transporter. Although it would be unusual, a mixture of compressed and uncompressed objects is allowed. The concatenation of the results is a streaming operation and is not memory limited. The data is not landed on disk.

This method supports the compressed file formats produced by some other cloud databases. For instance, a list of files ending in .gz produced by a single Redshift export can be read this way.

Checkpoint/Restart is implemented for compressed objects, but the seek phase of the restart is implemented by reading and uncompressing the object until the correct location is found. This will probably still be faster than restarting the job from the beginning.
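The behavior described above (decompress each object that needs it, concatenate as a stream, and seek on restart by reading forward) can be illustrated with the following Python sketch. It is not the access module's implementation: the names stream_objects, seek_by_reading, and CHUNK are hypothetical, and the restart position is assumed here to be a byte offset into the decompressed stream.

```python
import gzip
import io
from typing import BinaryIO, Iterable, Iterator, Tuple

CHUNK = 64 * 1024  # read size; memory use stays bounded regardless of object size


def stream_objects(objects: Iterable[Tuple[str, BinaryIO]]) -> Iterator[bytes]:
    """Yield the concatenated contents, decompressing only names ending in .gz."""
    for name, raw in objects:
        if name.endswith(".gz"):
            reader = gzip.GzipFile(fileobj=raw, mode="rb")
        else:
            reader = raw
        while True:
            chunk = reader.read(CHUNK)
            if not chunk:
                break
            yield chunk


def seek_by_reading(chunks: Iterator[bytes], offset: int) -> Iterator[bytes]:
    """Discard the first `offset` decompressed bytes, then pass the rest through.

    Assumption: the restart position is a byte offset in the decompressed data.
    """
    remaining = offset
    for chunk in chunks:
        if remaining >= len(chunk):
            remaining -= len(chunk)  # still before the restart point
            continue
        yield chunk[remaining:]
        remaining = 0
```

Because a gzip stream cannot be entered at an arbitrary position, the sketch reaches the restart point by decompressing and discarding data, which is the cost the paragraph above refers to.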