- WriteNOS(data=None, location=None, authorization=None, stored_as='PARQUET', naming='RANGE', manifest_file=None, manifest_only=False, overwrite=False, include_ordering=None, include_hashby=None, max_object_size='16MB', compression=None, **generic_arguments)
- DESCRIPTION:
WriteNOS() function writes an input teradataml DataFrame to external
object storage, such as Amazon S3, Azure Blob storage, or Google Cloud Storage.
You must have the EXECUTE FUNCTION privilege on TD_SYSFNLIB.WRITE_NOS.
PARAMETERS:
data:
Required Argument.
Specifies the input teradataml DataFrame.
Types: teradataml DataFrame
location:
Optional Argument.
Specifies the location value, a Uniform Resource Identifier (URI)
pointing to the location in the external object storage system, in
the following format:
/connector/endpoint/bucket_or_container/prefix
The "location" string cannot exceed 2048 characters.
Types: str
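For example (illustrative values only; the connector prefix names the
external object store type, as in these hypothetical URIs):
'/S3/s3.amazonaws.com/YOUR-BUCKET/prefix/'
'/AZ/YOUR-STORAGE-ACCOUNT.blob.core.windows.net/YOUR-CONTAINER/prefix/'
'/GS/storage.googleapis.com/YOUR-BUCKET/prefix/'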
authorization:
Optional Argument.
Specifies the authorization for accessing the external storage. On
any platform, you can specify an authorization object
([DatabaseName.]AuthorizationObjectName). You must have the EXECUTE
privilege on AuthorizationObjectName.
On Amazon S3, Azure Blob storage, and Azure Data Lake Storage Gen2,
you can specify either an authorization object or a string in JSON
format. The string specifies the USER (identification) and PASSWORD
(secret_key) for accessing external storage. The following table
shows the supported credentials for USER and PASSWORD (used
in the CREATE AUTHORIZATION command):
+-----------------------+--------------------------+--------------------------+
| System/Scheme | USER | PASSWORD |
+-----------------------+--------------------------+--------------------------+
| AWS | Access Key ID | Access Key Secret |
| | | |
| Azure / Shared Key | Storage Account Name | Storage Account Key |
| | | |
| Azure Shared Access | Storage Account Name | Account SAS Token |
| Signature (SAS) | | |
| | | |
| Google Cloud | Access Key ID | Access Key Secret |
| (S3 interop mode) | | |
| | | |
| Google Cloud (native) | Client Email | Private Key |
| | | |
| Public access object | <empty string> | <empty string> |
| stores | Enclose the empty string | Enclose the empty string |
| | in single straight | in single straight |
| | quotes: USER '' | quotes: PASSWORD '' |
+-----------------------+--------------------------+--------------------------+
If you use a function mapping to define a wrapper for WRITE_NOS SQLE Engine
function, you can specify the authorization in the function mapping.
To set the function mapping for the WriteNOS() function, set the following
configuration option to the function mapping name:
teradataml.options.configure.write_nos_function_mapping = "<function mapping name>"
With function mappings, you can use only [ INVOKER | DEFINER ] TRUSTED,
not system-wide authorization. If an AWS IAM credential provides access,
you can omit the AUTHORIZATION clause.
Types: str
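For example, a sketch of the two accepted forms (all names and credential
values below are placeholders):
# Form 1: An authorization object; requires the EXECUTE privilege on it.
authorization="MyAuthDB.MyAuthObject"
# Form 2: A JSON string with the USER/PASSWORD credentials, here AWS-style.
authorization='{"Access_ID": "YOUR-ACCESS-KEY-ID", "Access_Key": "YOUR-SECRET-KEY"}'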
stored_as:
Optional Argument.
Specifies the formatting style of the external data.
PARQUET means the external data is formatted as Parquet. Objects
created in external storage by WriteNOS are written only in Parquet
format.
Permitted Values:
* PARQUET
Default Value: "PARQUET"
Types: str
naming:
Optional Argument.
Specifies how the objects containing the rows of data are named
in the external storage:
Permitted Values:
* DISCRETE-
Discrete naming uses the ordering column values as part of the object
names in external storage. For example, if "data_order_column =
['dateColumn', 'intColumn']" is specified, the discrete form name of the
objects written to external storage includes the values for those columns
as part of the object name, which would look similar to this:
S3/ceph-s3.teradata.com/xz186000/2019-03-01/13/object_33_0_1.parquet
2019-03-01 is the value for dateColumn, the first ordering column, and 13 is
the value for the second ordering column, intColumn.
All rows stored in this external Parquet-formatted object contain those
two values.
* RANGE-
Range naming, the default, includes as part of the object name the
range of values included in the partition for each ordering column.
For example, using the same "data_order_column" as above, the object
names would look similar to this:
S3/ceph-s3.teradata.com/xz186000/2019-01-01/2019-03-02/9/10000/object_33_0_1.parquet
where 2019-01-01 is the minimum value in that object for the first
ordering column, dateColumn, and 2019-03-02 is the maximum value for
that column in the rows stored in this external Parquet-formatted
object. Value 9 is the minimum value for the second ordering column,
intColumn, and 10000 is the maximum value for that column.
Default Value: "RANGE"
Types: str
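For example, a sketch of requesting discrete naming (assumes a teradataml
DataFrame "df" with the ordering columns used above; location and
credential values are placeholders):
obj = WriteNOS(data=df,
               location='/S3/s3.amazonaws.com/Your-Bucket/location/',
               authorization='{"Access_ID": "YOUR-ID", "Access_Key": "YOUR-KEY"}',
               naming='DISCRETE',
               data_order_column=["dateColumn", "intColumn"],
               include_ordering=True,
               stored_as='PARQUET')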
manifest_file:
Optional Argument.
Specifies the fully qualified path and file name where the manifest
file is written. Use the format
"/connector/end point/bucket_or_container/prefix/manifest_file_name"
For example:
"/S3/ceph-s3.teradata.com/xz186000/manifest/manifest.json"
If you do not include the "manifest_file" argument, no manifest file is
written.
Types: str
manifest_only:
Optional Argument.
Specifies whether to write only a manifest file in external storage.
No actual data objects are written to external storage when
"manifest_only=True". You must also use the "manifest_file" argument
to create a manifest file in external storage. Use this option to
create a new manifest file in the event that a WriteNOS operation
fails due to a database abort or restart, or when network
connectivity issues interrupt and stop a WriteNOS operation before
all data has been written to external storage. The manifest is
created from the teradataml DataFrame or query result set that is
input to WriteNOS. The input must be a list of storage object names
and sizes, with one row per object.
Note: The input to WriteNOS with manifest_only can itself incorporate
ReadNOS, similar to this, which uses function mappings for WriteNOS
and ReadNOS:
read_nos_obj = ReadNOS(return_type="NOSREAD_KEYS")
WriteNOS(data=read_nos_obj.result.filter(items=['Location', 'ObjectLength']),
         manifest_file=manifest_file,
         manifest_only=True)
A query like this can be used if a WriteNOS operation fails before
it can create a manifest file. The new manifest file created using
ReadNOS will reflect all data objects currently in the external
storage location, and can aid in determining which data objects
resulted from the incomplete WriteNOS operation. For more
information, see Teradata Vantage™ - Native Object Store Getting
Started Guide, B035-1214.
Default Value: False
Types: bool
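As a concrete variant of the note above, a sketch of recreating the
manifest without function mappings (location, authorization, and
manifest_file values are placeholders; "overwrite=True" replaces a
partial manifest left by the failed operation):
read_nos_obj = ReadNOS(location='/S3/s3.amazonaws.com/Your-Bucket/location/',
                       authorization='{"Access_ID": "YOUR-ID", "Access_Key": "YOUR-KEY"}',
                       return_type="NOSREAD_KEYS")
WriteNOS(data=read_nos_obj.result.filter(items=['Location', 'ObjectLength']),
         location='/S3/s3.amazonaws.com/Your-Bucket/location/',
         authorization='{"Access_ID": "YOUR-ID", "Access_Key": "YOUR-KEY"}',
         manifest_file='/S3/s3.amazonaws.com/Your-Bucket/manifest/manifest.json',
         manifest_only=True,
         overwrite=True)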
overwrite:
Optional Argument.
Specifies whether an existing manifest file in external storage will
be overwritten with a new manifest file that has the same name.
If False, WriteNOS returns an error if a manifest file exists in
external storage that is named identically to the value of
"manifest_file".
Note: "overwrite" must be used with "manifest_only" set to True.
Default Value: False
Types: bool
include_ordering:
Optional Argument.
Specifies whether column(s) specified in argument "data_order_column" and
their values are written to external storage.
Types: bool
include_hashby:
Optional Argument.
Specifies whether column(s) specified in argument "data_hash_column" and
their values are written to external storage.
Types: bool
max_object_size:
Optional Argument.
Specifies the maximum output object size in megabytes, where
maximum object size can range between 4 and 16. The default is the
value of the DefaultRowGroupSize field in DBS Control. For more
information on DBS Control, see Teradata Vantage™ - Database
Utilities, B035-1102.
Default Value: "16MB"
Types: str
compression:
Optional Argument.
Specifies the compression algorithm used to compress the objects
written to external storage.
Note:
For Parquet files, compression occurs inside parts of the Parquet
file rather than for the entire file, so the file extension on
external objects remains '.parquet'.
Permitted Values: GZIP, SNAPPY
Types: str
**generic_arguments:
Specifies the generic keyword arguments SQLE functions accept.
Below are the generic keyword arguments:
persist:
Optional Argument.
Specifies whether to persist the results of the function in a table or not.
When set to True, results are persisted in a table; otherwise, results
are garbage collected at the end of the session.
Default Value: False
Types: bool
volatile:
Optional Argument.
Specifies whether to put the results of the function in a volatile table or not.
When set to True, results are stored in a volatile table, otherwise not.
Default Value: False
Types: bool
Function allows the user to partition, hash, order or local order the input
data. These generic arguments are available for each argument that accepts
teradataml DataFrame as input and can be accessed as:
* "<input_data_arg_name>_partition_column" accepts str or list of str (Strings)
* "<input_data_arg_name>_hash_column" accepts str or list of str (Strings)
* "<input_data_arg_name>_order_column" accepts str or list of str (Strings)
* "local_order_<input_data_arg_name>" accepts boolean
Note:
These generic arguments are supported by teradataml if the underlying
SQLE Engine function supports them; otherwise an exception is raised.
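For example, a sketch of passing the "persist" generic argument (assumes
a teradataml DataFrame "df" and placeholder credentials, as in Example 1
below):
obj = WriteNOS(data=df,
               location='/S3/s3.amazonaws.com/Your-Bucket/location/',
               authorization='{"Access_ID": "YOUR-ID", "Access_Key": "YOUR-KEY"}',
               stored_as='PARQUET',
               persist=True)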
RETURNS:
Instance of WriteNOS.
Output teradataml DataFrames can be accessed using attribute
references, such as WriteNOSObj.<attribute_name>.
Output teradataml DataFrame attribute name is:
result
RAISES:
TeradataMlException, TypeError, ValueError
EXAMPLES:
# Notes:
#     1. Get the connection to Vantage before importing the function in user space.
#     2. The user can import the function if it is available on the Vantage system
#        the user is connected to.
#     3. To check the list of analytic functions available on the connected Vantage
#        system, use "display_analytic_functions()".
# Load the example data.
load_example_data("teradataml", ["titanic"])
# Create teradataml DataFrame object.
titanic_data = DataFrame.from_table("titanic")
# Check the list of available analytic functions.
display_analytic_functions()
# Example 1: Writing the DataFrame to "location".
obj = WriteNOS(data = titanic_data,
location='/S3/s3.amazonaws.com/Your-Bucket/location/',
authorization='{"Access_ID": "YOUR-ID", "Access_Key": "YOUR-KEY"}',
stored_as='PARQUET')
# Print the result DataFrame.
print(obj.result)
# Note: Before proceeding, verify with your database administrator that you have the correct privileges,
# an authorization object, and a function mapping (for WRITE_NOS SQLE Engine function).
# The examples below assume the use of function mapping "WRITE_NOS_FM" with
# location and authorization object.
# Setting function mapping with options.
from teradataml.options.configure import configure
configure.write_nos_function_mapping = "WRITE_NOS_FM"
# Example 2: Writing the DataFrame to "location" using function mapping.
obj = WriteNOS(data=titanic_data, stored_as='PARQUET')
# Print the result DataFrame.
print(obj.result)
# Example 3: Write DataFrame to external storage with partitioning.
obj = WriteNOS(data=titanic_data,
data_partition_column="sex",
data_order_column="sex",
stored_as='PARQUET')
# Print the result DataFrame.
print(obj.result)
# Example 4: Write DataFrame to external storage with max object size as 4MB.
obj = WriteNOS(data=titanic_data,
include_ordering=True,
max_object_size='4MB',
compression='GZIP',
stored_as='PARQUET')
# Print the result DataFrame.
print(obj.result)
# Example 5: Write DataFrame to external storage with hashing.
obj = WriteNOS(data=titanic_data,
data_hash_column="sex",
data_order_column="sex",
local_order_data=True,
include_hashby=True,
stored_as='PARQUET')
# Print the result DataFrame.
print(obj.result)
# Example 6: Write a portion of data from a database table to external object store.
load_example_data("teradataml", ["admissions_train"])
query = DataFrame.from_query("select * from admissions_train where masters='yes'")
obj = WriteNOS(data=query, stored_as='PARQUET')
# Print the result DataFrame.
print(obj.result)
# Example 7: Write DataFrame manifest_table to the manifest file on external object store.
import pandas as pd
obj_names = ['/S3/s3.amazonaws.com/YOUR-STORAGE-ACCOUNT/20180701/ManifestFile/object_33_0_1.parquet',
'/S3/s3.amazonaws.com/YOUR-STORAGE-ACCOUNT/20180701/ManifestFile/object_33_6_1.parquet',
'/S3/s3.amazonaws.com/YOUR-STORAGE-ACCOUNT/20180701/ManifestFile/object_33_1_1.parquet']
obj_size = [2803, 2733, 3009]
manifest_table = pd.DataFrame({'ObjectName':obj_names, 'ObjectSize': obj_size})
copy_to_sql(manifest_table, "manifest_table")
manifest_table = DataFrame("manifest_table")
obj = WriteNOS(data=manifest_table,
location='/S3/s3.amazonaws.com/YOUR-STORAGE-ACCOUNT/20180701/ManifestFile2/',
manifest_file='/S3/s3.amazonaws.com/YOUR-STORAGE-ACCOUNT/20180701/ManifestFile2/manifest2.json',
manifest_only=True,
stored_as='PARQUET')
# Print the result DataFrame.
print(obj.result)