WriteNOS(data=None, location=None, authorization=None, stored_as='PARQUET', naming='RANGE', manifest_file=None, manifest_only=False, overwrite=False, include_ordering=None, include_hashby=None, max_object_size='16MB', compression=None, **generic_arguments)
DESCRIPTION:
WriteNOS() function writes the input teradataml DataFrame to external object storage,
such as Amazon S3, Azure Blob Storage, or Google Cloud Storage.
You must have the EXECUTE FUNCTION privilege on TD_SYSFNLIB.WRITE_NOS.
PARAMETERS:
data:
Required Argument.
Specifies the input teradataml DataFrame.
Types: teradataml DataFrame
location:
Optional Argument.
Specifies the location value, which is a Uniform Resource Identifier
(URI) pointing to a location in the external object storage system.
A URI identifying the external storage system in the format:
/connector/endpoint/bucket_or_container/prefix
The "location" string cannot exceed 2048 characters.
Types: str
authorization:
Optional Argument.
Specifies the authorization for accessing the external storage.
Following are the options for using "authorization":
* Passing access id and key as dictionary:
authorization = {'Access_ID': 'access_id', 'Access_Key': 'secret_key'}
* Passing access id and key as string in JSON format:
authorization = '{"Access_ID":"access_id", "Access_Key":"secret_key"}'
* Using Authorization Object:
On any platform, an authorization object can be specified as
authorization = "[DatabaseName.]AuthorizationObjectName"
The EXECUTE privilege on AuthorizationObjectName is required.
The following table shows the supported USER (identification) and
PASSWORD (secret_key) credentials for accessing external storage,
as used in the CREATE AUTHORIZATION SQL command:
+-----------------------+--------------------------+--------------------------+
| System/Scheme | USER | PASSWORD |
+-----------------------+--------------------------+--------------------------+
| AWS | Access Key ID | Access Key Secret |
| | | |
| Azure / Shared Key | Storage Account Name | Storage Account Key |
| | | |
| Azure Shared Access | Storage Account Name | Account SAS Token |
| Signature (SAS) | | |
| | | |
| Google Cloud | Access Key ID | Access Key Secret |
| (S3 interop mode) | | |
| | | |
| Google Cloud (native) | Client Email | Private Key |
| | | |
| Public access object | <empty string> | <empty string> |
| stores | Enclose the empty string | Enclose the empty string |
| | in single straight | in single straight |
| | quotes: USER '' | quotes: PASSWORD '' |
+-----------------------+--------------------------+--------------------------+
When accessing GCS, Analytics Database uses either the S3-compatible connector or
the native Google connector, depending on the user credentials.
Notes:
* If using AWS IAM credentials, "authorization" can be omitted.
* If the S3 user account requires the use of physical or virtual security,
a session token can be used with Access_ID and Access_Key in this syntax:
authorization = {"Access_ID":"access_id",
"Access_Key":"secret_key",
"Session_Token":"session_token"}
The session token can be obtained using the AWS CLI. For example:
aws sts get-session-token
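For example, a minimal sketch of a WriteNOS() call that uses an
authorization object instead of explicit keys; the object name
"MyAuthDB.MyS3Auth" and the bucket are placeholders:
WriteNOS(data=titanic_data,
location='/S3/s3.amazonaws.com/YOUR-BUCKET/location/',
authorization="MyAuthDB.MyS3Auth",  # placeholder authorization object name
stored_as='PARQUET')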
Types: str or dict
stored_as:
Optional Argument.
Specifies the formatting style of the external data.
PARQUET means the external data is formatted as Parquet. Objects
created in external storage by WriteNOS are written only in Parquet
format.
Permitted Values:
* PARQUET
Default Value: "PARQUET"
Types: str
naming:
Optional Argument.
Specifies how the objects containing the rows of data are named
in the external storage:
Permitted Values:
* DISCRETE-
Discrete naming uses the ordering column values as part of the object
names in external storage. For example, if "data_order_column" is set to
["dateColumn", "intColumn"], the discrete form name of the objects
written to external storage would include the values for those columns
as part of the object name, which would look similar to this:
S3/ceph-s3.teradata.com/xz186000/2019-03-01/13/object_33_0_1.parquet
where 2019-03-01 is the value for dateColumn, the first ordering column,
and 13 is the value for the second ordering column, intColumn.
All rows stored in this external Parquet-formatted object contain those
two values.
* RANGE-
Range naming, the default, includes as part of the object name the
range of values included in the partition for each ordering column.
For example, using the same "data_order_column" as above, the object
names would look similar to this:
S3/ceph-s3.teradata.com/xz186000/2019-01-01/2019-03-02/9/10000/object_33_0_1.parquet
where 2019-01-01 is the minimum value in that object for the first
ordering column, dateColumn, and 2019-03-02 is the maximum value for the
rows stored in this external Parquet-formatted object. Value 9 is the
minimum value for the second ordering column, intColumn, and 10000 is
the maximum value for that column.
Default Value: "RANGE"
Types: str
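For example, a minimal sketch of a WriteNOS() call that uses DISCRETE
naming with an ordering column; the bucket and credentials are placeholders:
WriteNOS(data=titanic_data,
location='/S3/s3.amazonaws.com/YOUR-BUCKET/location/',
authorization={"Access_ID": "YOUR-ID", "Access_Key": "YOUR-KEY"},
naming='DISCRETE',
data_order_column="sex",
stored_as='PARQUET')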
manifest_file:
Optional Argument.
Specifies the fully qualified path and file name where the manifest
file is written. Use the format:
"/connector/endpoint/bucket_or_container/prefix/manifest_file_name"
For example:
"/S3/ceph-s3.teradata.com/xz186000/manifest/manifest.json"
If "manifest_file" argument is not included, no manifest file is
written.
Types: str
manifest_only:
Optional Argument.
Specifies whether to write only a manifest file to external storage.
No actual data objects are written to external storage when
"manifest_only" is set to True. One must also use the "manifest_file"
option to create a manifest file in external storage. Use this option
to create a new manifest file in the event that a WriteNOS()
fails due to a database abort or restart, or when network connectivity
issues interrupt and stop a WriteNOS() before all data has
been written to external storage. The manifest is created from the
teradataml DataFrame that is input to WriteNOS().
The input must be a DataFrame of storage object names and sizes, with
one row per object.
Note:
The input to WriteNOS() with "manifest_only" can itself incorporate
ReadNOS(), similar to this, which uses function mappings for WriteNOS()
and ReadNOS():
read_nos_obj = ReadNOS(return_type="NOSREAD_KEYS")
WriteNOS(data=read_nos_obj.result.filter(items=['Location', 'ObjectLength']),
manifest_file=manifest_file,
manifest_only=True)
Function calls like above can be used if a WriteNOS() fails before
it can create a manifest file. The new manifest file created using
ReadNOS() reflects all data objects currently in the external
storage location, and can aid in determining which data objects
resulted from the incomplete WriteNOS().
For more information, see Teradata Vantage - Native Object Store
Getting Started Guide.
Default Value: False
Types: bool
overwrite:
Optional Argument.
Specifies whether an existing manifest file in external storage is
overwritten with a new manifest file that has the same name.
When set to False, WriteNOS() returns an error if a manifest file exists in
external storage that is named identically to the value of
"manifest_file".
Note:
"overwrite" must be used with "manifest_only" set to True.
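For example, a minimal sketch that re-creates an existing manifest file;
the paths and credentials are placeholders and "manifest_df" is a
DataFrame of object names and sizes as described under "manifest_only":
WriteNOS(data=manifest_df,
location='/S3/s3.amazonaws.com/YOUR-BUCKET/ManifestFile/',
authorization={"Access_ID": "YOUR-ID", "Access_Key": "YOUR-KEY"},
manifest_file='/S3/s3.amazonaws.com/YOUR-BUCKET/ManifestFile/manifest.json',
manifest_only=True,
overwrite=True,
stored_as='PARQUET')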
Default Value: False
Types: bool
include_ordering:
Optional Argument.
Specifies whether the column(s) specified in the "data_order_column"
argument and their values are written to external storage.
Types: bool
include_hashby:
Optional Argument.
Specifies whether the column(s) specified in the "data_hash_column"
argument and their values are written to external storage.
Types: bool
max_object_size:
Optional Argument.
Specifies the maximum output object size in megabytes, where the
maximum object size can range from 4MB to 16MB. The default is the
value of the DefaultRowGroupSize field in DBS Control.
For more information on DBS Control, see Teradata Vantage - Database
Utilities.
Default Value: "16MB"
Types: str
compression:
Optional Argument.
Specifies the compression algorithm used to compress the objects
written to external storage.
Note:
For Parquet files, compression occurs inside parts of the Parquet
file rather than on the entire file, so the file extension of
external objects remains '.parquet'.
Permitted Values: GZIP, SNAPPY
Types: str
**generic_arguments:
Specifies the generic keyword arguments SQLE functions accept.
Below are the generic keyword arguments:
persist:
Optional Argument.
Specifies whether to persist the results of the function in a table or not.
When set to True, results are persisted in a table; otherwise, results
are garbage collected at the end of the session.
Default Value: False
Types: boolean
volatile:
Optional Argument.
Specifies whether to put the results of the function in a volatile table or not.
When set to True, results are stored in a volatile table; otherwise, they are not.
Default Value: False
Types: boolean
The function allows the user to partition, hash, order, or local order the input
data. These generic arguments are available for each argument that accepts
teradataml DataFrame as input and can be accessed as:
* "<input_data_arg_name>_partition_column" accepts str or list of str (Strings)
* "<input_data_arg_name>_hash_column" accepts str or list of str (Strings)
* "<input_data_arg_name>_order_column" accepts str or list of str (Strings)
* "local_order_<input_data_arg_name>" accepts boolean
Note:
These generic arguments are supported by teradataml if the underlying
Analytic Database function supports them; otherwise, an exception is raised.
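For example, a minimal sketch that persists the function output in a
table using the generic "persist" argument; the bucket and credentials
are placeholders:
WriteNOS(data=titanic_data,
location='/S3/s3.amazonaws.com/YOUR-BUCKET/location/',
authorization={"Access_ID": "YOUR-ID", "Access_Key": "YOUR-KEY"},
stored_as='PARQUET',
persist=True)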
RETURNS:
Instance of WriteNOS.
Output teradataml DataFrames can be accessed using attribute
references, such as WriteNOSObj.<attribute_name>.
Output teradataml DataFrame attribute name is:
result
RAISES:
TeradataMlException, TypeError, ValueError
EXAMPLES:
# Notes:
# 1. Get the connection to Vantage to execute the function.
# 2. One must import the required functions mentioned in
# the example from teradataml.
# 3. Function will raise an error if it is not supported on the Vantage
# system the user is connected to.
# Check the list of available analytic functions.
display_analytic_functions()
# Load the example data.
load_example_data("teradataml", "titanic")
# Create teradataml DataFrame object.
titanic_data = DataFrame.from_table("titanic")
# Example 1: Writing the DataFrame to an AWS S3 location.
obj = WriteNOS(data=titanic_data,
location='/S3/s3.amazonaws.com/Your-Bucket/location/',
authorization={"Access_ID": "YOUR-ID", "Access_Key": "YOUR-KEY"},
stored_as='PARQUET')
# Print the result DataFrame.
print(obj.result)
# Example 2: Write DataFrame to external storage with partition and order by
# column 'sex'.
obj = WriteNOS(data=titanic_data,
location='/S3/s3.amazonaws.com/Your-Bucket/location/',
authorization={"Access_ID": "YOUR-ID", "Access_Key": "YOUR-KEY"},
data_partition_column="sex",
data_order_column="sex",
stored_as='PARQUET')
# Print the result DataFrame.
print(obj.result)
# Example 3: Write DataFrame to external storage with hashing and order by
# column 'sex'.
obj = WriteNOS(data=titanic_data,
location='/S3/s3.amazonaws.com/Your-Bucket/location/',
authorization={"Access_ID": "YOUR-ID", "Access_Key": "YOUR-KEY"},
data_hash_column="sex",
data_order_column="sex",
local_order_data=True,
include_hashby=True,
stored_as='PARQUET')
# Print the result DataFrame.
print(obj.result)
# Example 4: Write DataFrame to external storage with max object size as 4MB.
obj = WriteNOS(data=titanic_data,
location='/S3/s3.amazonaws.com/Your-Bucket/location/',
authorization={"Access_ID": "YOUR-ID", "Access_Key": "YOUR-KEY"},
include_ordering=True,
max_object_size='4MB',
compression='GZIP',
stored_as='PARQUET')
# Print the result DataFrame.
print(obj.result)
# Example 5: Write DataFrame of manifest table into a manifest file.
import pandas as pd
obj_names = ['/S3/s3.amazonaws.com/YOUR-STORAGE-ACCOUNT/20180701/ManifestFile/object_33_0_1.parquet',
'/S3/s3.amazonaws.com/YOUR-STORAGE-ACCOUNT/20180701/ManifestFile/object_33_6_1.parquet',
'/S3/s3.amazonaws.com/YOUR-STORAGE-ACCOUNT/20180701/ManifestFile/object_33_1_1.parquet']
obj_size = [2803, 2733, 3009]
manifest_df = pd.DataFrame({'ObjectName':obj_names, 'ObjectSize': obj_size})
from teradataml import copy_to_sql
copy_to_sql(manifest_df, "manifest_df")
manifest_df = DataFrame("manifest_df")
obj = WriteNOS(data=manifest_df,
location='/S3/s3.amazonaws.com/YOUR-STORAGE-ACCOUNT/20180701/ManifestFile2/',
authorization={"Access_ID": "YOUR-ID", "Access_Key": "YOUR-KEY"},
manifest_file='/S3/s3.amazonaws.com/YOUR-STORAGE-ACCOUNT/20180701/ManifestFile2/manifest2.json',
manifest_only=True,
stored_as='PARQUET')
# Print the result DataFrame.
print(obj.result)
# Note:
# Before proceeding, verify with your database administrator that you have the
# correct privileges, an authorization object, and a function mapping (for the
# WRITE_NOS function).
# If a function mapping for the WRITE_NOS Analytic Database function is created
# as 'WRITE_NOS_FM', with the location and authorization object mapped in it,
# then set the function mapping with teradataml options as shown below.
# Example 6: Writing the DataFrame using function mapping.
from teradataml.options.configure import configure
configure.write_nos_function_mapping = "WRITE_NOS_FM"
obj = WriteNOS(data=titanic_data, stored_as='PARQUET')
# Print the result DataFrame.
print(obj.result)