READ_NOS Syntax Elements - Teradata Vantage

Teradata® VantageCloud Lake

Deployment: VantageCloud
Edition: Lake
Product: Teradata Vantage
Published: January 2023
Language: English (United States)
Last Update: 2024-04-03
table_name
Name of table.
view_name
Name of view.
query_expression
Expression to specify input to table operator.
LOCATION
You must specify a LOCATION value, which is a Uniform Resource Identifier (URI) pointing to the data in external object storage. The LOCATION value includes the following components:
  • Amazon S3:
     /connector/bucket.endpoint/[key_prefix]
  • Azure Blob storage and Azure Data Lake Storage Gen2:
     /connector/storage-account.endpoint/container/[key_prefix]
  • Google Cloud Storage:
     /connector/endpoint/bucket/[key_prefix]
connector
Identifies the type of external object storage system where the data is located.
Teradata requires the storage location to begin with one of the following prefixes for all external object storage locations:
  • Amazon S3 storage locations must begin with /S3 or /s3.
  • Azure Blob Storage locations (including Azure Data Lake Storage Gen2 in Blob Interop Mode) must begin with /AZ or /az.
  • Google Cloud Storage locations must begin with /GS or /gs.
storage-account
Used by Azure. The Azure storage account contains your Azure storage data objects.
endpoint
A URL that identifies the system-specific entry point for external object storage.
bucket (Amazon S3, GCS) or container (Azure Blob storage and Azure Data Lake Storage Gen2)
A container that logically groups stored objects in external object storage.
key_prefix
Identifies one or more objects in the logical organization of the bucket data. As a key prefix, not an actual directory path, the key prefix may match one or more objects in the external object storage. For example, the key prefix '/fabrics/cotton/colors/b/' matches objects: /fabrics/cotton/colors/blue, /fabrics/cotton/colors/brown, and /fabrics/cotton/colors/black. If there were organization levels below those, such as /fabrics/cotton/colors/blue/shirts, the same key prefix gathers those objects too.
Vantage validates only the first encountered file from the location key prefix.

For example, this LOCATION value may specify all objects on an Amazon cloud storage system for the month of December, 2001:

LOCATION('/S3/YOUR-BUCKET.s3.amazonaws.com/csv/US-Crimes/csv-files/2001/Dec/')
connector: S3
bucket: YOUR-BUCKET
endpoint: s3.amazonaws.com
key_prefix: csv/US-Crimes/csv-files/2001/Dec/

This LOCATION can specify an individual storage object (or file), Day1.csv:

LOCATION('/S3/YOUR-BUCKET.s3.amazonaws.com/csv/US-Crimes/csv-files/2001/Dec/Day1.csv')
connector: S3
bucket: YOUR-BUCKET
endpoint: s3.amazonaws.com
key_prefix: csv/US-Crimes/csv-files/2001/Dec/Day1.csv

This LOCATION specifies an entire container in an Azure external object storage (Azure Blob storage or Azure Data Lake Storage Gen2). The container may contain multiple file objects:

LOCATION('/AZ/YOUR-STORAGE-ACCOUNT.blob.core.windows.net/CONTAINER/nos-csv-data')
connector: AZ
storage-account: YOUR-STORAGE-ACCOUNT
endpoint: blob.core.windows.net
container: CONTAINER
key_prefix: nos-csv-data

This is an example of a Google Cloud Storage location:

LOCATION('/gs/storage.googleapis.com/YOUR-BUCKET/CSVDATA/RIVERS/rivers.csv')
connector: GS
endpoint: storage.googleapis.com
bucket: YOUR-BUCKET
key_prefix: CSVDATA/RIVERS/rivers.csv
AUTHORIZATION
[Optional] Authorization for accessing external object storage.
On any platform, you can specify an authorization object ([DatabaseName.]AuthorizationObjectName). You must have the EXECUTE privilege on AuthorizationObjectName.
On Amazon S3, Azure Blob Storage, and Azure Data Lake Storage Gen2, you can specify either an authorization object or a string in JSON format. The string specifies the USER (identification) and PASSWORD (secret_key) for accessing external object storage. The following list shows the supported credentials for USER/ACCESS_ID and PASSWORD/ACCESS_KEY (used in the CREATE AUTHORIZATION command):
  • AWS: USER/ACCESS_ID is the Access Key ID; PASSWORD/ACCESS_KEY is the Access Key Secret.
  • Azure / Shared Key: USER/ACCESS_ID is the Storage Account Name; PASSWORD/ACCESS_KEY is the Storage Account Key.
  • Azure Shared Access Signature (SAS): USER/ACCESS_ID is the Storage Account Name; PASSWORD/ACCESS_KEY is the Account SAS Token.
  • Google Cloud (S3 interop mode): USER/ACCESS_ID is the Access Key ID; PASSWORD/ACCESS_KEY is the Access Key Secret.
  • Google Cloud (native): USER/ACCESS_ID is the Client Email; PASSWORD/ACCESS_KEY is the Private Key.
  • On-premises object storage: USER/ACCESS_ID is the Access Key ID; PASSWORD/ACCESS_KEY is the Access Key Secret.
  • Public access object storage: USER/ACCESS_ID and PASSWORD/ACCESS_KEY are empty strings. Enclose each empty string in single straight quotation marks: USER '' and PASSWORD ''.
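For example, a minimal sketch of passing Amazon S3 credentials inline as a JSON string; the bucket, key prefix, and credential values are placeholders:

SELECT * FROM READ_NOS (
USING
LOCATION('/S3/YOUR-BUCKET.s3.amazonaws.com/csv/US-Crimes/csv-files/')
AUTHORIZATION('{"Access_ID":"your_access_key_id", "Access_Key":"your_secret_access_key"}')
RETURNTYPE('NOSREAD_RECORD')
) AS d;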

The following are alternatives to using an access key or password to secure S3-compatible external object storage. These are included in an authorization object, which is created by the CREATE AUTHORIZATION command:
  • Amazon Identity and Access Management (IAM)
  • AWS Assume Role, which allows existing AWS IAM users and service accounts temporary access to AWS resources in other accounts.
The following is an alternative to using the Azure Storage Account Name and Storage Account Key:
  • Azure service principal, which assigns restricted permissions to applications and services accessing Azure external object storage.
If your S3 user account requires the use of physical or virtual security, you can use a session token with Access_ID and Access_KEY in this syntax:
AUTHORIZATION = '{"Access_ID":"access_id", "Access_Key":"secret_key", "Session_Token":"session_token"}'
First, get a session token using the AWS CLI. For example: aws sts get-session-token.
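For example, a sketch of a READ_NOS call that passes the temporary credentials returned by aws sts get-session-token; the bucket, key prefix, and credential values are placeholders:

SELECT * FROM READ_NOS (
USING
LOCATION('/S3/YOUR-BUCKET.s3.amazonaws.com/csv/US-Crimes/csv-files/')
AUTHORIZATION('{"Access_ID":"temporary_access_key_id", "Access_Key":"temporary_secret_key", "Session_Token":"temporary_session_token"}')
RETURNTYPE('NOSREAD_RECORD')
) AS d;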
When accessing Google Cloud Storage, Analytics Database uses either the S3-compatible connector or the native Google connector, depending on the user credentials.
In addition to Azure/Shared Key and Azure Shared Access Signature, you can use Azure Service Principal to access Azure Blob storage.

Azure Service Principal allows you to assign more restrictive permissions to applications accessing your Azure Data Lake Gen2 storage.

To use Azure Service Principal authorization, first create an authorization object for it. See CREATE AUTHORIZATION.

Then use the authorization object with the AUTHORIZATION keyword in the READ_NOS command. For example:
  1. Create the authorization object or ask an administrator to do this:
    CREATE AUTHORIZATION Asp_Simplified_Auth
    USING
    AUTHSERVICETYPE 'AZURE_SERVICE_PRINCIPAL'
    CLIENT_ID 'client_id'
    CLIENT_SECRET 'client_secret'
    TENANT_ID 'tenant_id';

    Replace client_id, client_secret, and tenant_id with the appropriate values from your external object storage.

  2. Use the authorization object in your READ_NOS command:
    SELECT * FROM READ_NOS (
    USING
    LOCATION('/AZ/azspstorage1.blob.core.windows.net/test-az-sp/ut.csv')
    AUTHORIZATION(Asp_Simplified_Auth)
    RETURNTYPE('NOSREAD_RECORD')
    ) AS d;
SCANPCT
[Optional] RETURNTYPE must be NOSREAD_RECORD (default) or NOSREAD_SCHEMA.
Percentage of data set to scan to discover the schema.

SCANPCT is enabled by default. To disable it, set DBControl option SCANPCTOption to 1. To re-enable it, set SCANPCTOption to 0 or 2.

CSV and JSON Files
  • scan_pct = 0 (default): Scans the first 100 MB of the data set (file by file if there are multiple files).
  • scan_pct between 0 and 1: Scans the first 100 MB of the data set (file by file if there are multiple files), then scan_pct * 100% of the first 1000 remaining files.
  • scan_pct = 1: Scans the first 1000 files of the data set.
Parquet Files
  • scan_pct less than 1: Scans the first 100 MB of the data set (file by file if there are multiple files).
  • scan_pct = 1: Scans the first 16 MB of each file, up to 100 MB of the data set.
A higher scan_pct may increase schema accuracy, but also increases processing time.
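For example, a minimal sketch that scans a larger sample during schema discovery; the location is a placeholder, and this assumes SCANPCT accepts a quoted decimal value like the other USING options:

SELECT * FROM READ_NOS (
USING
LOCATION('/S3/YOUR-BUCKET.s3.amazonaws.com/csv/US-Crimes/csv-files/')
RETURNTYPE('NOSREAD_SCHEMA')
SCANPCT('0.5')  -- assumed quoting style; 0.5 scans 100 MB plus 50% of the first 1000 remaining files
) AS d;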
BUFFERSIZE
Size of the network buffer to allocate when retrieving data from the external object storage repository.
Default: 16 MB (maximum value)
NOSREAD_RECORD
[Default] Returns one row for each external record with its metadata.
Access external records by specifying one of the following:
  • Input table and LOCATION and an empty table.
  • Input table with a row for each external file.
For an empty single-column input table, do the following:
For a multiple-column input table, define an input table with the following columns:
  • Location VARCHAR(2048) CHARACTER SET UNICODE
  • ObjectVersionID VARCHAR(1024) CHARACTER SET UNICODE
  • ObjectTimeStamp TIMESTAMP(6)
  • OffsetIntoObject BIGINT
  • ObjectLength BIGINT
  • Payload JSON

This table can be populated using the output of the NOSREAD_KEYS return type.
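For example, a minimal sketch of a multiple-column input table with these columns; the table name nos_input_keys is illustrative only:

CREATE TABLE nos_input_keys (
Location VARCHAR(2048) CHARACTER SET UNICODE,
ObjectVersionID VARCHAR(1024) CHARACTER SET UNICODE,
ObjectTimeStamp TIMESTAMP(6),
OffsetIntoObject BIGINT,
ObjectLength BIGINT,
Payload JSON
);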

NOSREAD_KEYS
Retrieves the list of files from the path specified in the LOCATION USING clause.
A schema definition is not necessary.
Returns:
  • Location
  • ObjectVersionID
  • ObjectTimeStamp
  • ObjectLength, the size of the external file.

See Example: Using READ_NOS with NOSREAD_KEYS Return Type.
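For example, a minimal sketch that lists the objects under a key prefix; the bucket and prefix are placeholders:

SELECT Location, ObjectLength FROM READ_NOS (
USING
LOCATION('/S3/YOUR-BUCKET.s3.amazonaws.com/csv/US-Crimes/csv-files/')
RETURNTYPE('NOSREAD_KEYS')
) AS d;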

NOSREAD_PARQUET_SCHEMA
Returns information about the Parquet data schema. If you are using the FULLSCAN option, use NOSREAD_PARQUET_SCHEMA; otherwise you can use NOSREAD_SCHEMA to get information about the Parquet schema. For information about the mapping between Parquet data types and Teradata data types, see Parquet External Files.
NOSREAD_SCHEMA
Returns the name and data type of each column of the file specified by external_file_path. Schema format can be JSON, CSV, or Parquet.
If the schema format is Parquet, NOSREAD_SCHEMA is equivalent to NOSREAD_PARQUET_SCHEMA.
SAMPLE_PERC
Specifies the percentage of rows to retrieve from the external object storage repository when RETURNTYPE is NOSREAD_RECORD. The valid range of values is from '0.0' to '1.0', where '1.0' represents 100% of the rows.
The default value is 1.0.
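For example, a sketch that retrieves approximately half of the rows; the location is a placeholder:

SELECT * FROM READ_NOS (
USING
LOCATION('/S3/YOUR-BUCKET.s3.amazonaws.com/csv/US-Crimes/csv-files/')
RETURNTYPE('NOSREAD_RECORD')
SAMPLE_PERC('0.5')
) AS d;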
STOREDAS
Specifies the formatting style of the external data.
  • PARQUET means the external data is formatted as Parquet. This is a required parameter for Parquet data.
  • TEXTFILE means the external data uses a text-based format, such as CSV or JSON.
The default is TEXTFILE.
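For example, a sketch that reads the schema of Parquet objects, where STOREDAS('PARQUET') is required because the data is Parquet; the location is a placeholder:

SELECT * FROM READ_NOS (
USING
LOCATION('/S3/YOUR-BUCKET.s3.amazonaws.com/parquet-data/')
RETURNTYPE('NOSREAD_PARQUET_SCHEMA')
STOREDAS('PARQUET')
) AS d;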
FULLSCAN
Determines whether READ_NOS scans columns of variable length types (CHAR, VARCHAR, BYTE, VARBYTE, JSON, and BSON) to discover the maximum length.
  • TRUE means the sizes of variable length data are determined from the Parquet data.
    Choosing this value can impact performance because all variable length data type columns in each Parquet file at the location must be scanned to assess the value having the greatest length.
  • FALSE means variable length field sizes are assigned the Vantage maximum value for the data type.
The default is FALSE.
FULLSCAN is only used with a RETURNTYPE of NOSREAD_PARQUET_SCHEMA.
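For example, a sketch that adds FULLSCAN('TRUE') to the Parquet schema query shown earlier, so that variable length column sizes are determined from the data; the location is a placeholder:

SELECT * FROM READ_NOS (
USING
LOCATION('/S3/YOUR-BUCKET.s3.amazonaws.com/parquet-data/')
RETURNTYPE('NOSREAD_PARQUET_SCHEMA')
STOREDAS('PARQUET')
FULLSCAN('TRUE')  -- scans all variable length columns, so it can be slower
) AS d;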
MANIFEST
Specifies whether the LOCATION value points to a manifest file (a file containing a list of files to read) or object name. The object name can include the full path or a partial path, and must identify a single file containing the manifest.
The individual entries within the manifest file must show complete paths.
The following is an example of a manifest file that lists entry locations in JSON format.
{
  "entries": [
     {"url":"s3://nos-core-us-east-1/UNICODE/JSON/mln-key/data-10/data-8_9_02-10.json"},
     {"url":"s3://nos-core-us-east-1/UNICODE/JSON/mln-key/data-10/data-8_9_02-101.json"},
     {"url":"s3://nos-core-us-east-1/UNICODE/JSON/mln-key/data-10/data-10-01/data-8_9_02-102.json"},
     {"url":"s3://nos-core-us-east-1/UNICODE/JSON/mln-key/data-10/data-10-01/data-8_9_02-103.json"}
  ]
}
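For example, a minimal sketch that reads the files listed in a manifest; the bucket and manifest object name are placeholders:

SELECT * FROM READ_NOS (
USING
LOCATION('/S3/YOUR-BUCKET.s3.amazonaws.com/json-data/manifest.json')
MANIFEST('TRUE')
RETURNTYPE('NOSREAD_RECORD')
) AS d;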
TABLE_FORMAT ('DELTALAKE')
[Optional] Specify this argument with MANIFEST ('TRUE') to read a Delta Lake table. READ_NOS operations on a Delta Lake table read the table manifest file.
The LOCATION value of a Delta Lake table must include _symlink_format_manifest. For example:
LOCATION('/S3/YOUR-BUCKET.s3.amazonaws.com/testdeltalake/deltalakewp/_symlink_format_manifest')
LOCATION('/S3/YOUR-BUCKET.s3.amazonaws.com/testdeltalake/deltalakewp/_symlink_format_manifest/zip=95661/manifest')
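For example, a minimal sketch that reads a Delta Lake table through its manifest; the bucket and path are placeholders:

SELECT * FROM READ_NOS (
USING
LOCATION('/S3/YOUR-BUCKET.s3.amazonaws.com/testdeltalake/deltalakewp/_symlink_format_manifest')
MANIFEST('TRUE')
TABLE_FORMAT('DELTALAKE')
RETURNTYPE('NOSREAD_RECORD')
) AS d;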
ROWFORMAT
Specifies the encoding format of the external row, for example:
ROWFORMAT('{"field_delimiter":",", "record_delimiter":"\n", "character_set":"LATIN"}')
Specify ROWFORMAT using JSON format. ROWFORMAT can include only the three preceding keys. Key names and values are case-specific, except for the value for "character_set", which can use any combination of letter cases.
The ROWFORMAT character set specification must be compatible with the character set of the Payload column.
Do not specify ROWFORMAT for Parquet format data.
For a JSON column, these are the default values:
  • UNICODE payload character set: ROWFORMAT('{"record_delimiter":"\n", "character_set":"UTF8"}')
  • LATIN payload character set: ROWFORMAT('{"record_delimiter":"\n", "character_set":"LATIN"}')
For a CSV column, these are the default values:
  • UNICODE payload character set: ROWFORMAT('{"character_set":"UTF8"}')
    This is the default if you do not specify an input table for READ_NOS.
  • LATIN payload character set: ROWFORMAT('{"character_set":"LATIN"}')
You can specify the following options:
field_delimiter
The default is "," (comma). You can also specify a custom field delimiter, such as tab "\t".
record_delimiter
New line feed character: "\n". A line feed (\n) is the only acceptable record delimiter.
character_set
"UTF8" or "LATIN"
If you do not specify a ROWFORMAT or payload column, Vantage assumes UTF8 Unicode.
HEADER
Specifies whether the first row of data in an input CSV file is interpreted as column headings for the subsequent rows of data.
The value for HEADER can be 'TRUE' or 'FALSE'. The default is 'TRUE'.
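For example, a minimal sketch that reads tab-delimited LATIN text files that have no heading row; the location is a placeholder:

SELECT * FROM READ_NOS (
USING
LOCATION('/S3/YOUR-BUCKET.s3.amazonaws.com/csv/US-Crimes/csv-files/')
RETURNTYPE('NOSREAD_RECORD')
ROWFORMAT('{"field_delimiter":"\t", "record_delimiter":"\n", "character_set":"LATIN"}')
HEADER('FALSE')
) AS d;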