Teradata Package for R Function Reference | 17.20 - ReadNOS - Teradata Package for R - Look here for syntax, methods and examples for the functions included in the Teradata Package for R.

Teradata® Package for R Function Reference

Deployment
VantageCloud
VantageCore
Edition
Enterprise
IntelliFlex
VMware
Product
Teradata Package for R
Release Number
17.20
Published
March 2024
ft:locale
en-US
ft:lastEdition
2024-05-03
dita:id
TeradataR_FxRef_Enterprise_1720
lifecycle
latest
Product Category
Teradata Vantage

ReadNOS

Description

td_read_nos_sqle() function enables access to external files in JSON , CSV, or Parquet format.
User connected to Vanatge must have must have the EXECUTE FUNCTION privilege on TD_SYSFNLIB.READ_NOS.

Usage

  td_read_nos_sqle (
      data = NULL,
      location = NULL,
      authorization = NULL,
      return.type = "NOSREAD_RECORD",
      sample.perc = 1.0,
      stored.as = "TEXTFILE",
      scan.pct = NULL,
      manifest = FALSE,
      table.format = NULL,
      row.format = NULL,
      header = TRUE,
      ...
  )

Arguments

data

Optional Argument.
Specifies the input tbl_teradata.
Types: tbl_teradata

location

Optional Argument.
Specifies the location value, which is a Uniform Resource Identifier (URI) pointing to the data in the external object storage system.
The location value includes the following components:

  • Amazon S3:
    "/connector/bucket.endpoint/[key_prefix]".

  • Azure Blob storage and Azure Data Lake Storage Gen2:
    "/connector/container.endpoint/[key_prefix]".

  • Google Cloud Storage:
    "/connector/endpoint/bucket/[key_prefix]".

The following fields explain each component of location value:

  • connector:
    Identifies the type of external storage system where the data is located.
    Teradata requires the storage location to start with the following for all external storage locations:

    • Amazon S3 storage location must begin with /S3 or /s3

    • Azure Blob storage location (including Azure Data Lake Storage Gen2 in Blob Interop Mode) must begin with /AZ or /az.

    • Google Cloud Storage location must begin with /GS or /gs.

  • storage-account:
    Used by Azure. The Azure storage account contains your Azure storage data objects.

  • endpoint:
    A URL that identifies the system-specific entry point for the external object storage system.

  • bucket (Amazon S3, Google Cloud Storage) or container (Azure Blob storage and Azure Data Lake Storage Gen2):
    A container that logically groups stored objects in the external storage system.

  • key_prefix:
    Identifies one or more objects in the logical organization of the bucket data. Because it is a key prefix, not an actual directory path, the key prefix may match one or more objects in the external storage. For example, the key prefix "/fabrics/cotton/colors/b/" would match objects "/fabrics/cotton/colors/blue", "/fabrics/cotton/colors/brown", and "/fabrics/cotton/colors/black". If there are organization levels below those, such as "/fabrics/cotton/colors/blue/shirts", the same key prefix would gather those objects too.
    Note:

    • Vantage validates only the first file it encounters from the location key prefix.

  • For example, this location value might specify all objects on an Amazon cloud storage system for the month of December, 2001:
    location = "/S3/YOUR-BUCKET.s3.amazonaws.com/csv/US-Crimes/csv-files/2001/Dec/"

    • connector: S3

    • bucket: YOUR-BUCKET

    • endpoint: s3.amazonaws.com

    • key_prefix: csv/US-Crimes/csv-files/2001/Dec/

  • Following location could specify an individual storage object (or file), Day1.csv:
    location = "/S3/YOUR-BUCKET.s3.amazonaws.com/csv/US-Crimes/csv-files/2001/Dec/Day1.csv"

    • connector: S3

    • bucket: YOUR-BUCKET

    • endpoint: s3.amazonaws.com

    • key_prefix: csv/US-Crimes/csv-files/2001/Dec/Day1.csv

  • Following location specifies an entire container in an Azure external object store (Azure Blob storage or Azure Data Lake Storage Gen2).
    The container may contain multiple file objects:
    location = "/AZ/YOUR-STORAGE-ACCOUNT.blob.core.windows.net/nos-csv-data"

    • connector: AZ

    • bucket: YOUR-STORAGE-ACCOUNT

    • endpoint: blob.core.windows.net

    • key_prefix: nos-csv-data

  • This is an example of a Google Cloud Storage location:
    location = "/gs/storage.googleapis.com/YOUR-BUCKET/CSVDATA/RIVERS/rivers.csv"

    • connector: GS

    • bucket: YOUR-BUCKET

    • endpoint: storage.googleapis.com,

    • key_prefix: CSVDATA/RIVERS/rivers.csv

Types: character

authorization

Optional Argument.
Specifies the authorization for accessing the external storage.
Following are the options for using "authorization":

  • Passing access id and key as dictionary: authorization = list('Access_ID'= 'access_id', 'Access_Key'= 'secret_key')

  • Passing access id and key as string in JSON format: authorization = '{"Access_ID":"access_id", "Access_Key":"secret_key"}'

  • Using Authorization Object: On any platform, authorization object can be specified as authorization = "[DatabaseName.]AuthorizationObjectName"
    EXECUTE privilege on AuthorizationObjectName is needed.

Below table specifies the USER (identification) and PASSWORD (secret_key) for accessing external storage. The following table shows the supported credentials for USER and PASSWORD (used in the CREATE AUTHORIZATION SQL command):

-------------------------------------------------------------------
System/Scheme USER PASSWORD
-------------------------------------------------------------------
AWS Access Key ID Access Key Secret
Azure / Shared Key Storage Account Name Storage Account Key
Azure Shared Access Signature (SAS) Storage Account Name Account SAS Token
Google Cloud (S3 interop mode) Access Key ID Access Key Secret |
Google Cloud (native) Client Email Private Key
Public access object stores <empty string> Enclose the empty string in single straight quotes: USER '' <empty string> Enclose the empty string in single straight quotes: PASSWORD ''
---------------------------------------------------------------------------

When accessing GCS, Analytics Database uses either the S3-compatible connector or the native Google connector, depending on the user credentials.
Notes:

  • If using AWS IAM credentials, "authorization" can be omitted.

  • If S3 user account requires the use of physical or virtual security, session token can be used with Access_ID and Access_Key in this syntax:
    authorization = list("Access_ID" = "access_id", "Access_Key" = "secret_key", "Session_Token" = "session_token")
    In which, the session token can be obtained using the AWS CLI. For example: "aws sts get-session-token"

Types: character or list

return.type

Optional Argument.
Specifies the format in which data is returned.
Permitted Values:

  • NOSREAD_RECORD: Returns one row for each external record along with its metadata.
    Access external records by specifying one of the following:

    • Input tbl_teradata, location, and tbl_teradata on an empty table. For CSV, you can include a schema definition.

    • Input tbl_teradata with a row for each external file. For CSV, this method does not support a schema definition.

    For an empty single-column input table, do the following:

    • Define an input tbl_teradata with a single column, Payload, with the appropriate data type:
      JSON and CSV

    • This column determines the output Payload column return type.
      Specify the filepath in the "location" argument.

    For a multiple-column input table, define a tbl_teradata with the following columns:

    -------------------------------------------------------
    Column Name Data Types
    -------------------------------------------------------
    Location VARCHAR(2048) CHARACTER SET UNICODE
    ObjectVersionID VARCHAR(1024) CHARACTER SET UNICODE
    OffsetIntoObject BIGINT
    ObjectLength BIGINT
    Payload JSON
    CSV VARCHAR
    -------------------------------------------------------

    This tbl_teradata can be populated using the output of the 'NOSREAD_KEYS' return type.

  • NOSREAD_KEYS: Retrieve the list of files from the path specified in the "location" argument.
    A schema definition is not necessary.
    'NOSREAD_KEYS' returns Location, ObjectVersionID, ObjectTimeStamp, and ObjectLength (size of external file).

  • NOSREAD_PARQUET_SCHEMA: Returns information about the Parquet data schema.
    For information about the mapping between Parquet data types and Teradata data types, see Parquet External Files in Teradata Vantage - SQL Data Definition Language Syntax and Examples.

  • NOSREAD_SCHEMA: Returns the name and data type of each column of the file specified in the "location" argument. Schema format can be JSON, CSV, or Parquet.

Default Value: "NOSREAD_RECORD"
Types: character

sample.perc

Optional Argument.
Specifies the percentage of rows to retrieve from the external storage repository when "return.type" is 'NOSREAD_RECORD'. The valid range of values is from 0.0 to 1.0, where 1.0 represents 100% of the rows.
Default Value: 1.0
Types: float

stored.as

Optional Argument.
Specifies the formatting style of the external data.
Permitted Values:

  • PARQUET- The external data is formatted as Parquet. This is a required parameter for Parquet data.

  • TEXTFILE- The external data uses a text-based format, such as CSV or JSON.

Default Value: "TEXTFILE"
Types: character

scan.pct

Optional Argument.
Specifies the percentage of "data" to be scanned to discover the schema.
Value must be greater than or equal to 0 and less than or equal to 1.

  • For CSV and JSON files,

    -------------------------------------------------------------------------------------------
    scan.pct Scanned
    ---------------------------------------------------------------------------------------------
    0 (default) First 100 MB of data set (file by file if there are multiple files).
    Between 0 and 1 First, first 100 MB of data set (file by file if there are multiple files).Then, scan_pct * 100% of first 1000 remaining files.
    1 First 1000 files of data set.
    -------------------------------------------------------------------------------------------
  • For Parquet files, when "scan.pct" is set to:

    -------------------------------------------------------------------------------
    scan_pct Scanned
    -------------------------------------------------------------------------------
    Less than 1 First 100 MB of data set (file by file if there are multiple files).
    1 First 16 MB of each file, up to 100 MB of data set.
    ----------------------------------------------------------------------------------

Types: float

manifest

Optional Argument.
Specifies whether the location value points to a manifest file (a file containing a list of files to read) or object name. The object name can include the full path or a partial path. It must identify a single file containing the manifest.
Note:

  • Individual entries within the manifest file must show complete paths.

Below is an example of a manifest file that contains a list of locations in JSON format.
{ "entries": [ {"url":"s3://nos-core-us-east-1/UNICODE/JSON/mln-key/data-10/data-8_9_02-10.json"}, {"url":"s3://nos-core-us-east-1/UNICODE/JSON/mln-key/data-10/data-8_9_02-101.json"}, {"url":"s3://nos-core-us-east-1/UNICODE/JSON/mln-key/data-10/data-10-01/data-8_9_02-102.json"}, {"url":"s3://nos-core-us-east-1/UNICODE/JSON/mln-key/data-10/data-10-01/data-8_9_02-103.json"} ] }
Default Value: FALSE
Types: logical

table.format

Optional Argument.
Specifies the format of the tables specified in manifest file.
Note:

  • "manifest" must be set to TRUE.

  • "location": value must include "_symlink_format_manifest".
    For example:
    "/S3/YOUR-BUCKET.s3.amazonaws.com/testdeltalake/deltalakewp/_symlink_format_manifest"
    "/S3/YOUR-BUCKET.s3.amazonaws.com/testdeltalake/deltalakewp/_symlink_format_manifest/zip=95661/manifest"

Types: character

row.format

Optional Argument.
Specifies the encoding format of the external row.

  • For example:
    row.format = list('field_delimiter'= ',', 'record_delimiter'= '\n', 'character_set'= 'LATIN')

  • If string value is used, JSON format must be used to specify the row format.
    For example:
    row.format = '{"field_delimiter": ",", "record_delimiter": "\n", "character_set": "LATIN"}'

Format can include only the three keys shown above. Key names and values are case-specific, except for the value for "character_set", which can use any combination of letter cases.
The character set specified in "row.format" must be compatible with the character set of the Payload column.
Do not specify "row.format" for Parquet format data.
For a JSON column, these are the default values:

  • UNICODE: row.format = {"record_delimiter":"\n", "character_set":"UTF8"}

  • LATIN: row.format = {"record_delimiter":"\n", "character_set":"LATIN"}

For a CSV column, these are the default values:

  • UNICODE: row.format = '{"character_set":"UTF8"}'

  • LATIN: row.format = '{"character_set":"LATIN"}'

User can specify the following options:

  • field_delimiter: The default is "," (comma). User can also specify a custom field delimiter, such as tab "\t".

  • record_delimiter: New line feed character: "\n". A line feed is the only acceptable record delimiter.

  • character_set: "UTF8" or "LATIN". If you do not specify a "row.format" or payload column, Vantage assumes UTF8 Unicode.

Types: character or list

header

Optional Argument.
Specifies whether the first row of data in an input CSV file is interpreted as column headings for the subsequent rows of data. Use this parameter only when a CSV input file is not associated with a separate schema object that defines columns for the CSV data.
Default Value: TRUE
Types: logical

...

Specifies the generic keyword arguments SQLE functions accept.
Below are the generic keyword arguments:

persist:
Optional Argument.
Specifies whether to persist the results of the function in table or not.
When set to TRUE, results are persisted in table; otherwise, results are garbage collected at the end of the session.
Default Value: FALSE
Types: logical

volatile:
Optional Argument.
Specifies whether to put the results of the function in volatile table or not.
When set to TRUE, results are stored in volatile table, otherwise not.
Default Value: FALSE
Types: logical

Function allows the user to partition, hash, order or local order the input data. These generic arguments are available for each argument that accepts tbl_teradata as input and can be accessed as:

  • "<input.data.arg.name>.partition.column" accepts character or vector of character (Strings)

  • "<input.data.arg.name>.hash.column" accepts character or vector of character (Strings)

  • "<input.data.arg.name>.order.column" accepts character or vector of character (Strings)

  • "local.order.<input.data.arg.name>" accepts logical

Note:

  • These generic arguments are supported by tdplyr if the underlying Analytic Database function supports, else an exception is raised.

Value

Function returns an object of class "td_read_nos_sqle" which is a named list containing object of class "tbl_teradata".
Named list member(s) can be referenced directly with the "$" operator using the name(s):result

Examples

  
    
    # Get the current context/connection.
    con <- td_get_context()$connection
    
    # Check the list of available analytic functions.
    display_analytic_functions()
    
    # Example 1: Reading PARQUET file from AWS S3 location.
    obj <-  td_read_nos_sqle(
              location="/S3/s3.amazonaws.com/Your-Bucket/location/",
              authorization=list("Access_ID"= "YOUR-ID", "Access_Key"= "YOUR-KEY"),
              stored.as="PARQUET")
    
    # print the result.
    print(obj$result)
    
    # Example 2: Read PARQUET file in external storage with one row for each external 
    #            record along with its metadata.
    obj <- td_read_nos_sqle(
            location="/S3/s3.amazonaws.com/Your-Bucket/location/",
            authorization=list("Access_ID"= "YOUR-ID", "Access_Key"= "YOUR-KEY"),
            return.type="NOSREAD_KEYS")
    
    # print the result.
    print(obj$result)
    
    # Example 3: Read CSV file from external storage.
    obj <- td_read_nos_sqle(
            location="/S3/s3.amazonaws.com/Your-Bucket/csv-location/",
            authorization=list("Access_ID"= "YOUR-ID", "Access_Key"= "YOUR-KEY"),
            stored.as="TEXTFILE")
    
    # print the result.
    print(obj$result)
    
    # Example 4: Read CSV file in external storage using "data" argument.
    # Create a table to store the data.
    query <- "CREATE TABLE read_nos_support_tab 
              (payload dataset storage format csv) NO PRIMARY INDEX;"
    dbExecute(con, query)
    
    read_nos_support_tab = tbl(con, "read_nos_support_tab")
    
    # Read the CSV data using "data" argument.
    obj <- td_read_nos_sqle(
            data=read_nos_support_tab,
            location="/S3/s3.amazonaws.com/Your-Bucket/csv-location/",
            authorization=list("Access_ID"= "YOUR-ID", "Access_Key"= "YOUR-KEY"),
            row.format=list("field_delimiter"= ",",
                            "record_delimiter"= "\n",
                            "character_set"= "LATIN"),
            stored.as="TEXTFILE")
    
    # print the result.
    print(obj$result)
    
    # Note: 
    #   Before proceeding, verify with your database administrator that you  
    #   have the correct privileges, an authorization object, 
    #   and a function mapping(for READ_NOS function).
    
    # If function mapping for READ_NOS Analytic database function is created 
    # as 'READ_NOS_FM' and location and authorization object are mapped,
    # then set function mapping with tdplyr options as below.
    
    # Example 5: Setting function mapping using configuration options.
    options(read.nos.function.mapping = "READ_NOS_FM")
    obj <-  td_read_nos_sqle(
              data=read_nos_support_tab,
              row.format=list("field_delimiter"= ",",
                              "record_delimiter"= "\n",
                              "character_set"= "LATIN"),
              stored.as="TEXTFILE")
    
    # print the result.
    print(obj$result)