Data in JSON or CSV Format

Teradata® Package for R User Guide

Product: Teradata Package for R
Release Number: 17.00
Published: July 2021
Language: English (United States)
Last Update: 2023-08-08
Product Category: Teradata Vantage
A foreign table created on data in JSON or comma-separated values (CSV) format usually contains two columns:
  • Location
  • Payload

You can create an object of class "tbl_teradata" on a foreign table using "tbl()", just as you would on a regular table. With the created tbl_teradata object, you can access and process the data using the various APIs offered by tdplyr and other R packages that work with tbl objects.

The following sections show how to create a tdplyr tbl_teradata object on NOS foreign tables.
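
The examples assume an existing connection object "con". A minimal sketch of establishing one through the DBI interface exposed by tdplyr; the host name and credentials below are placeholders to replace with your own Vantage system details:

# Load packages and connect to Vantage; host and credentials are placeholders
> library(tdplyr)
> library(dplyr)
> con <- DBI::dbConnect(tdplyr::NativeDriver(), host = "<host>", uid = "<user>", pwd = "<password>")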

Create a tdplyr tbl_teradata object on a NOS foreign table in JSON data format

Assume that the following foreign table has been created on JSON data in an Amazon S3 bucket:
CREATE MULTISET FOREIGN TABLE ALICE.riverflow ,FALLBACK ,
     EXTERNAL SECURITY DEFINER TRUSTED DOCDEF_AUTH ,
     MAP = TD_MAP1
     (
      Location VARCHAR(2048) CHARACTER SET UNICODE CASESPECIFIC,
      Payload JSON(8388096) INLINE LENGTH 32000 CHARACTER SET UNICODE)
USING
(
      LOCATION  ('/s3/s3.amazonaws.com/td-usgs-compressed/DATA/')
      PATHPATTERN  ('$data/$siteno/$year/$month/$day')
);
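
The DDL above is typically run from a SQL client, but it can also be submitted through the same DBI connection from R. A minimal sketch; the statement string is abbreviated here, so pass the full DDL shown above:

# Submit the CREATE FOREIGN TABLE DDL from R (statement abbreviated)
> DBI::dbExecute(con, "CREATE MULTISET FOREIGN TABLE ALICE.riverflow ... ;")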
  • Create an object of class "tbl_teradata" on the foreign table.
    # Create an object of class "tbl_teradata" on the foreign table that contains JSON data on S3
    > riverflow <- tbl(con, "riverflow")
  • Display the structure of the tbl_teradata.
    # Get the structure of tbl_teradata
    > str(riverflow)
    List of 3
     $ src  :List of 2
      ..$ con  :Formal class 'Teradata' [package ".GlobalEnv"] with 6 slots
      .. .. ..@ m_uLog       :integer64 0 
      .. .. ..@ m_bImmediate : logi TRUE
      .. .. ..@ m_bTraceLog  : logi FALSE
      .. .. ..@ m_bDebugLog  : logi FALSE
      .. .. ..@ m_uConnHandle:integer64 1 
      .. .. ..@ m_envMutable :<environment: 0x00000000205adf18> 
      ..$ disco: NULL
      ..- attr(*, "class")= chr [1:4] "src_Teradata" "src_dbi" "src_sql" "src"
     $ ops  :List of 2
      ..$ x   : 'ident' chr "riverflow"
      ..$ vars: chr [1:2] "Location" "Payload"
      ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
     $ types: Named chr [1:2] "varchar" "json"
      ..- attr(*, "names")= chr [1:2] "CV" "JN"
     - attr(*, "class")= chr [1:5] "tbl_teradata" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
  • Display the columns in the tbl_teradata.
    # As seen from the CREATE TABLE statement, the table has two columns: 'Location' and 'Payload'
    > colnames(riverflow)
    'Location' 'Payload'
  • Check the data types of the columns.
    # Data type check
    > riverflow$types
    CV          JN
    'varchar'   'json'
  • Print the content of the table.
    # Print the tbl_teradata.
    # Equivalent SQL: 
    #       SELECT TOP 10 * FROM riverflow
    > as.data.frame(head(riverflow, 10))
    A data.frame: 10 × 2
    Location	Payload
    <chr>	<chr>
    /S3/s3.amazonaws.com/td-usgs-compressed/DATA/09423560/2018/06/29.gz	{ "site_no":"09423560", "datetime":"2018-06-29 00:30", "Flow":"122", "WaterVelocity":"2.00", "Temp":"71.2", "GageHeight":"8.05"}
    /S3/s3.amazonaws.com/td-usgs-compressed/DATA/09423560/2018/06/29.gz	{ "site_no":"09423560", "datetime":"2018-06-29 01:00", "Flow":"125", "WaterVelocity":"2.01", "Temp":"71.0", "GageHeight":"8.09"}
    /S3/s3.amazonaws.com/td-usgs-compressed/DATA/09423560/2018/06/29.gz	{ "site_no":"09423560", "datetime":"2018-06-29 01:15", "Flow":"128", "WaterVelocity":"2.03", "Temp":"70.8", "GageHeight":"8.14"}
    /S3/s3.amazonaws.com/td-usgs-compressed/DATA/09423560/2018/06/29.gz	{ "site_no":"09423560", "datetime":"2018-06-29 01:30", "Flow":"129", "WaterVelocity":"2.04", "Temp":"70.7", "GageHeight":"8.15"}
    /S3/s3.amazonaws.com/td-usgs-compressed/DATA/09423560/2018/06/29.gz	{ "site_no":"09423560", "datetime":"2018-06-29 02:00", "Flow":"129", "WaterVelocity":"2.02", "Temp":"70.3", "GageHeight":"8.18"}
    /S3/s3.amazonaws.com/td-usgs-compressed/DATA/09423560/2018/06/29.gz	{ "site_no":"09423560", "datetime":"2018-06-29 02:15", "Flow":"128", "WaterVelocity":"1.99", "Temp":"70.2", "GageHeight":"8.20"}
    /S3/s3.amazonaws.com/td-usgs-compressed/DATA/09423560/2018/06/29.gz	{ "site_no":"09423560", "datetime":"2018-06-29 01:45", "Flow":"131", "WaterVelocity":"2.05", "Temp":"70.5", "GageHeight":"8.17"}
    /S3/s3.amazonaws.com/td-usgs-compressed/DATA/09423560/2018/06/29.gz	{ "site_no":"09423560", "datetime":"2018-06-29 00:45", "Flow":"122", "WaterVelocity":"1.98", "Temp":"71.1", "GageHeight":"8.08"}
    /S3/s3.amazonaws.com/td-usgs-compressed/DATA/09423560/2018/06/29.gz	{ "site_no":"09423560", "datetime":"2018-06-29 00:15", "Flow":"119", "WaterVelocity":"1.97", "Temp":"71.3", "GageHeight":"8.01"}
    /S3/s3.amazonaws.com/td-usgs-compressed/DATA/09423560/2018/06/29.gz	{ "site_no":"09423560", "datetime":"2018-06-29 00:00", "Flow":"115", "WaterVelocity":"1.92", "Temp":"71.4", "GageHeight":"7.98"}
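
Because the tbl_teradata is lazy, dplyr verbs applied to it are translated to SQL and executed in Vantage rather than in R. For example, a small sketch that counts the payload rows contributed by each source object (assuming dplyr is loaded):

# Count rows per source object; the aggregation runs in Vantage
> as.data.frame(head(count(riverflow, Location), 5))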
    

Create a tdplyr tbl_teradata object on a NOS foreign table in CSV data format

Assume that the following foreign table has been created on CSV data in an Amazon S3 bucket:
CREATE FOREIGN TABLE riverflowcsv
, EXTERNAL SECURITY DEFINER TRUSTED DOCDEF_AUTH
(
  Location VARCHAR(2048) CHARACTER SET UNICODE CASESPECIFIC,
  PAYLOAD DATASET INLINE LENGTH 64000 STORAGE FORMAT CSV)
USING
(
  LOCATION('/s3/s3.amazonaws.com/td-usgs-us-east-1/CSVDATA/')
  PATHPATTERN  ('$data/$siteno/$year/$month/$day')
);
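
Before wrapping the table, you can optionally confirm that it is visible through the connection using standard DBI; a minimal check, assuming the same connection object "con" as above:

# Optional: confirm the foreign table is visible to the current connection
> DBI::dbExistsTable(con, "riverflowcsv")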
  • Create an object of class "tbl_teradata" on the foreign table.
    # Create an object of class "tbl_teradata" on the foreign table that contains CSV data on S3
    > riverflow <- tbl(con, "riverflowcsv")
  • Display the structure of the tbl_teradata.
    # Get the structure of tbl_teradata
    > str(riverflow)
    List of 3
     $ src  :List of 2
      ..$ con  :Formal class 'Teradata' [package ".GlobalEnv"] with 6 slots
      .. .. ..@ m_uLog       :integer64 0 
      .. .. ..@ m_bImmediate : logi TRUE
      .. .. ..@ m_bTraceLog  : logi FALSE
      .. .. ..@ m_bDebugLog  : logi FALSE
      .. .. ..@ m_uConnHandle:integer64 1 
      .. .. ..@ m_envMutable :<environment: 0x00000000205ae8f0> 
      ..$ disco: NULL
      ..- attr(*, "class")= chr [1:4] "src_Teradata" "src_dbi" "src_sql" "src"
     $ ops  :List of 2
      ..$ x   : 'ident' chr "riverflowcsv"
      ..$ vars: chr [1:2] "Location" "PAYLOAD"
      ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
     $ types: Named chr [1:2] "varchar" "dataset"
      ..- attr(*, "names")= chr [1:2] "CV" "DT"
     - attr(*, "class")= chr [1:5] "tbl_teradata" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
  • Display the columns in the tbl_teradata.
    # As seen from the CREATE TABLE statement, the table has two columns: 'Location' and 'PAYLOAD'
    > colnames(riverflow)
    'Location' 'PAYLOAD'
  • Check the data types of the columns.
    # Data type check
    > riverflow$types
    CV          DT
    'varchar'   'dataset'
  • Print the content of the table.
    # Print the tbl_teradata.
    # Equivalent SQL: 
    #      SELECT TOP 10 * FROM riverflowcsv
    > as.data.frame(head(riverflow, 10))
    A data.frame: 10 × 2
    Location	PAYLOAD
    <chr>	<chr>
    /S3/s3.amazonaws.com/td-usgs-us-east-1/CSVDATA/09396100/2018/07/14.csv	GageHeight2,Flow,site_no,datetime,Precipitation,GageHeight 2.05,186,09396100,2018-07-14 00:15,0.00,2.05
    /S3/s3.amazonaws.com/td-usgs-us-east-1/CSVDATA/09396100/2018/07/14.csv	GageHeight2,Flow,site_no,datetime,Precipitation,GageHeight 3.58,1320,09396100,2018-07-14 00:30,0.00,3.58
    /S3/s3.amazonaws.com/td-usgs-us-east-1/CSVDATA/09396100/2018/07/14.csv	GageHeight2,Flow,site_no,datetime,Precipitation,GageHeight 4.20,1940,09396100,2018-07-14 00:44,,4.20
    /S3/s3.amazonaws.com/td-usgs-us-east-1/CSVDATA/09396100/2018/07/14.csv	GageHeight2,Flow,site_no,datetime,Precipitation,GageHeight 4.20,1940,09396100,2018-07-14 00:45,0.00,4.20
    /S3/s3.amazonaws.com/td-usgs-us-east-1/CSVDATA/09396100/2018/07/14.csv	GageHeight2,Flow,site_no,datetime,Precipitation,GageHeight 4.18,1920,09396100,2018-07-14 01:00,0.00,4.18
    /S3/s3.amazonaws.com/td-usgs-us-east-1/CSVDATA/09396100/2018/07/14.csv	GageHeight2,Flow,site_no,datetime,Precipitation,GageHeight 3.77,1500,09396100,2018-07-14 01:14,,3.77
    /S3/s3.amazonaws.com/td-usgs-us-east-1/CSVDATA/09396100/2018/07/14.csv	GageHeight2,Flow,site_no,datetime,Precipitation,GageHeight 4.18,1920,09396100,2018-07-14 00:59,,4.18
    /S3/s3.amazonaws.com/td-usgs-us-east-1/CSVDATA/09396100/2018/07/14.csv	GageHeight2,Flow,site_no,datetime,Precipitation,GageHeight 3.58,1320,09396100,2018-07-14 00:29,,3.58
    /S3/s3.amazonaws.com/td-usgs-us-east-1/CSVDATA/09396100/2018/07/14.csv	GageHeight2,Flow,site_no,datetime,Precipitation,GageHeight 2.05,186,09396100,2018-07-14 00:14,,2.05
    /S3/s3.amazonaws.com/td-usgs-us-east-1/CSVDATA/09396100/2018/07/14.csv	GageHeight2,Flow,site_no,datetime,Precipitation,GageHeight 2.16,232,09396100,2018-07-14 00:00,0.00,2.16
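
As in the JSON example, dplyr verbs on the CSV-backed tbl_teradata are pushed down to Vantage. For instance, a small sketch listing the distinct objects that back the table:

# List the distinct source objects behind the foreign table
> as.data.frame(riverflow %>% select(Location) %>% distinct() %>% head(5))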
    

How to access actual data and path variables

Though tdplyr provides direct access to the data in foreign tables, the actual data residing in the 'Payload' column and the path variables for the foreign table can be accessed in one of the following ways: