Accessing Foreign Table Created On Parquet Data

Teradata® R Package User Guide

prodname: Teradata R Package
vrm_release: 17.00
created_date: November 2020
category: User Guide
featnum: B700-4005-090K
Assume that the following foreign table has been created on Parquet data in an Amazon S3 bucket:
CREATE FOREIGN TABLE riverflow_parquet
, EXTERNAL SECURITY DEFINER TRUSTED DOCDEF_AUTH
(
  Location VARCHAR(2048) CHARACTER SET UNICODE CASESPECIFIC
  , GageHeight2 DOUBLE PRECISION FORMAT '-ZZZ9.99'
  , Flow DOUBLE PRECISION FORMAT '-ZZZZ9.99'
  , site_no BIGINT
  , datetime VARCHAR(16) CHARACTER SET UNICODE CASESPECIFIC
  , Precipitation DOUBLE PRECISION FORMAT '-ZZZ9.99'
  , GageHeight DOUBLE PRECISION FORMAT '-ZZZ9.99'
)
USING (
    LOCATION ('/s3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/')
    STOREDAS ('PARQUET')
) NO PRIMARY INDEX
, PARTITION BY COLUMN ;

The following examples show how to create a tdplyr tbl_teradata on this table and process it with other tdplyr APIs.
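The examples assume an active connection object `con`. A minimal sketch of establishing one with tdplyr is shown below; the host name and credentials are placeholders that must be replaced with values for your Vantage system.

```r
# Load tdplyr and create a connection context to Teradata Vantage.
# td_create_context() returns a connection object that tbl() can use.
library(tdplyr)
con <- td_create_context(host = "<vantage-host>",
                         uid  = "<username>",
                         pwd  = "<password>",
                         dType = "native")
```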

Create an object of class "tbl_teradata" on the foreign table

# Create an object of class "tbl_teradata" on the foreign table, which contains data in PARQUET format on S3
> riverflow <- tbl(con, "riverflow_parquet")

Check Properties of the tbl_teradata

  • Display the structure of the tbl_teradata.
    # Get the structure of tbl_teradata
    > str(riverflow)
    List of 3
     $ src  :List of 2
      ..$ con  :Formal class 'Teradata' [package ".GlobalEnv"] with 6 slots
      .. .. ..@ m_uLog       :integer64 0 
      .. .. ..@ m_bImmediate : logi TRUE
      .. .. ..@ m_bTraceLog  : logi FALSE
      .. .. ..@ m_bDebugLog  : logi FALSE
      .. .. ..@ m_uConnHandle:integer64 1 
      .. .. ..@ m_envMutable :<environment: 0x00000000205ad8c0> 
      ..$ disco: NULL
      ..- attr(*, "class")= chr [1:4] "src_Teradata" "src_dbi" "src_sql" "src"
     $ ops  :List of 2
      ..$ x   : 'ident' chr "riverflow_parquet"
      ..$ vars: chr [1:7] "Location" "GageHeight2" "Flow" "site_no" ...
      ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
     $ types: Named chr [1:7] "varchar" "float" "float" "bigint" ...
      ..- attr(*, "names")= chr [1:7] "CV" "F" "F" "I8" ...
     - attr(*, "class")= chr [1:5] "tbl_teradata" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
  • Display the columns in the tbl_teradata.
    # As seen from the CREATE TABLE statement, the table has seven columns
    > colnames(riverflow)
    'Location' 'GageHeight2' 'Flow' 'site_no' 'datetime' 'Precipitation' 'GageHeight'
  • Check the types of the columns.
    # Data type check
    > riverflow$types
           CV          F          F         I8         CV          F          F
    'varchar'    'float'    'float'   'bigint'  'varchar'    'float'    'float'
  • Print the tbl_teradata.
    # Print the tbl_teradata.
    # Equivalent SQL: 
    #      SELECT TOP 10 * FROM riverflow_parquet
    > as.data.frame(head(riverflow, 10))
    A data.frame: 10 × 7
    Location	GageHeight2	Flow	site_no	datetime	Precipitation	GageHeight
    <chr>	<dbl>	<dbl>	<int64>	<chr>	<dbl>	<dbl>
    /S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/15.parquet	1.96	153	9396100	2018-07-15 00:30	0	1.96
    /S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/15.parquet	1.95	150	9396100	2018-07-15 01:00	0	1.95
    /S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/15.parquet	1.92	140	9396100	2018-07-15 01:15	0	1.92
    /S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/15.parquet	1.92	140	9396100	2018-07-15 01:30	0	1.92
    /S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/15.parquet	1.88	127	9396100	2018-07-15 02:00	0	1.88
    /S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/15.parquet	1.87	124	9396100	2018-07-15 02:15	0	1.87
    /S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/15.parquet	1.90	133	9396100	2018-07-15 01:45	0	1.90
    /S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/15.parquet	1.93	143	9396100	2018-07-15 00:45	0	1.93
    /S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/15.parquet	1.96	153	9396100	2018-07-15 00:15	0	1.96
    /S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/15.parquet	1.98	160	9396100	2018-07-15 00:00	0	1.98
    
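Because a tbl_teradata is a lazy table, a pipeline's SQL can be inspected before any rows are fetched. A short sketch using dplyr's show_query(), which prints the SQL that would be pushed to Vantage:

```r
# Print the SQL generated for a pipeline without executing it on Vantage
riverflow %>% head(10) %>% show_query()
```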

Explore Data

  • Select some columns from the tbl_teradata.
    > as.data.frame(head(riverflow %>% select(GageHeight2, Flow, site_no, datetime, Precipitation, GageHeight), 10))
    A data.frame: 10 × 6
    GageHeight2 Flow	site_no	datetime	Precipitation GageHeight
    <dbl>	<dbl>	<int64>	<chr>	<dbl>	<dbl>
    1.37	44.5	9396100	2018-07-13 00:30	0	1.50
    1.32	39.4	9396100	2018-07-13 01:00	0	1.46
    1.31	44.5	9396100	2018-07-13 01:15	0	1.50
    1.29	42.8	9396100	2018-07-13 01:30	0	1.49
    1.24	26.6	9396100	2018-07-13 02:00	0	1.35
    1.23	34.8	9396100	2018-07-13 02:15	0	1.43
    1.30	34.8	9396100	2018-07-13 01:45	0	1.43
    1.34	34.8	9396100	2018-07-13 00:45	0	1.43
    1.36	46.3	9396100	2018-07-13 00:15	0	1.51
    1.37	53.9	9396100	2018-07-13 00:00	0	1.56
    
  • Filter the data where Flow is equal to 44.5.
    > as.data.frame(riverflow %>% filter(Flow == 44.5))
    A data.frame: 9 × 7
    Location	GageHeight2	Flow	site_no	datetime	Precipitation	GageHeight
    <chr>	<dbl>	<dbl>	<int64>	<chr>	<dbl>	<dbl>
    /S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/20.parquet	1.58	44.5	9396100	2018-07-20 16:45	0	1.5
    /S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/13.parquet	1.37	44.5	9396100	2018-07-13 00:30	0	1.5
    /S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/19.parquet	1.46	44.5	9396100	2018-07-19 22:00	0	1.5
    /S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/13.parquet	1.31	44.5	9396100	2018-07-13 01:15	0	1.5
    /S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/23.parquet	1.23	44.5	9396100	2018-07-23 05:15	0	1.5
    /S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/21.parquet	1.24	44.5	9396100	2018-07-21 14:45	0	1.5
    /S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/23.parquet	1.23	44.5	9396100	2018-07-23 06:45	0	1.5
    /S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/21.parquet	1.24	44.5	9396100	2018-07-21 15:00	0	1.5
    /S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/21.parquet	1.24	44.5	9396100	2018-07-21 15:30	0	1.5
    
  • Sample the data to get only 5 rows.
    > as.data.frame(td_sample(df = riverflow, n = 5))
    A data.frame: 5 × 8
    Location	GageHeight2	Flow	site_no	datetime	Precipitation	GageHeight	sampleid
    <chr>	<dbl>	<dbl>	<int64>	<chr>	<dbl>	<dbl>	<int>
    /S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09474000/2018/07/24.parquet	6.65	0.55	9474000	2018-07-24 05:00	0	3.53	1
    /S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09400568/2018/07/08.parquet	5.82	0.00	9400568	2018-07-08 11:30	0	6.51	1
    /S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09513780/2018/07/20.parquet	-0.98	0.00	9513780	2018-07-20 04:15	0	-1.44	1
    /S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09400568/2018/07/07.parquet	5.82	0.00	9400568	2018-07-07 12:30	0	6.51	1
    /S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09513780/2018/07/14.parquet	-0.98	0.00	9513780	2018-07-14 18:00	0	-1.44	1
    
  • Process the data with aggregate functions.
    # Calculate the average of all the values in the columns "Flow", "GageHeight2", "site_no", and "GageHeight".
    > as.data.frame(riverflow %>% summarise(mean_Flow = mean(Flow), mean_GageHeight2 = mean(GageHeight2), mean_site_no = mean(site_no), mean_GageHeight = mean(GageHeight)))
    
    A data.frame: 1 × 4
    mean_Flow	mean_GageHeight2	mean_site_no	mean_GageHeight
    <dbl>	<dbl>	<dbl>	<dbl>
    23.66244	3.367212	9443942	2.750457
    
    # Calculate the average of all the values in the columns "Flow" and "GageHeight2", grouped by the column "site_no".
    > as.data.frame(riverflow %>% group_by(site_no) %>% summarise(mean_Flow = mean(Flow), mean_GageHeight2 = mean(GageHeight2)))
    A data.frame: 6 × 3
    site_no	mean_Flow	mean_GageHeight2
    <int64>	<dbl>	<dbl>
    9474000	3.259117	6.6917280
    9396100	54.192747	1.3378045
    9513780	0.000000	-0.9813805
    9400568	16.829861	6.1807917
    9394500	24.844732	5.0242497
    9497500	62.423138	1.4001197
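
The verbs shown above compose into a single pipeline that is executed entirely in Vantage. A sketch combining filter, group_by, summarise, and arrange (the filter condition here is illustrative, not taken from the examples above):

```r
# Per-site average Flow for rows with nonzero Precipitation,
# sorted descending by the average; computation is pushed to Vantage
# and only the final result is fetched into R.
result <- riverflow %>%
  filter(Precipitation > 0) %>%
  group_by(site_no) %>%
  summarise(mean_Flow = mean(Flow)) %>%
  arrange(desc(mean_Flow))
as.data.frame(result)
```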