Accessing Foreign Table Created On Parquet Data | NOS | tdplyr - Accessing Foreign Table Created On Parquet Data

Assume that the following foreign table has been created on Parquet data in Amazon S3 bucket:

CREATE FOREIGN TABLE riverflow_parquet
, EXTERNAL SECURITY DEFINER TRUSTED DOCDEF_AUTH
(
  Location VARCHAR(2048) CHARACTER SET UNICODE CASESPECIFIC
  , GageHeight2 DOUBLE PRECISION FORMAT '-ZZZ9.99'
  , Flow DOUBLE PRECISION FORMAT '-ZZZZ9.99'
  , site_no BIGINT
  , datetime VARCHAR(16) CHARACTER SET UNICODE CASESPECIFIC
  , Precipitation DOUBLE PRECISION FORMAT '-ZZZ9.99'
  , GageHeight DOUBLE PRECISION FORMAT '-ZZZ9.99'
)
USING (
    LOCATION ('/s3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/')
    STOREDAS ('PARQUET')
) NO PRIMARY INDEX
, PARTITION BY COLUMN ;

The following examples show how to create a tdplyr tbl_teradata on this table and process this tbl_teradata with other tdplyr API's.

Create an object of class "tbl_teradta"on the foreign table

# Create object(s) of class "tbl_teradata" on foreign table which contains data in PARQUET format on S3
> riverflow <- tbl(con, "riverflow_parquet")

Check Properties of the tbl_teradata

Display the structure of the tbl_teradata.

# Get the structure of tbl_teradata
> str(riverflow)

List of 3
 $ src  :List of 2
  ..$ con  :Formal class 'Teradata' [package ".GlobalEnv"] with 6 slots
  .. .. ..@ m_uLog       :integer64 0 
  .. .. ..@ m_bImmediate : logi TRUE
  .. .. ..@ m_bTraceLog  : logi FALSE
  .. .. ..@ m_bDebugLog  : logi FALSE
  .. .. ..@ m_uConnHandle:integer64 1 
  .. .. ..@ m_envMutable :<environment: 0x00000000205ad8c0> 
  ..$ disco: NULL
  ..- attr(*, "class")= chr [1:4] "src_Teradata" "src_dbi" "src_sql" "src"
 $ ops  :List of 2
  ..$ x   : 'ident' chr "riverflow_parquet"
  ..$ vars: chr [1:7] "Location" "GageHeight2" "Flow" "site_no" ...
  ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
 $ types: Named chr [1:7] "varchar" "float" "float" "bigint" ...
  ..- attr(*, "names")= chr [1:7] "CV" "F" "F" "I8" ...
 - attr(*, "class")= chr [1:5] "tbl_teradata" "tbl_dbi" "tbl_sql" "tbl_lazy" ...

Display the columns in the tbl_teradata.

# As seen from create table statement, table has two columns 'Location' and 'PAYLOAD'
> colnames(riverflow)

'Location''GageHeight2''Flow''site_no''datetime''Precipitation''GageHeight'

Check the types of the columns.

# Data type check
> riverflow$types

CV
'varchar'
F
'float'
F
'float'
I8
'bigint'
CV
'varchar'
F
'float'
F
'float'

Print the tbl_teradata.

# Print the tbl_teradata.
# Equivalent SQL: 
#      SELECT TOP 10 * FROM riverflow_parquet
> as.data.frame(head(riverflow, 10))

A data.frame: 10 × 7
Location                                                                           GageHeight2 Flow  site_no        datetime	Precipitation GageHeight
<chr>	<dbl>	<dbl>	<int64>	<chr>	<dbl>	<dbl>
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/15.parquet	1.96	153	9396100	2018-07-15 00:30	0	   1.96
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/15.parquet	1.95	150	9396100	2018-07-15 01:00	0   	1.95
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/15.parquet	1.92	140	9396100	2018-07-15 01:15	0	   1.92
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/15.parquet	1.92	140	9396100	2018-07-15 01:30	0   	1.92
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/15.parquet	1.88	127	9396100	2018-07-15 02:00	0	   1.88
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/15.parquet	1.87	124	9396100	2018-07-15 02:15	0   	1.87
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/15.parquet	1.90	133	9396100	2018-07-15 01:45	0   	1.90
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/15.parquet	1.93	143	9396100	2018-07-15 00:45	0	   1.93
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/15.parquet	1.96	153	9396100	2018-07-15 00:15	0   	1.96
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/15.parquet	1.98	160	9396100	2018-07-15 00:00	0   	1.98

Explore Data

Select some columns from the tbl_teradata.

> as.data.frame(head(riverflow %>% select(GageHeight2, Flow, site_no, datetime, Precipitation, GageHeight), 10))

A data.frame: 10 × 6
GageHeight2 Flow	site_no	datetime	Precipitation GageHeight
<dbl>	<dbl>	<int64>	<chr>	<dbl>	<dbl>
1.37	44.5	9396100	2018-07-13 00:30	0	1.50
1.32	39.4	9396100	2018-07-13 01:00	0	1.46
1.31	44.5	9396100	2018-07-13 01:15	0	1.50
1.29	42.8	9396100	2018-07-13 01:30	0	1.49
1.24	26.6	9396100	2018-07-13 02:00	0	1.35
1.23	34.8	9396100	2018-07-13 02:15	0	1.43
1.30	34.8	9396100	2018-07-13 01:45	0	1.43
1.34	34.8	9396100	2018-07-13 00:45	0	1.43
1.36	46.3	9396100	2018-07-13 00:15	0	1.51
1.37	53.9	9396100	2018-07-13 00:00	0	1.56

Filter the data where Flow is equal to 44.5.

> as.data.frame(riverflow %>% filter(Flow == 44.5))

A data.frame: 9 × 7
Location                                                                           GageHeight2 Flow	site_no	datetime	Precipitation GageHeight
<chr>	<dbl>	<dbl>	<int64>	<chr>	<dbl>	<dbl>
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/20.parquet	1.58	44.5	9396100	2018-07-20 16:45	0	   1.5
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/13.parquet	1.37	44.5	9396100	2018-07-13 00:30	0	   1.5
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/19.parquet	1.46	44.5	9396100	2018-07-19 22:00	0   	1.5
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/13.parquet	1.31	44.5	9396100	2018-07-13 01:15	0   	1.5
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/23.parquet	1.23	44.5	9396100	2018-07-23 05:15	0   	1.5
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/21.parquet	1.24	44.5	9396100	2018-07-21 14:45	0   	1.5
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/23.parquet	1.23	44.5	9396100	2018-07-23 06:45	0   	1.5
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/21.parquet	1.24	44.5	9396100	2018-07-21 15:00	0   	1.5
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/21.parquet	1.24	44.5	9396100	2018-07-21 15:30	0   	1.5

Sample the data to get only 5 rows.

> as.data.frame(td_sample(df = riverflow, n = 5))

A data.frame: 5 × 8
Location	GageHeight2	Flow	site_no	datetime	Precipitation	GageHeight	sampleid
<chr>	<dbl>	<dbl>	<int64>	<chr>	<dbl>	<dbl>	<int>
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09474000/2018/07/24.parquet	6.65	0.55	9474000	2018-07-24 05:00	0	3.53	1
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09400568/2018/07/08.parquet	5.82	0.00	9400568	2018-07-08 11:30	0	6.51	1
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09513780/2018/07/20.parquet	-0.98	0.00	9513780	2018-07-20 04:15	0	-1.44	1
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09400568/2018/07/07.parquet	5.82	0.00	9400568	2018-07-07 12:30	0	6.51	1
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09513780/2018/07/14.parquet	-0.98	0.00	9513780	2018-07-14 18:00	0	-1.44	1

Process the data with aggregate functions.

# Calculate average of all the values in the columns "Flow", "Gageheight2", "site_no", "GageHeight".
> as.data.frame(riverflow %>% summarise(mean_Flow = mean(Flow), mean_GageHeight2 = mean(GageHeight2), mean_site_no = mean(site_no), mean_GageHeight = mean(GageHeight)))

A data.frame: 1 × 4
mean_Flow	mean_GageHeight2	mean_site_no	mean_GageHeight
<dbl>	<dbl>	<dbl>	<dbl>
23.66244	3.367212	9443942	2.750457

# Calculate average of all the values in the columns "Flow", "Gageheight2", grouped by the column "site_no".
> as.data.frame((riverflow %>% group_by(site_no)) %>% summarise(mean_Flow = mean(Flow), mean_GageHeight2 = mean(GageHeight2)))

A data.frame: 6 × 3
site_no	mean_Flow	mean_GageHeight2
<int64>	<dbl>	<dbl>
9474000	3.259117	6.6917280
9396100	54.192747	1.3378045
9513780	0.000000	-0.9813805
9400568	16.829861	6.1807917
9394500	24.844732	5.0242497
9497500	62.423138	1.4001197

Accessing Foreign Table Created On Parquet Data | NOS | tdplyr - Accessing Foreign Table Created On Parquet Data - Teradata Package for R

Teradata® Package for R User Guide

Create an object of class "tbl_teradta"on the foreign table

Check Properties of the tbl_teradata

Explore Data