Assume that the following foreign table has been created on Parquet data in an Amazon S3 bucket:
CREATE FOREIGN TABLE riverflow_parquet
, EXTERNAL SECURITY DEFINER TRUSTED DOCDEF_AUTH
(
  Location VARCHAR(2048) CHARACTER SET UNICODE CASESPECIFIC,
  GageHeight2 DOUBLE PRECISION FORMAT '-ZZZ9.99',
  Flow DOUBLE PRECISION FORMAT '-ZZZZ9.99',
  site_no BIGINT,
  datetime VARCHAR(16) CHARACTER SET UNICODE CASESPECIFIC,
  Precipitation DOUBLE PRECISION FORMAT '-ZZZ9.99',
  GageHeight DOUBLE PRECISION FORMAT '-ZZZ9.99'
)
USING (
  LOCATION ('/s3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/')
  STOREDAS ('PARQUET')
)
NO PRIMARY INDEX, PARTITION BY COLUMN;
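Once defined, the foreign table can be queried with ordinary SQL; each query reads the Parquet objects under the LOCATION path on demand. A minimal sketch:

```sql
-- Read a few rows directly from the Parquet data in S3.
-- TOP 2 only limits the output; any standard SELECT works here.
SELECT TOP 2 * FROM riverflow_parquet;
```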
The following examples show how to create a tdplyr tbl_teradata on this table and process it with other tdplyr APIs.
Create an object of class "tbl_teradata" on the foreign table
# Create object(s) of class "tbl_teradata" on the foreign table which contains
# data in PARQUET format on S3.
> riverflow <- tbl(con, "riverflow_parquet")
Check Properties of the tbl_teradata
- Display the structure of the tbl_teradata.
# Get the structure of the tbl_teradata.
> str(riverflow)
List of 3
 $ src  :List of 2
  ..$ con  :Formal class 'Teradata' [package ".GlobalEnv"] with 6 slots
  .. .. ..@ m_uLog       :integer64 0
  .. .. ..@ m_bImmediate : logi TRUE
  .. .. ..@ m_bTraceLog  : logi FALSE
  .. .. ..@ m_bDebugLog  : logi FALSE
  .. .. ..@ m_uConnHandle:integer64 1
  .. .. ..@ m_envMutable :<environment: 0x00000000205ad8c0>
  ..$ disco: NULL
  ..- attr(*, "class")= chr [1:4] "src_Teradata" "src_dbi" "src_sql" "src"
 $ ops  :List of 2
  ..$ x   : 'ident' chr "riverflow_parquet"
  ..$ vars: chr [1:7] "Location" "GageHeight2" "Flow" "site_no" ...
  ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
 $ types: Named chr [1:7] "varchar" "float" "float" "bigint" ...
  ..- attr(*, "names")= chr [1:7] "CV" "F" "F" "I8" ...
 - attr(*, "class")= chr [1:5] "tbl_teradata" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
- Display the columns in the tbl_teradata.
# As seen from the CREATE FOREIGN TABLE statement, the table has seven columns.
> colnames(riverflow)
'Location' 'GageHeight2' 'Flow' 'site_no' 'datetime' 'Precipitation' 'GageHeight'
- Check the types of the columns.
# Check the data types of the columns.
> riverflow$types
       CV         F         F        I8        CV         F         F
'varchar'   'float'   'float'  'bigint' 'varchar'   'float'   'float'
- Print the tbl_teradata.
# Print the tbl_teradata.
# Equivalent SQL:
#   SELECT TOP 10 * FROM riverflow_parquet
> as.data.frame(head(riverflow, 10))
A data.frame: 10 × 7

Location                                                                       GageHeight2 Flow site_no datetime         Precipitation GageHeight
<chr>                                                                          <dbl>      <dbl> <int64> <chr>            <dbl>         <dbl>
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/15.parquet 1.96       153   9396100 2018-07-15 00:30 0             1.96
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/15.parquet 1.95       150   9396100 2018-07-15 01:00 0             1.95
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/15.parquet 1.92       140   9396100 2018-07-15 01:15 0             1.92
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/15.parquet 1.92       140   9396100 2018-07-15 01:30 0             1.92
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/15.parquet 1.88       127   9396100 2018-07-15 02:00 0             1.88
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/15.parquet 1.87       124   9396100 2018-07-15 02:15 0             1.87
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/15.parquet 1.90       133   9396100 2018-07-15 01:45 0             1.90
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/15.parquet 1.93       143   9396100 2018-07-15 00:45 0             1.93
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/15.parquet 1.96       153   9396100 2018-07-15 00:15 0             1.96
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/15.parquet 1.98       160   9396100 2018-07-15 00:00 0             1.98
Explore Data
- Select some columns from the tbl_teradata.
# Select a subset of columns from the tbl_teradata.
> as.data.frame(head(riverflow %>% select(GageHeight2, Flow, site_no, datetime, Precipitation, GageHeight), 10))
A data.frame: 10 × 6

GageHeight2 Flow site_no datetime         Precipitation GageHeight
<dbl>      <dbl> <int64> <chr>            <dbl>         <dbl>
1.37       44.5  9396100 2018-07-13 00:30 0             1.50
1.32       39.4  9396100 2018-07-13 01:00 0             1.46
1.31       44.5  9396100 2018-07-13 01:15 0             1.50
1.29       42.8  9396100 2018-07-13 01:30 0             1.49
1.24       26.6  9396100 2018-07-13 02:00 0             1.35
1.23       34.8  9396100 2018-07-13 02:15 0             1.43
1.30       34.8  9396100 2018-07-13 01:45 0             1.43
1.34       34.8  9396100 2018-07-13 00:45 0             1.43
1.36       46.3  9396100 2018-07-13 00:15 0             1.51
1.37       53.9  9396100 2018-07-13 00:00 0             1.56
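Because a tbl_teradata is lazy, the select()/head() pipeline above is translated into a single query against the foreign table; the generated SQL is roughly (a sketch, not the exact dbplyr output):

```sql
-- Project only the requested columns; TOP 10 corresponds to head(..., 10).
SELECT TOP 10 GageHeight2, Flow, site_no, datetime, Precipitation, GageHeight
FROM riverflow_parquet;
```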
- Filter the data where Flow is equal to 44.5.
> as.data.frame(riverflow %>% filter(Flow == 44.5))
A data.frame: 9 × 7

Location                                                                       GageHeight2 Flow site_no datetime         Precipitation GageHeight
<chr>                                                                          <dbl>      <dbl> <int64> <chr>            <dbl>         <dbl>
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/20.parquet 1.58       44.5  9396100 2018-07-20 16:45 0             1.5
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/13.parquet 1.37       44.5  9396100 2018-07-13 00:30 0             1.5
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/19.parquet 1.46       44.5  9396100 2018-07-19 22:00 0             1.5
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/13.parquet 1.31       44.5  9396100 2018-07-13 01:15 0             1.5
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/23.parquet 1.23       44.5  9396100 2018-07-23 05:15 0             1.5
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/21.parquet 1.24       44.5  9396100 2018-07-21 14:45 0             1.5
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/23.parquet 1.23       44.5  9396100 2018-07-23 06:45 0             1.5
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/21.parquet 1.24       44.5  9396100 2018-07-21 15:00 0             1.5
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09396100/2018/07/21.parquet 1.24       44.5  9396100 2018-07-21 15:30 0             1.5
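The filter() call above is pushed down to the database as a WHERE clause, so only matching rows are transferred back to the client; roughly:

```sql
-- The dplyr predicate Flow == 44.5 becomes a WHERE clause on the foreign table.
SELECT * FROM riverflow_parquet WHERE Flow = 44.5;
```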
- Sample the data to get only 5 rows.
> as.data.frame(td_sample(df = riverflow, n = 5))
A data.frame: 5 × 8

Location                                                                       GageHeight2 Flow site_no datetime         Precipitation GageHeight sampleid
<chr>                                                                          <dbl>      <dbl> <int64> <chr>            <dbl>         <dbl>      <int>
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09474000/2018/07/24.parquet 6.65       0.55  9474000 2018-07-24 05:00 0              3.53      1
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09400568/2018/07/08.parquet 5.82       0.00  9400568 2018-07-08 11:30 0              6.51      1
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09513780/2018/07/20.parquet -0.98      0.00  9513780 2018-07-20 04:15 0             -1.44      1
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09400568/2018/07/07.parquet 5.82       0.00  9400568 2018-07-07 12:30 0              6.51      1
/S3/s3.amazonaws.com/td-usgs-us-east-1/PARQUETDATA/09513780/2018/07/14.parquet -0.98      0.00  9513780 2018-07-14 18:00 0             -1.44      1
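td_sample() with n = 5 corresponds roughly to Teradata's SAMPLE clause; the extra sampleid column in the output is added by tdplyr to identify which sample each row belongs to. A sketch of the equivalent SQL:

```sql
-- Draw 5 rows from the foreign table using Teradata's SAMPLE clause.
SELECT * FROM riverflow_parquet SAMPLE 5;
```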
- Process the data with aggregate functions.
# Calculate the average of all the values in the columns "Flow", "GageHeight2",
# "site_no", and "GageHeight".
> as.data.frame(riverflow %>% summarise(mean_Flow = mean(Flow), mean_GageHeight2 = mean(GageHeight2), mean_site_no = mean(site_no), mean_GageHeight = mean(GageHeight)))
A data.frame: 1 × 4

mean_Flow mean_GageHeight2 mean_site_no mean_GageHeight
<dbl>     <dbl>            <dbl>        <dbl>
23.66244  3.367212         9443942      2.750457
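The summarise() call above is translated into aggregate functions in a single SELECT; roughly:

```sql
-- Each mean() becomes an AVG() over the whole foreign table.
SELECT AVG(Flow)        AS mean_Flow,
       AVG(GageHeight2) AS mean_GageHeight2,
       AVG(site_no)     AS mean_site_no,
       AVG(GageHeight)  AS mean_GageHeight
FROM riverflow_parquet;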
# Calculate the average of all the values in the columns "Flow" and "GageHeight2",
# grouped by the column "site_no".
> as.data.frame((riverflow %>% group_by(site_no)) %>% summarise(mean_Flow = mean(Flow), mean_GageHeight2 = mean(GageHeight2)))
A data.frame: 6 × 3

site_no mean_Flow mean_GageHeight2
<int64> <dbl>     <dbl>
9474000  3.259117  6.6917280
9396100 54.192747  1.3378045
9513780  0.000000 -0.9813805
9400568 16.829861  6.1807917
9394500 24.844732  5.0242497
9497500 62.423138  1.4001197
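group_by() followed by summarise() becomes a grouped aggregate pushed down to the database; roughly:

```sql
-- The grouping column appears in both SELECT and GROUP BY;
-- the summarise() expressions become AVG() aggregates.
SELECT site_no,
       AVG(Flow)        AS mean_Flow,
       AVG(GageHeight2) AS mean_GageHeight2
FROM riverflow_parquet
GROUP BY site_no;
```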