Examples: Reading Parquet Data - Native Object Store (NOS)

Teradata Vantage™ - Native Object Store Getting Started Guide

Product: Advanced SQL Engine, Teradata Database
Release Number: 17.05
Release Date: January 2021
Content Type: Programming Reference
Publication ID: B035-1214-175K
Language: English (United States)

The following examples show how to access Parquet-format data in Teradata's public Google Cloud Storage bucket.

The following table shows the LOCATION values for the Teradata-supplied public external object stores containing Parquet data. You can run the examples using any of these locations after first setting up the function mapping as described in Setting Up to Run the Examples.

Platform              LOCATION
--------------------  -------------------------------------------------------------------
Amazon S3             /s3/td-usgs-public.s3.amazonaws.com/PARQUETDATA/
Azure Blob storage    /az/akiaxox5jikeotfww4ul.blob.core.windows.net/td-usgs/PARQUETDATA/
Google Cloud Storage  /gs/storage.googleapis.com/td-usgs/PARQUETDATA/

If you want to use your own external object store, see Variable Substitutions for Examples for examples of LOCATION, ACCESS_ID/USER, and ACCESS_KEY/PASSWORD.
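As a rough template (a sketch only, not runnable as-is), a foreign table over your own external object store typically supplies the credentials directly in the USING clause. The table name, column list, and the LOCATION, ACCESS_ID, and ACCESS_KEY values below are placeholders you must replace with your own values:

CREATE FOREIGN TABLE my_parquet_table
(
Location VARCHAR(2048) CHARACTER SET UNICODE CASESPECIFIC
, Flow DOUBLE PRECISION
)
USING (
LOCATION ('/s3/YOUR-BUCKET.s3.amazonaws.com/PARQUETDATA/')
ACCESS_ID ('YOUR-ACCESS-ID')
ACCESS_KEY ('YOUR-ACCESS-KEY')
STOREDAS ('PARQUET')
) NO PRIMARY INDEX
, PARTITION BY COLUMN;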

Using SHOW AUTHORIZATION

SHOW AUTHORIZATION NOS_USR.DefAuth;

If the authorization object does not exist, run the steps in Setting Up to Run the Examples.
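If you create the authorization object yourself rather than following the setup steps, the statement generally takes the following shape. This is a sketch; the ACCESS_ID and ACCESS_KEY values shown are placeholders, and the exact statement for your site is given in Setting Up to Run the Examples:

CREATE AUTHORIZATION NOS_USR.DefAuth AS DEFINER TRUSTED
USER 'YOUR-ACCESS-ID'
PASSWORD 'YOUR-ACCESS-KEY';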

Setting Up to Run Parquet Examples

CREATE FOREIGN TABLE riverflow_parquet
, EXTERNAL SECURITY DEFINER TRUSTED DefAuth
(
Location VARCHAR(2048) CHARACTER SET UNICODE CASESPECIFIC
, GageHeight2 DOUBLE PRECISION FORMAT '-ZZZ9.99'
, Flow DOUBLE PRECISION FORMAT '-ZZZZ9.99'
, site_no BIGINT
, datetime VARCHAR(16) CHARACTER SET UNICODE CASESPECIFIC
, Precipitation DOUBLE PRECISION FORMAT '-ZZZ9.99'
, GageHeight DOUBLE PRECISION FORMAT '-ZZZ9.99'
)
USING (
LOCATION ('/gs/storage.googleapis.com/td-usgs/PARQUETDATA/')
STOREDAS ('PARQUET')
) NO PRIMARY INDEX
, PARTITION BY COLUMN ;

See Setting Up to Run Parquet Examples.

Sampling External Data In Place Using a Foreign Table

SELECT TOP 2 * FROM riverflow_parquet;

Sample result:

     Location /GS/storage.googleapis.com/td-usgs/PARQUETDATA/09396100/2018/07/16.parquet
  GageHeight2     1.42
         Flow     34.10
      site_no              9396100
     datetime 2018-07-16 00:15
Precipitation     0.00
   GageHeight     1.42
     Location /GS/storage.googleapis.com/td-usgs/PARQUETDATA/09396100/2018/07/16.parquet
  GageHeight2     1.50
         Flow     44.70
      site_no              9396100
     datetime 2018-07-16 00:00
Precipitation     0.00
   GageHeight     1.50

See Sampling External Data In Place Using a Foreign Table.

Finding Maximum and Minimum Values

SELECT MAX (Flow)
, MIN (Flow)
FROM riverflow_parquet;

Sample result:

Maximum(Flow) Minimum(Flow)
------------- -------------
      1570.00          0.00

See Finding Maximum and Minimum Values.

Querying for a Specific Set of Rows

SELECT MAX(Flow)
FROM riverflow_parquet
WHERE site_no = '09394500';

Sample result:

Maximum(Flow)
-------------
       232.00

See Querying for a Specific Set of Rows.

Previewing the Parquet Schema Using READ_NOS

SELECT TOP 2 * FROM READ_NOS_FM (
USING
STOREDAS('PARQUET')
FULLSCAN('TRUE')
RETURNTYPE('NOSREAD_PARQUET_SCHEMA')
LOCATION ('/gs/storage.googleapis.com/td-usgs/PARQUETDATA/')
) AS D;

Sample result:

                 Location /GS/storage.googleapis.com/td-usgs/PARQUETDATA/09394500/2018/07/03.parquet
        ParquetColumnName GageHeight2
ParquetColumnPhysicalType DOUBLE
 ParquetColumnLogicalType NONE
             TDColumnType FLOAT
   ParquetColumnPrecision           0
       ParquetColumnScale           0
   ParquetColumnMinLength           0
   ParquetColumnMaxLength           0
      ParquetColumnIsNull    1
         ParquetColumnPos    1
                 Location /GS/storage.googleapis.com/td-usgs/PARQUETDATA/09394500/2018/07/01.parquet
        ParquetColumnName GageHeight2
ParquetColumnPhysicalType DOUBLE
 ParquetColumnLogicalType NONE
             TDColumnType FLOAT
   ParquetColumnPrecision           0
       ParquetColumnScale           0
   ParquetColumnMinLength           0
   ParquetColumnMaxLength           0
      ParquetColumnIsNull    1
         ParquetColumnPos    1

See Previewing the Parquet Schema Using READ_NOS.

Viewing the Key Structure

SELECT TOP 2 LOCATION FROM READ_NOS_FM (
USING
RETURNTYPE ('NOSREAD_KEYS')
LOCATION('/gs/storage.googleapis.com/td-usgs/PARQUETDATA/')
) AS D;

Sample result:

Location
------------------------------------------------------------
/GS/storage.googleapis.com/td-usgs/PARQUETDATA/09394500/2018/07/03.parquet
/GS/storage.googleapis.com/td-usgs/PARQUETDATA/09394500/2018/06/28.parquet

See Viewing the Key Structure.

Loading External Data into the Database Using CREATE TABLE AS...WITH DATA

CREATE MULTISET TABLE riverflowprecip_parquet AS (
SELECT CAST(site_no AS CHAR(8)) SiteNo,
CAST(Flow AS FLOAT) ( FORMAT '-ZZZZ9.99') Flow,
CAST(GageHeight AS FLOAT) ( FORMAT '-(3)ZZ9.99') GageHeight,
CAST(datetime AS TIMESTAMP(0) FORMAT 'YYYY-MM-DDBHH:MI') Date_Time
FROM riverflow_parquet
WHERE Precipitation > 0
) WITH DATA
NO PRIMARY INDEX;

Display the number of rows in the table:

SELECT COUNT(*) FROM riverflowprecip_parquet;

Sample result:

Count(*)
-----------
        155

See Loading External Data into the Database Using CREATE TABLE AS...WITH DATA.

Loading External Data into the Database Using INSERT SELECT

CREATE TABLE RiverFlowPermInsert_parquet
(DateTS CHAR(20)
,SiteNo CHAR(8)
,Flow FLOAT FORMAT '-ZZZZ9.99'
,GageHeight FLOAT FORMAT '-ZZZ9.99'
,Precipitation FLOAT FORMAT '-ZZZ9.99')
PRIMARY INDEX (SiteNo);

Insert the external data into the database table:

INSERT INTO RiverFlowPermInsert_parquet
SELECT datetime
,CAST (site_no AS CHAR(8))
,CAST (Flow AS FLOAT)
,CAST (GageHeight AS FLOAT)
,CAST (Precipitation AS FLOAT)
FROM riverflow_parquet
WHERE site_no='09394500';

Query the data from the database table:

SELECT TOP 2 * FROM RiverFlowPermInsert_parquet;

Sample result:

DateTS SiteNo Flow GageHeight Precipitation
-------------------- -------- --------- ---------- -------------
2018-07-12 01:00     9394500       4.84       5.13          0.00
2018-07-12 00:30     9394500       4.84       5.13          0.00

See Loading External Data into the Database Using INSERT SELECT.

Joining External Data and Database Tables

Create a database dimension table to join to:

CREATE SET TABLE rivernames (
site_no CHAR(8) CHARACTER SET LATIN NOT CASESPECIFIC,
name CHAR(60) CHARACTER SET LATIN NOT CASESPECIFIC)
UNIQUE PRIMARY INDEX ( site_no );

Create the foreign table that is used to populate the dimension table, or ask your database administrator to create it:

CREATE FOREIGN TABLE nos_rivernames
, EXTERNAL SECURITY DEFINER TRUSTED DefAuth
( Location VARCHAR(2048) CHARACTER SET UNICODE CASESPECIFIC,
PAYLOAD DATASET INLINE LENGTH 64000 STORAGE FORMAT CSV
)
USING (
LOCATION('/gs/storage.googleapis.com/RIVERS/rivers.csv')
);

Populate the dimension table:

INSERT INTO rivernames
SELECT payload..site_no, payload..name
FROM nos_rivernames;

Join the dimension table and external data (foreign table):

SELECT DISTINCT name(CHAR(60))
FROM riverflow_parquet rf, rivernames rn
WHERE rf.site_no = rn.site_no
AND rf.Precipitation > 0.1
ORDER BY 1;

Sample result:

name
----------------------------------
GILA RIVER AT KELVIN
LITTLE COLORADO RIVER AT WOODRUFF
NEW RIVER NEAR ROCK SPRINGS
POLACCA WASH NEAR SECOND MESA
PUERCO RIVER NEAR CHAMBERS
SALT RIVER NEAR CHRYSOTILE

See Joining External Data and Database Tables.

Filtering External Parquet Data From a Foreign Table

CREATE FOREIGN TABLE riverflow_parquet_path
, EXTERNAL SECURITY DEFINER TRUSTED DefAuth
(
Location VARCHAR(2048) CHARACTER SET UNICODE CASESPECIFIC
, GageHeight2 DOUBLE PRECISION FORMAT '-ZZZ9.99'
, Flow DOUBLE PRECISION FORMAT '-ZZZZ9.99'
, site_no BIGINT
, datetime VARCHAR(16) CHARACTER SET UNICODE CASESPECIFIC
, Precipitation DOUBLE PRECISION FORMAT '-ZZZ9.99'
, GageHeight DOUBLE PRECISION FORMAT '-ZZZ9.99'
)
USING (
LOCATION ('/gs/storage.googleapis.com/td-usgs/PARQUETDATA/')
PATHPATTERN ('$data/$siteno/$year/$month/$day')
STOREDAS ('PARQUET')
) NO PRIMARY INDEX
, PARTITION BY COLUMN;
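The PATHPATTERN variables bind positionally, in order, to the path components of each object key beneath the bucket. For a sample key from the earlier results, the bindings would be as follows (note that $day includes the file extension, which is why a later query filters on '01.parquet'):

Object key: /gs/storage.googleapis.com/td-usgs/PARQUETDATA/09394500/2018/07/01.parquet

$data   = PARQUETDATA
$siteno = 09394500
$year   = 2018
$month  = 07
$day    = 01.parquet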

Filter on a specific column in the Parquet data:

SELECT TOP 2 GageHeight TheGageHeight
, Flow TheFlow
FROM riverflow_parquet_path
WHERE site_no = '09394500';

Sample result:

TheGageHeight TheFlow
------------- ---------
         5.56     31.20
         5.13      4.84

See Filtering External Parquet Data From a Foreign Table.

Querying a Foreign Table with Path Filtering

Use path filtering to narrow down search results:

SELECT TOP 2 *
FROM riverflow_parquet_path
WHERE $PATH.$year = '2018'
AND $PATH.$month = '07'
AND $PATH.$day = '01.parquet'
AND $PATH.$siteno = '09394500';

Sample result:

     Location /GS/storage.googleapis.com/td-usgs/PARQUETDATA/09394500/2018/07/01.parquet
  GageHeight2     4.81
         Flow      0.00
      site_no              9394500
     datetime 2018-07-01 00:15
Precipitation     0.00
   GageHeight     4.75
     Location /GS/storage.googleapis.com/td-usgs/PARQUETDATA/09394500/2018/07/01.parquet
  GageHeight2     4.82
         Flow      0.01
      site_no              9394500
     datetime 2018-07-01 00:00
Precipitation     0.00
   GageHeight     4.75

Filter on the siteno using path filtering:

SELECT TOP 2 GageHeight TheGageHeight
, Flow TheFlow
FROM riverflow_parquet_path
WHERE $path.$siteno = '09394500';

Sample result:

TheGageHeight TheFlow
------------- ---------
         5.04      3.53
         5.13      4.84

See Querying a Foreign Table with Path Filtering.

Combining Path and Payload Filtering

Combine path and payload filtering:

SELECT COUNT(*)
FROM riverflow_parquet_path
WHERE GageHeight > 5
AND $path.$siteno = '09394500'
AND $path.$year = '2018'
AND $path.$month = '07';

Sample result:

   Count(*)
-----------
        834

See Combining Path and Payload Filtering.

Using Path Variables as Columns in a View

Create the view of the foreign table:

CREATE VIEW riverflowview_parquet AS (
SELECT CAST($path.$siteno AS CHAR(10)) TheSite,
CAST($path.$year AS CHAR(4)) TheYear,
CAST($path.$month AS CHAR(2)) TheMonth,
CAST(SUBSTR($path.$day, 1, 2) AS CHAR(2)) TheDay,
Flow,
GageHeight GageHeight1,
Precipitation,
GageHeight2
FROM riverflow_parquet_path);

See Using Path Variables as Columns in a View.

Using a Foreign Table View with Path Filtering

REPLACE VIEW riverflowview_parquet AS (
SELECT CAST($path.$siteno AS CHAR(10)) TheSite,
CAST($path.$year AS CHAR(4)) TheYear,
CAST($path.$month AS CHAR(2)) TheMonth,
CAST(SUBSTR($path.$day, 1, 2) AS CHAR(2)) TheDay,
Flow,
GageHeight GageHeight1,
Precipitation,
GageHeight2
FROM riverflow_parquet_path
WHERE TheSite = site_no);

Query the view:

SELECT TOP 2 * FROM riverflowview_parquet;

Sample result:

      TheSite  09396100
      TheYear  2018
     TheMonth  07
       TheDay  16
         Flow      34.10
  GageHeight1      1.42
Precipitation      0.00
  GageHeight2      1.42

      TheSite  09396100
      TheYear  2018
     TheMonth  07
       TheDay  16
         Flow      44.70
  GageHeight1      1.50
Precipitation      0.00
  GageHeight2      1.50

Filter the view:

SELECT thesite,COUNT(*)
FROM riverflowview_parquet WHERE thesite='09396100'
GROUP BY 1;

Sample result:

TheSite        Count(*)
----------  -----------
09396100           2906

Run EXPLAIN on the query to see how the filtering is done:

EXPLAIN
SELECT thesite,COUNT(*)
FROM riverflowview_parquet WHERE thesite='09396100'
GROUP BY 1;

Sample result:

Explanation
------------------------------------------------------------------------
[...]
3) We do a single-AMP RETRIEVE step from
NOS_USR.riverflow_parquet_path in view riverflowview_parquet
metadata by way of an all-rows scan with a condition of (
"(TD_SYSFNLIB.NosExtractVarFromPath (
NOS_USR.riverflow_parquet_path in view
riverflowview_parquet.Location, '/S3/td-usgs.s3.amazonaws.com/', 2
)(CHAR(10), CHARACTER SET UNICODE, NOT CASESPECIFIC))= '09396100 '")

into Spool 3 (all_amps), which is built locally on that AMP. Then
we do a SORT to order Spool 3 by the sort key as the field_id
list( 3, 2). The size of Spool 3 is estimated with no confidence
to be 17 rows (11,985 bytes). The estimated time for this step is
0.60 seconds.
4) We do an all-AMPs RETRIEVE step in TD_Map1 from Spool 3 (Last Use)
by way of an all-rows scan into Spool 2 (all_amps), which is
binpacked and redistributed by size to all AMPs in TD_Map1. The
size of Spool 2 is estimated with no confidence to be 17 rows (
12,121 bytes). The estimated time for this step is 0.16 seconds.
5) We do an all-AMPs SUM step in TD_MAP1 to aggregate from 3 column
partitions of NOS_USR.riverflow_parquet_path in view
riverflowview_parquet by way of external metadata in Spool 2 (Last
Use) with a condition of ("(TD_SYSFNLIB.NosExtractVarFromPath (
NOS_USR.riverflow_parquet_path in view
riverflowview_parquet.Location, '/S3/td-usgs.s3.amazonaws.com/', 2
)(CHAR(10), CHARACTER SET UNICODE, NOT CASESPECIFIC)(FLOAT, FORMAT
'-9.99999999999999E-999'))= (NOS_USR.riverflow_parquet_path in
view riverflowview_parquet.site_no)"), and the grouping identifier
in field 1. Aggregate Intermediate Results are computed globally,
then placed in Spool 5 in TD_Map1. The size of Spool 5 is
estimated with no confidence to be 1,215 rows (1,703,430 bytes).
The estimated time for this step is 0.77 seconds.
[...]

See Using a Foreign Table View with Path Filtering.