Parquet Examples | Native Object Store ( NOS ) | Teradata Vantage - 17.10 - Examples: Reading Parquet Data - Advanced SQL Engine - Teradata Database

Teradata Vantage™ - Native Object Store Getting Started Guide

Product
Advanced SQL Engine
Teradata Database
Release Number
17.10
Release Date
July 2021
Content Type
Programming Reference
Publication ID
B035-1214-171K
Language
English (United States)

The following examples show accessing Parquet-formatted data from Teradata's public Google Cloud Storage bucket.

The following table shows the LOCATION values for the Teradata-supplied public external object stores containing Parquet data. You can run the examples using any of these locations, after first setting up the function mapping here Setting Up to Run the Examples.

Platform LOCATION
Amazon S3 /s3/td-usgs-public.s3.amazonaws.com/PARQUETDATA/
Azure Blob storage /az/akiaxox5jikeotfww4ul.blob.core.windows.net/td-usgs/PARQUETDATA/
Google Cloud Storage /gs/storage.googleapis.com/td-usgs/PARQUETDATA/

If you want to use your own external object store, see Variable Substitutions for Examples for examples of LOCATION, ACCESS_ID/USER, and ACCESS_KEY/PASSWORD.

Using SHOW AUTHORIZATION

SHOW AUTHORIZATION MyAuthObj;

If the authorization object does not exist, run the steps in Setting Up to Run the Examples.

Setting Up to Run Parquet Examples

CREATE FOREIGN TABLE riverflow_parquet
, EXTERNAL SECURITY MyAuthObj
USING ( LOCATION ('/s3/td-usgs-public.s3.amazonaws.com/PARQUETDATA/') );

See Setting Up to Run Parquet Examples.

Sampling External Data In Place Using a Foreign Table

SELECT TOP 2 * FROM riverflow_parquet;

See Sampling External Data In Place Using a Foreign Table.

Finding Maximum and Minimum Values

SELECT MAX (Flow), MIN (Flow)
FROM riverflow_parquet;

See Finding Maximum and Minimum Values.

Querying for a Specific Set of Rows

SELECT MAX(Flow)
FROM riverflow_parquet
WHERE site_no = '09394500';

See Querying for a Specific Set of Rows.

Previewing the Parquet Schema Using READ_NOS

SELECT * FROM  (
LOCATION='/s3/td-usgs-public.s3.amazonaws.com/PARQUETDATA/'
AUTHORIZATION=MyAuthObj
RETURNTYPE='NOSREAD_SCHEMA'
) AS d;

An alternative is to use NOSREAD_PARQUET_SCHEMA, which must be used if the FULLSCAN option is used.

SELECT * FROM  (
LOCATION='/s3/td-usgs-public.s3.amazonaws.com/PARQUETDATA/'
AUTHORIZATION=MyAuthObj
FULLSCAN='TRUE'
RETURNTYPE='NOSREAD_PARQUET_SCHEMA'
) AS d;

See Previewing the Parquet Schema Using READ_NOS.

Viewing the Key Structure

SELECT * FROM  (
LOCATION='/s3/td-usgs-public.s3.amazonaws.com/PARQUETDATA/'
AUTHORIZATION=MyAuthObj
RETURNTYPE='NOSREAD_KEYS'
) AS d;

See Viewing the Key Structure.

Viewing the Parquet Content using READ_NOS

SELECT TOP 2 * FROM  (
LOCATION='/s3/td-usgs-public.s3.amazonaws.com/PARQUETDATA/'
AUTHORIZATION=MyAuthObj
) AS d;

See Viewing the Parquet Content Using READ_NOS.

Loading External Data into the Database Using CREATE TABLE AS … WITH DATA

CREATE MULTISET TABLE riverflowprecip_parquet(site_no,Flow,GageHeight,datetime) AS (
SELECT site_no,Flow,GageHeight,datetime
FROM riverflow_parquet
WHERE Precipitation > 0
) WITH DATA
NO PRIMARY INDEX;

Display the number of rows in the table:

SELECT COUNT(*) FROM riverflowprecip_parquet;

See Loading External Data into the Database Using CREATE TABLE AS WITH DATA.

Loading External Data into the Database Using INSERT ... SELECT

CREATE TABLE RiverFlowPermInsert_parquet (
DateTS CHAR(20)
,SiteNo CHAR(8)
,Flow FLOAT FORMAT '-ZZZZ9.99'
,GageHeight FLOAT FORMAT '-ZZZ9.99'
,Precipitation FLOAT FORMAT '-ZZZ9.99')
PRIMARY INDEX (SiteNo);

Insert the external data into the database table:

INSERT INTO RiverFlowPermInsert_parquet
SELECT datetime,site_no,Flow,GageHeight,Precipitation
WHERE site_no='9497500'
FROM riverflow_parquet;

Query the data from the database table:

SELECT TOP 2 * FROM RiverFlowPermInsert_parquet;

See Loading External Data into the Database Using INSERT ... SELECT.

Joining External Data and Database Tables

Create a database dimension table to join to:

CREATE SET TABLE rivernames(
site_no CHAR(8) CHARACTER SET LATIN NOT CASESPECIFIC,
name CHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC)
UNIQUE PRIMARY INDEX ( site_no ) ;

Create the foreign table or ask your database administrator to create the foreign table that is used to populate the dimension table:

CREATE FOREIGN TABLE nos_rivernames
, EXTERNAL SECURITY  MyAuthObj
USING ( LOCATION('/s3/td-usgs-public.s3.amazonaws.com/RIVERS/rivers.csv') );

Populate the dimension table:

INSERT INTO rivernames
SELECT CAST(site_no AS CHAR(8)), name
FROM nos_rivernames;

Join the dimension table and external data (foreign table):

SELECT DISTINCT name(CHAR(100))
FROM riverflow_parquet rf, rivernames rn
WHERE rf.site_no = rn.site_no
AND rf.Precipitation > 0.1
ORDER BY 1;

See Joining External Data and Database Tables.

Filtering External Parquet Data From a Foreign Table

CREATE FOREIGN TABLE riverflow_parquet_path
, EXTERNAL SECURITY MyAuthObj
USING ( 
LOCATION ('/s3/td-usgs-public.s3.amazonaws.com/PARQUETDATA/')
PATHPATTERN ('$data/$siteno/$year/$month/$day')
);

Filter on a specific column in the Parquet data:

SELECT TOP 2 GageHeight, Flow
FROM riverflow_parquet_path
WHERE site_no = '09394500';

See Filtering External Parquet Data From a Foreign Table.

Querying a Foreign Table with Path Filtering

Use path filtering to narrow down search results:

SELECT TOP 2 *
FROM riverflow_parquet_path
WHERE $PATH.$year = '2018'
AND $PATH.$month = '07'
AND $PATH.$day = '01.parquet'
AND $PATH.$siteno = '09394500';

See Querying a Foreign Table with Path Filtering.

Combining Path and Payload Filtering

Combine path and payload filtering:

SELECT COUNT(*)
FROM riverflow_parquet_path
WHERE GageHeight > 5
AND $path.$siteno = 09394500
AND $path.$year = '2018'
AND $path.$month = '07';

See Combining Path and Column Filtering .

Using Path Variables as Columns in a View

Create the view of the foreign table:

CREATE VIEW riverflowview_parquet AS (
SELECT 
CAST($path.$siteno AS CHAR(10)) TheSite,
CAST($path.$year AS CHAR(4)) TheYear,
CAST($path.$month AS CHAR(2)) TheMonth,
CAST(SUBSTR($path.$day, 1, 2) AS CHAR(2)) TheDay,
Flow,
GageHeight GageHeight1,
Precipitation,
GageHeight2
FROM riverflow_parquet_path
);

See Using Path Variables as Columns in a View.

Using a Foreign Table View with Path Filtering

REPLACE VIEW riverflowview_parquet AS (
SELECT CAST($path.$siteno AS CHAR(10)) TheSite,
CAST($path.$year AS CHAR(4)) TheYear,
CAST($path.$month AS CHAR(2)) TheMonth,
CAST(SUBSTR($path.$day, 1, 2) AS CHAR(2)) TheDay,
Flow,
GageHeight GageHeight1,
Precipitation,
GageHeight2
FROM riverflow_parquet_path
WHERE TheSite = site_no);

Query the view:

SELECT TOP 2 * FROM riverflowview_parquet;

Filter the view:

SELECT TheSite,COUNT(*)
FROM riverflowview_parquet 
WHERE TheSite='09396100'
GROUP BY 1;

Run EXPLAIN on the query to see how the filtering is done:

EXPLAIN
SELECT thesite,COUNT(*)
FROM riverflowview_parquet WHERE thesite='09396100'
GROUP BY 1;

See Using a View with Path Filtering.