copy_to_sql() | Teradata Package for Python - copy_to_sql()

copy_to_sql() | Teradata Package for Python - copy_to_sql() - Teradata Vantage

Teradata® VantageCloud Lake

Deployment

VantageCloud

Edition

Lake

Product

Teradata Vantage

Published

January 2023

Language

English (United States)

Last Update

2024-04-03

dita:mapPath

phg1621910019905.ditamap

dita:ditavalPath

pny1626732985837.ditaval

dita:id

phg1621910019905

Use the copy_to_sql() function to create a table in Vantage based on a teradataml DataFrame or a pandas DataFrame.

The function takes a teradataml DataFrame or a pandas DataFrame and a table name as arguments, and generates DDL and DML commands that creates a table in Vantage. You can also specify the name of the schema in the database to write the table to. If no schema name is specified, the default database schema is used.

Required Arguments:

df: Specifies the Pandas or teradataml DataFrame object to be saved.
table_name: Specifies the name of the table to be created in Vantage.

Optional Arguments:

schema_name: Specifies the name of the SQL schema in Vantage to write to.
if_exists: Specifies the action to take when table already exists in the database.
Possible values are:
- 'fail': If table exists, do nothing;
- 'replace': If table exists, drop it, recreate it, and insert data;
- 'append': If table exists, insert data. Create if does not exist.
The default value is 'append'.

Replacing a table with the contents of a teradataml DataFrame based on the same underlying table is not supported.
index: Specifies whether to save pandas DataFrame index as a column or not. Possible values are True or False.
The default value is False.
Only use as True when attempting to save Pandas DataFrames (and not on teradataml DataFrames).
index_label: Specifies the column labels for pandas DataFrame index columns. If no value is given and index is True, then a default label 'index_label' is used.
The default value is None.
Only use this argument when attempting to save Pandas DataFrames (and not on teradataml DataFrames).
If this argument is not specified (defaulted to None or is empty) and argument index is set to True, then the names property of the DataFrames index is used as the label(s), and if that too is None or empty, then:
- A default label 'index_label' or 'level_0' (when 'index_label' is already taken) is used when index is standard.
- Default labels 'level_0', 'level_1', etc. are used when index is multi-level index.
primary_index: Specifies the column(s) to use as primary index while creating tables in Vantage. When None (default value), No Primary Index (NOPI) tables are created.
For example:
- primary_index = 'my_primary_index'
- primary_index = ['my_primary_index1', 'my_primary_index2', 'my_primary_index3']
temporary: Specifies whether to create Vantage tables as permanent or volatile.
Possible values are:
- True: Creates volatile tables, and schema_name is ignored.
- False: Creates permanent tables.
The default value is False.

Volatile tables are only supported on the Block File System on the primary cluster. They are not available for the Object File System.

types: Specifies required data types for requested columns to be saved in Vantage.

The default value is None.

This argument accepts a dictionary of column names and their required teradatasqlalchemy types as key-value pairs, allowing to specify a subset of the columns of a specific type.

When only a subset of all columns are provided, the rest are defaulted to appropriate types.

When types argument is not provided, all column types are appropriately assigned and defaulted. The column types are assigned as listed in the following table.

pandas/NumPy Type	teradatasqlalchemy Type
int32	INTEGER
int64	BIGINT
bool	BYTEINT
float32/float64	FLOAT
datetime64/datatime64[ns]	TIMESTAMP
datatime64[ns,time_zone]	TIMESTAMP(time_zone=True)
Any other data type	VARCHAR(configure.default_varchar_size)

primary_time_index_name: Specifies a name for the Primary Time Index (PTI) when the table to be created must be a PTI table.
This argument is not required or used when the table to be created is not a PTI table. It is ignored if specified without the argument timecode_column.
timecode_column: Specifies the column in the DataFrame that reflects the form of the timestamp data in the time series.
This argument is required when the DataFrame must be saved as a PTI table.

This column is the TD_TIMECODE column in the table created. It must be of the SQL type TIMESTAMP(n), TIMESTAMP(n) WITH TIME ZONE, or DATE, corresponding to the Python types datetime.datetime or datetime.date, or Pandas dtype datetime64[ns].

This argument is not required when the table to be created is not a PTI table. When specifying this argument, an attempt to create a PTI table is made. If this argument is specified, primary_index is ignored.
timezero_date: Specifies the earliest time series data that the PTI accepts; a date that precedes the earliest date in the time series data. Value specified must be of the following format: DATE 'YYYY-MM-DD'.
The default value is: DATE '1970-01-01'.
- This argument is used when the DataFrame must be saved as a PTI table.
- This argument is not required or used when the table to be created is not a PTI table.
- It is ignored if specified without the timecode_column argument.
timebucket_duration: Specifies a duration that serves to break up the time continuum in the time series data into discrete groups or buckets.
Specified using the formal form time_unit(n), where n is a positive integer, and time_unit can be any of the following: CAL_YEARS, CAL_MONTHS, CAL_DAYS, WEEKS, DAYS, HOURS, MINUTES, SECONDS, MILLISECONDS, OR MICROSECONDS.
- This argument is required if columns_list is not specified or is empty.
- This argument is used when the DataFrame must be saved as a PTI table.
- This argument is not required or used when the table to be created is not a PTI table.
- It is ignored if specified without the timecode_column argument.
columns_list: Specifies a list of one or more PTI table column names.
- This argument is required if timebucket_duration is not specified or is empty.
- This argument is used when the DataFrame must be saved as a PTI table.
- This argument is not required or used when the table to be created is not a PTI table.
- It is ignored if specified without the timecode_column argument.
sequence_column: Specifies the column of type Integer containing the unique identifier for time series data reading when they are not unique in time.
- When specified, implies SEQUENCED, meaning more than one reading from the same sensor may have the same timestamp. This column is the TD_SEQNO column in the table created.
- When not specified, implies NONSEQUENCED, meaning there is only one sensor reading per timestamp. This is the default.
- This argument is used when the DataFrame must be saved as a PTI table.
- This argument is not required or used when the table to be created is not a PTI table.
- It is ignored if specified without the timecode_column argument.
seq_max: Specifies the maximum number of sensor data rows that can have the same timestamp. Can be used when 'sequenced' is True.
Accepted values range from 1 to 2147483647, with default value 20000.
- This argument is used when the DataFrame must be saved as a PTI table.
- This argument is not required or used when the table to be created is not a PTI table.
- It is ignored if specified without the timecode_column argument.
set_table: Specifies a flag to determine whether to create a SET or a MULTISET table.
- When True, a SET table is created.
- When False, a MULTISET table is created. This is the default value.
- Specifying set_table=True also requires specifying primary_index or timecode_column.
- Creating SET table (set_table=True) may result in
  - an error if the source is a Pandas DataFrame having duplicate rows;
  - loss of duplicate rows if the source is a teradataml DataFrame.
- This argument has no effect if the table already exists and if_exists='append'.
This argument is only supported on the Block File System on the primary cluster. It is not available for the Object File System.
match_column_order: Specifies whether the order of the columns in existing table matches the order of the columns in the "df" or not.
When set to False, the DataFrame to be loaded can have any order and number of columns.

The default value is True.
chunksize: Specifies the number of rows to be loaded in a batch, if DataFrame is pandas DataFrame.
The default value is 16383.

Example 1: Create a new table in Vantage based on a pandas DataFrame

You must import pandas and copy_to_sql first.

>>>import pandas as pd
>>>from teradataml.dataframe.copy_to import copy_to_sql

Assume you create the "sales" table as follows:

>>>sales = [{'accounts': 'Jones LLC', 'Jan': 150, 'Feb': 200.0, 'Mar': 140, 'datetime':'2017-04-01'},
            {'accounts': 'Alpha Co',  'Jan': 200, 'Feb': 210.0, 'Mar': 215, 'datetime': '2017-04-01'},
            {'accounts': 'Blue Inc',  'Jan': 50,  'Feb': 90.0,  'Mar': 95, 'datetime': '2017-04-01' }]

And a pandas DataFrame "pdf" is created from the table "sales", using command:

pdf = pd.DataFrame(sales)

Enter pdf to display the pandas DataFrame:

>>>pdf
     Feb  Jan  Mar   accounts    datetime
0  200.0  150  140  Jones LLC  2017-04-01
1  210.0  200  215   Alpha Co  2017-04-01
2   90.0   50   95   Blue Inc  2017-04-01

Use the copy_to_sql() function to create a new Vantage table "pandas_sales" based on the pandas DataFrame "pdf" that is created in the prerequisite.

>>>copy_to_sql(df = pdf, table_name = "pandas_sales", primary_index="accounts", if_exists="replace")

Verify that the new table "pandas_sales" exists in Vantage using the DataFrame() function which creates a DataFrame based on an existing table.

>>>df = DataFrame("pandas_sales")

>>> df
             Feb  Jan  Mar                    datetime
accounts
Jones LLC  200.0  150  140  2017-04-01 00:00:00.000000
Alpha Co   210.0  200  215  2017-04-01 00:00:00.000000
Blue Inc    90.0   50   95  2017-04-01 00:00:00.000000

Example 2: Use the optional index and index_label parameters, and use the index as Primary Index.

>>> copy_to_sql(df = pdf, table_name = "pandas_sales", index=True, index_label="idx", primary_index="idx", if_exists="replace")

>>> df = DataFrame("pandas_sales")
>>> df
       Feb  Jan  Mar   accounts                    datetime
idx
0    200.0  150  140  Jones LLC  2017-04-01 00:00:00.000000
2     90.0   50   95   Blue Inc  2017-04-01 00:00:00.000000
1    210.0  200  215   Alpha Co  2017-04-01 00:00:00.000000

Example 3: Create a new table in Vantage based on a teradataml DataFrame

>>> df = DataFrame("sales")
>>> df
              Feb   Jan   Mar   Apr    datetime
accounts                                      
Blue Inc     90.0    50    95   101  04/01/2017
Alpha Co    210.0   200   215   250  04/01/2017
Jones LLC   200.0   150   140   180  04/01/2017
Yellow Inc   90.0  None  None  None  04/01/2017
Orange Inc  210.0  None  None   250  04/01/2017
Red Inc     200.0   150   140  None  04/01/2017

>>> copy_to_sql(df = df, table_name = "sales_df1", primary_index="accounts", if_exists="replace")

>>> df1 = DataFrame("sales_df1")
>>> df1
              Feb   Jan   Mar   Apr  datetime
accounts                                    
Blue Inc     90.0    50    95   101  17/01/04
Orange Inc  210.0  None  None   250  17/01/04
Red Inc     200.0   150   140  None  17/01/04
Yellow Inc   90.0  None  None  None  17/01/04
Jones LLC   200.0   150   140   180  17/01/04
Alpha Co    210.0   200   215   250  17/01/04

Example 4: Create a new Primary Time Index Table in Vantage from a Pandas DataFrame

>>> copy_to_sql(df = pdf, table_name = "pandas_sales_pti", timecode_column="datetime", columns_list="accounts")

>>> pandas_sales_pti = DataFrame('pandas_sales_pti')
>>> pandas_sales_pti
                          TD_TIMECODE    Feb  Jan  Mar
accounts
Jones LLC  2017-04-01 00:00:00.000000  200.0  150  140
Alpha Co   2017-04-01 00:00:00.000000  210.0  200  215
Blue Inc   2017-04-01 00:00:00.000000   90.0   50   95