Use the copy_to_sql() function to create a table in Vantage based on a teradataml DataFrame or a pandas DataFrame.
The function takes a teradataml DataFrame or a pandas DataFrame and a table name as arguments, and generates the DDL and DML commands that create a table in Vantage. You can also specify the name of the schema in the database to write the table to; if no schema name is specified, the default database schema is used.
Required Parameters
- df
- Specifies the pandas or teradataml DataFrame object to be saved.
- table_name
- Specifies the name of the table to be created in Vantage.
Optional Parameters
- schema_name
- Specifies the name of the SQL schema in Vantage to write to.
- if_exists
- Specifies the action to take when the table already exists in the database. Possible values are:
- 'fail': If the table exists, do nothing.
- 'replace': If the table exists, drop it, recreate it, and insert data.
- 'append': If the table exists, insert data; create the table if it does not exist.
The default value is 'append'.
Replacing a table with the contents of a teradataml DataFrame based on the same underlying table is not supported.
- index
- Specifies whether to save the pandas DataFrame index as a column. Possible values are True and False. The default value is False. Set to True only when saving a pandas DataFrame (not a teradataml DataFrame).
- index_label
- Specifies the column labels for pandas DataFrame index columns. The default value is None. Use this argument only when saving a pandas DataFrame (not a teradataml DataFrame). If this argument is not specified (None or empty) and the index argument is set to True, the names property of the DataFrame's index is used as the label(s); if that too is None or empty, then:
- A default label 'index_label' or 'level_0' (when 'index_label' is already taken) is used when index is standard.
- Default labels 'level_0', 'level_1', etc. are used when index is multi-level index.
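The label-selection rules above can be sketched in plain Python. The helper below is hypothetical, written only to illustrate the precedence order (explicit label, then index names, then defaults); it is not part of teradataml:

```python
import pandas as pd

def default_index_labels(df, index_label=None):
    """Illustrative sketch of how index column labels could be chosen."""
    # Explicit labels win.
    if index_label:
        return [index_label] if isinstance(index_label, str) else list(index_label)
    # Fall back to the index's own names, if any are set.
    names = list(df.index.names)
    if any(n is not None for n in names):
        return [n if n is not None else "level_{}".format(i)
                for i, n in enumerate(names)]
    if df.index.nlevels == 1:
        # 'index_label' unless a column already uses that name.
        return ["index_label" if "index_label" not in df.columns else "level_0"]
    # Multi-level index: level_0, level_1, ...
    return ["level_{}".format(i) for i in range(df.index.nlevels)]

pdf = pd.DataFrame({"a": [1, 2]})
print(default_index_labels(pdf))         # ['index_label']
print(default_index_labels(pdf, "idx"))  # ['idx']
```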
- primary_index
- Specifies the columns to use as primary index while creating tables in Vantage. When None (default value), No Primary Index (NOPI) tables are created. For example:
- primary_index = 'my_primary_index'
- primary_index = ['my_primary_index1', 'my_primary_index2', 'my_primary_index3']
- temporary
- Specifies whether to create Vantage tables as permanent or volatile. Possible values are:
- True: Creates volatile tables, and schema_name is ignored.
- False: Creates permanent tables.
The default value is False.
- types
- Specifies required data types for requested columns to be saved in Vantage.
The default value is None.
This argument accepts a dictionary of column names and their required teradatasqlalchemy types as key-value pairs, allowing you to specify types for a subset of the columns.
- When only a subset of all columns are provided, the rest are defaulted to appropriate types.
- When the types argument is not provided, all column types are assigned appropriate defaults, as listed below (pandas/NumPy type: teradatasqlalchemy type):
- int32: INTEGER
- int64: BIGINT
- bool: BYTEINT
- float32/float64: FLOAT
- datetime64/datetime64[ns]: TIMESTAMP
- datetime64[ns, time_zone]: TIMESTAMP(time_zone=True)
- Any other data type: VARCHAR(configure.default_varchar_size)
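The default mapping can be approximated with a plain dtype-name lookup. This is a client-side sketch for illustration only; the real type assignment happens inside copy_to_sql(), and `default_type_name` is a hypothetical helper:

```python
import numpy as np
import pandas as pd

# Default pandas/NumPy -> teradatasqlalchemy type names, per the list above.
DEFAULT_TYPE_NAMES = {
    "int32": "INTEGER",
    "int64": "BIGINT",
    "bool": "BYTEINT",
    "float32": "FLOAT",
    "float64": "FLOAT",
    "datetime64[ns]": "TIMESTAMP",
}

def default_type_name(dtype):
    name = str(dtype)
    if name.startswith("datetime64[ns,"):  # time-zone-aware timestamps
        return "TIMESTAMP(time_zone=True)"
    # Anything unrecognized falls back to VARCHAR(configure.default_varchar_size).
    return DEFAULT_TYPE_NAMES.get(name, "VARCHAR")

pdf = pd.DataFrame({
    "i": np.array([1, 2], dtype="int64"),
    "f": [1.5, 2.5],
    "s": ["a", "b"],
    "t": pd.to_datetime(["2024-01-01", "2024-01-02"]),
})
print({col: default_type_name(dt) for col, dt in pdf.dtypes.items()})
# {'i': 'BIGINT', 'f': 'FLOAT', 's': 'VARCHAR', 't': 'TIMESTAMP'}
```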
- primary_time_index_name
- Specifies a name for the Primary Time Index (PTI) when the table to be created must be a PTI table. This argument is not required or used when the table to be created is not a PTI table. It is ignored if specified without the argument timecode_column.
- timecode_column
- Specifies the column in the DataFrame that reflects the form of the timestamp data in the time series. This argument is required when the DataFrame must be saved as a PTI table.
This column is the TD_TIMECODE column in the table created. It must be of the SQL type TIMESTAMP(n), TIMESTAMP(n) WITH TIME ZONE, or DATE, corresponding to the Python types datetime.datetime or datetime.date, or Pandas dtype datetime64[ns].
This argument is not required when the table to be created is not a PTI table. When this argument is specified, an attempt to create a PTI table is made. If this argument is specified, primary_index is ignored.
- timezero_date
- Specifies the earliest time series data that the PTI accepts; a date that precedes the earliest date in the time series data. Value specified must be of the following format: DATE 'YYYY-MM-DD'.
The default value is: DATE '1970-01-01'.
- This argument is used when the DataFrame must be saved as a PTI table.
- This argument is not required or used when the table to be created is not a PTI table.
- It is ignored if specified without the timecode_column argument.
- timebucket_duration
- Specifies a duration that serves to break up the time continuum in the time series data into discrete groups or buckets.
Specified using the formal form time_unit(n), where n is a positive integer and time_unit can be any of the following: CAL_YEARS, CAL_MONTHS, CAL_DAYS, WEEKS, DAYS, HOURS, MINUTES, SECONDS, MILLISECONDS, or MICROSECONDS.
- This argument is required if columns_list is not specified or is empty.
- This argument is used when the DataFrame must be saved as a PTI table.
- This argument is not required or used when the table to be created is not a PTI table.
- It is ignored if specified without the timecode_column argument.
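The formal form time_unit(n) can be checked with a small regular expression. The validator below is a hypothetical sketch written for this page (teradataml performs its own validation, and may also accept shorthand forms not covered here):

```python
import re

# The time units listed above for timebucket_duration.
TIME_UNITS = ("CAL_YEARS", "CAL_MONTHS", "CAL_DAYS", "WEEKS", "DAYS",
              "HOURS", "MINUTES", "SECONDS", "MILLISECONDS", "MICROSECONDS")

def is_valid_timebucket_duration(value):
    """Check a duration against the formal form time_unit(n), n a positive integer."""
    m = re.fullmatch(r"([A-Z_]+)\((\d+)\)", value.strip().upper())
    return bool(m) and m.group(1) in TIME_UNITS and int(m.group(2)) > 0

print(is_valid_timebucket_duration("MINUTES(10)"))    # True
print(is_valid_timebucket_duration("FORTNIGHTS(2)"))  # False
```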
- columns_list
- Specifies a list of one or more PTI table column names.
- This argument is required if timebucket_duration is not specified or is empty.
- This argument is used when the DataFrame must be saved as a PTI table.
- This argument is not required or used when the table to be created is not a PTI table.
- It is ignored if specified without the timecode_column argument.
- sequence_column
- Specifies the column of type Integer containing the unique identifier for time series data readings when they are not unique in time.
- When specified, implies SEQUENCED, meaning more than one reading from the same sensor may have the same timestamp. This column is the TD_SEQNO column in the table created.
- When not specified, implies NONSEQUENCED, meaning there is only one sensor reading per timestamp. This is the default.
- This argument is used when the DataFrame must be saved as a PTI table.
- This argument is not required or used when the table to be created is not a PTI table.
- It is ignored if specified without the timecode_column argument.
- seq_max
- Specifies the maximum number of sensor data rows that can have the same timestamp. Can be used when the table is SEQUENCED, that is, when sequence_column is specified.
Accepted values range from 1 to 2147483647, with default value 20000.
- This argument is used when the DataFrame must be saved as a PTI table.
- This argument is not required or used when the table to be created is not a PTI table.
- It is ignored if specified without the timecode_column argument.
- set_table
- Specifies a flag to determine whether to create a SET or a MULTISET table.
- When True, a SET table is created.
- When False, a MULTISET table is created. This is the default value.
- Specifying set_table=True also requires specifying primary_index or timecode_column.
- Creating SET table (set_table=True) may result in
- an error if the source is a Pandas DataFrame having duplicate rows;
- loss of duplicate rows if the source is a teradataml DataFrame.
- This argument has no effect if the table already exists and if_exists='append'.
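The duplicate-row behavior described for SET tables can be previewed on the client side before loading. This is only a pandas sketch of the effect; for a teradataml source the deduplication happens in Vantage:

```python
import pandas as pd

pdf = pd.DataFrame({"id": [1, 1, 2], "val": [10, 10, 20]})

# A SET table stores fully duplicate rows only once; previewing with
# pandas shows which rows would survive (and whether a pandas source
# would instead raise a duplicate-row error on load).
deduped = pdf.drop_duplicates()
print(len(pdf), len(deduped))  # 3 2
```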
- match_column_order
- Specifies whether the order of the columns in the existing table matches the order of the columns in "df".
When set to False, the DataFrame to be loaded can have any order and number of columns.
The default value is True.
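When match_column_order is left at its default of True, a pandas DataFrame can be reordered on the client so a positional insert lines up with the target table. A minimal sketch, where `target_columns` is an assumed, hypothetical column order for the existing table:

```python
import pandas as pd

pdf = pd.DataFrame({"Mar": [140], "Jan": [150], "Feb": [200.0]})

# Hypothetical column order of the existing target table.
target_columns = ["Jan", "Feb", "Mar"]

# Reorder the DataFrame columns to match the table before loading.
pdf = pdf[target_columns]
print(list(pdf.columns))  # ['Jan', 'Feb', 'Mar']
```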
- chunksize
- Specifies the number of rows to be loaded in a batch when the DataFrame is a pandas DataFrame.
The default value is 16383.
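Batch loading with chunksize can be pictured as slicing the pandas DataFrame into fixed-size row blocks. This is a client-side sketch of the idea, not teradataml's internal loader:

```python
import pandas as pd

pdf = pd.DataFrame({"n": range(40000)})
chunksize = 16383  # the default batch size

# Slice the DataFrame into batches of at most `chunksize` rows.
chunks = [pdf.iloc[i:i + chunksize] for i in range(0, len(pdf), chunksize)]
print(len(chunks), [len(c) for c in chunks])  # 3 [16383, 16383, 7234]
```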
- partition_by
- Specifies the columns on which partition should be created while creating the table.
- "partition_by", "partition_by_case" and "partition_by_range" are mutually exclusive.
- "primary_index" should be specified when "partition_by" is used.
- Not applicable for PTI tables.
- partition_by_case
- Specifies different cases to partition the index while creating the table.
- "partition_by", "partition_by_case" and "partition_by_range" are mutually exclusive.
- "primary_index" should be specified when "partition_by_case" is used.
- Not applicable for PTI tables.
- partition_by_range
- Specifies the range of values on which partition should be created while creating a table.
- "partition_by", "partition_by_case" and "partition_by_range" are mutually exclusive.
- "primary_index" should be specified when "partition_by_range" is used.
- Not applicable for PTI tables.
- sub_partition
- Specifies the details to subpartition the main partition according to the value provided while creating the table.
- "sub_partition" is applicable only when "partition_by_range" is specified.
- Not applicable for PTI tables.
- kwargs
- Optional keyword arguments.
- valid_time_columns
Specifies the names of the valid time columns referenced in "df".
When "valid_time_columns" is specified, the function considers these columns as valid time dimension columns and creates a valid time dimension temporal table if the table does not exist.
If a string is provided, the column must be of PERIOD type.
- derived_column
Specifies the name of the derived column to be kept in the temporal table.
- Argument is ignored if "valid_time_columns" are not specified.
- Argument is considered only if copy_to_sql() is creating a table.
- If "valid_time_columns" is specified and "derived_column" is not specified, then copy_to_sql() automatically creates a derived column by adding "_" between the columns mentioned in "valid_time_columns". For example, if "valid_time_columns" is ('col1', 'col2') and "derived_column" is not specified, then copy_to_sql() creates table with derived column name as 'col1_col2'.
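The automatic naming rule can be sketched as follows; `default_derived_column` is a hypothetical helper written only to illustrate the rule:

```python
def default_derived_column(valid_time_columns):
    """Join the valid-time column names with '_', per the rule above."""
    return "_".join(valid_time_columns)

print(default_derived_column(("col1", "col2")))  # col1_col2
```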
Example 1: Create a new table in Vantage based on a pandas DataFrame
You must import pandas and copy_to_sql first.
>>> import pandas as pd
>>> from teradataml.dataframe.copy_to import copy_to_sql
Assume you have the following "sales" data:
>>>sales = [{'accounts': 'Jones LLC', 'Jan': 150, 'Feb': 200.0, 'Mar': 140, 'datetime':'2017-04-01'},
{'accounts': 'Alpha Co', 'Jan': 200, 'Feb': 210.0, 'Mar': 215, 'datetime': '2017-04-01'},
{'accounts': 'Blue Inc', 'Jan': 50, 'Feb': 90.0, 'Mar': 95, 'datetime': '2017-04-01' }]
A pandas DataFrame "pdf" is created from "sales" using the command:
pdf = pd.DataFrame(sales)
Enter pdf to display the pandas DataFrame:
>>>pdf
Feb Jan Mar accounts datetime
0 200.0 150 140 Jones LLC 2017-04-01
1 210.0 200 215 Alpha Co 2017-04-01
2 90.0 50 95 Blue Inc 2017-04-01
Use the copy_to_sql() function to create a new Vantage table "pandas_sales" based on the pandas DataFrame "pdf" created above.
>>>copy_to_sql(df = pdf, table_name = "pandas_sales", primary_index="accounts", if_exists="replace")
Verify that the new table "pandas_sales" exists in Vantage using the DataFrame() function, which creates a DataFrame based on an existing table.
>>>df = DataFrame("pandas_sales")
>>> df
Feb Jan Mar datetime
accounts
Jones LLC 200.0 150 140 2017-04-01 00:00:00.000000
Alpha Co 210.0 200 215 2017-04-01 00:00:00.000000
Blue Inc 90.0 50 95 2017-04-01 00:00:00.000000
Example 2: Use the optional index and index_label parameters, and use the index as Primary Index
>>> copy_to_sql(df = pdf, table_name = "pandas_sales", index=True, index_label="idx", primary_index="idx", if_exists="replace")
>>> df = DataFrame("pandas_sales")
>>> df
Feb Jan Mar accounts datetime
idx
0 200.0 150 140 Jones LLC 2017-04-01 00:00:00.000000
2 90.0 50 95 Blue Inc 2017-04-01 00:00:00.000000
1 210.0 200 215 Alpha Co 2017-04-01 00:00:00.000000
Example 3: Create a new table in Vantage based on a teradataml DataFrame
>>> df = DataFrame("sales")
>>> df
Feb Jan Mar Apr datetime
accounts
Blue Inc 90.0 50 95 101 04/01/2017
Alpha Co 210.0 200 215 250 04/01/2017
Jones LLC 200.0 150 140 180 04/01/2017
Yellow Inc 90.0 None None None 04/01/2017
Orange Inc 210.0 None None 250 04/01/2017
Red Inc 200.0 150 140 None 04/01/2017
>>> copy_to_sql(df = df, table_name = "sales_df1", primary_index="accounts", if_exists="replace")
>>> df1 = DataFrame("sales_df1")
>>> df1
Feb Jan Mar Apr datetime
accounts
Blue Inc 90.0 50 95 101 17/01/04
Orange Inc 210.0 None None 250 17/01/04
Red Inc 200.0 150 140 None 17/01/04
Yellow Inc 90.0 None None None 17/01/04
Jones LLC 200.0 150 140 180 17/01/04
Alpha Co 210.0 200 215 250 17/01/04
Example 4: Create a new Primary Time Index Table in Vantage from a Pandas DataFrame
>>> copy_to_sql(df = pdf, table_name = "pandas_sales_pti", timecode_column="datetime", columns_list="accounts")
>>> pandas_sales_pti = DataFrame('pandas_sales_pti')
>>> pandas_sales_pti
TD_TIMECODE Feb Jan Mar
accounts
Jones LLC 2017-04-01 00:00:00.000000 200.0 150 140
Alpha Co 2017-04-01 00:00:00.000000 210.0 200 215
Blue Inc 2017-04-01 00:00:00.000000 90.0 50 95
Example 5: Save a Pandas DataFrame with VECTOR datatype
>>> import pandas as pd
>>> VECTOR_data = {
... 'id': [10, 11, 12, 13],
... 'array_col': ['1,1', '2,2', '3,3', '4,4']
... }
>>> df = pd.DataFrame(VECTOR_data)
>>> from teradatasqlalchemy import VECTOR
>>> copy_to_sql(df=df, table_name='my_vector_table', types={'array_col': VECTOR})
>>>
Example 6: Save a pandas DataFrame with partition_by
>>> df = {'emp_name': ['A1', 'A2', 'A3', 'A4'],
... 'emp_sage': [100, 200, 300, 400],
... 'emp_id': [133, 144, 155, 177],
... 'marks': [99.99, 97.32, 94.67, 91.00]
... }
>>> pandas_df = pd.DataFrame(df)
>>> copy_to_sql(df=pandas_df, table_name='my_table_5', if_exists='replace',
...             primary_index=['emp_id'],
...             partition_by='emp_id')
Example 7: Save a pandas DataFrame with partition_by_case
>>> copy_to_sql(df=pandas_df, table_name='my_table_6', if_exists='replace',
...             primary_index=['emp_id'],
...             partition_by_case='emp_id > 100, emp_id < 500')
Example 8: Save a pandas DataFrame with partition_by_range
>>> copy_to_sql(df=pandas_df, table_name='my_table_7', if_exists='replace',
...             primary_index=['emp_id'],
...             partition_by_range='emp_id BETWEEN 100 AND 500')
Example 9: Save a Pandas DataFrame with valid time columns of DATE type to a temporal table
>>> import pandas as pd
>>> df = pd.DataFrame({
... 'id': [1, 2, 3],
... 'start_date': pd.to_datetime(['2024-01-01', '2024-02-01', '2024-03-01']).date,
... 'end_date': pd.to_datetime(['2024-01-10', '2024-02-10', '2024-03-10']).date,
... 'description': ['a', 'b', 'c']
... })
>>> copy_to_sql(
... df=df,
... table_name='temporal_table_pandas_date',
... valid_time_columns=('start_date', 'end_date')
... )
Example 10: Save a Pandas DataFrame with valid time columns of TIMESTAMP type to a temporal table. Name the derived column as valid_time
>>> import pandas as pd
>>> df = pd.DataFrame({
... 'id': [1, 2, 3],
... 'start_time': pd.to_datetime(['2024-01-01 10:00:00', '2024-02-01 11:00:00', '2024-03-01 12:00:00']),
... 'end_time': pd.to_datetime(['2024-01-01 12:00:00', '2024-02-01 13:00:00', '2024-03-01 14:00:00']),
... 'description': ['a', 'b', 'c']
... })
>>> copy_to_sql(
... df=df,
... table_name='temporal_table_pandas_timestamp',
... valid_time_columns=('start_time', 'end_time'),
... derived_column='valid_time'
... )
Example 11: Save a teradataml DataFrame with valid time column of PERIOD type to a temporal table
>>> from teradatasqlalchemy.types import PERIOD_DATE
>>> load_example_data("teradataml", "Employee_roles")
>>> df = DataFrame('Employee_roles')
>>> copy_to_sql(
... df,
... table_name = 'employee_roles_temporal',
... valid_time_columns='role_validity_period',
... types={'role_validity_period':PERIOD_DATE}
... )