fastload() | Teradata Package for Python - fastload() - Teradata Package for Python

Teradata® Package for Python User Guide

Deployment
VantageCloud
VantageCore
Edition
Enterprise
IntelliFlex
VMware
Product
Teradata Package for Python
Release Number
20.00
Published
December 2024
ft:locale
en-US
ft:lastEdition
2025-01-23
dita:mapPath
nvi1706202040305.ditamap
dita:ditavalPath
plt1683835213376.ditaval
dita:id
rkb1531260709148
lifecycle
latest
Product Category
Teradata Vantage

The fastload() function writes records from a Pandas DataFrame to Vantage using Fastload, and can be used to quickly load large amounts of data in an empty table on Vantage.

FastLoad opens multiple data transfer connections to the database. The number of data transfer sessions can be set using argument open_sessions. If this argument is not set, by default, data transfer sessions opened by teradataml is the smaller of 8 and the number of available AMPs in Vantage.

Teradata recommends using fastload() function when number of rows in the Pandas DataFrame is greater than 100,000 for better performance. To insert lesser rows, you can use the copy_to_sql() function for optimized performance. The data is loaded in batches.

  • FastLoad API cannot load duplicate rows in the DataFrame if the table is a MULTISET table with Primary Index.
  • FastLoad API does not support all Analytics Database data types.

    For example, target table having BLOB and CLOB data type columns cannot be loaded.

  • If there are any incorrect rows due to constraint violations, data type conversion errors, etc., FastLoad protocol ignores those rows and inserts all valid rows.
  • Rows in the DataFrame that failed to get inserted are categorized into errors and warnings by FastLoad protocol and these errors and warnings are stored into respective error and warning tables by FastLoad API.
  • fastload() creates 2 error tables when data is erroneous. These error tables are referred to as ERR_1 and ERR_2 tables.
    • ERR_1 table captures rows that violate the constraints or have format errors. It typically contains information about rows that could not be inserted into the target table due to data conversion errors, constraint violations, and so on.
    • ERR_2 table logs any duplicate rows found during the load process and which are not loaded in target table, since fastLoad does not allow duplicate rows to be loaded into the target table.
  • When save_errors argument is set to True, ERR_1 and ERR_2 tables are persisted. The fully qualified names of ERR_1, ERR_2 and warning tables are shown once the fastload operation is complete.
  • If user wants both error and warnings information from pandas dataframe to be persisted rather than that from ERR_1 and ERR_2 tables, then save_errors should be set to True and err_tbl_name must be provided.

See the FastLoad section of https://pypi.org/project/teradatasql/ for more information about FastLoad protocol through teradatasql driver.

Arguments

Required Arguments:
  • df: Specifies the pandas or teradataml DataFrame object to be saved in Vantage.
  • table_name: Specifies the name of the table to be created in Vantage.
Optional Arguments:
  • schema_name: Specifies the name of the SQL schema in Vantage to write to.

    The default value is None, which means use the default Vantage schema.

  • if_exists: Specifies the action to take when table already exists in the database.
    Possible values are:
    • 'fail': If table exists, do nothing;
    • 'replace': If table exists, drop it, recreate it, and insert data;
    • 'append': If table exists, insert data. Create if does not exist.

    The default value is 'replace'.

  • index: Specifies whether to save pandas DataFrame index as a column or not. Possible values are True or False.

    The default value is False.

  • index_label: Specifies the column labels for pandas DataFrame index columns.

    The default value is None.

  • primary_index: Specifies the column(s) to use as primary index while creating tables in Vantage. When None (default value), No Primary Index (NOPI) tables are created.
    For example:
    • primary_index = 'my_primary_index'
    • primary_index = ['my_primary_index1', 'my_primary_index2', 'my_primary_index3']
  • types: Specifies required data types for requested columns to be saved in Vantage.

    The default value is None.

    This argument accepts a dictionary of column names and their required teradatasqlalchemy types as key-value pairs (({column_name1: type_value1, ... column_nameN: type_valueN})), allowing to specify a subset of the columns of a specific type.

    • When only a subset of all columns are provided, the rest are defaulted to appropriate types.
    • When types argument is not provided, all column types are appropriately assigned and defaulted. The column types are assigned as listed in the following table.
      pandas/NumPy Type teradatasqlalchemy Type
      int32 INTEGER
      int64 BIGINT
      bool BYTEINT
      float32/float64 FLOAT
      datetime64/datatime64[ns] TIMESTAMP
      datatime64[ns,time_zone] TIMESTAMP(timezone=True)
      Any other data type VARCHAR(configure.default_varchar_size)
    • This argument does not have any effect when the table specified using table_name and schema_name exists and if_exists = 'append'.
  • batch_size: Specifies the number of rows to be loaded in a batch. The value of batch_size must be a positive integer.

    For better performance, Teradata recommends batch size to be at least 100,000.

    If this argument is None, which is the default value, there are two cases based on the number of rows, say N, in the dataframe 'df':
    • If N is greater than 100,000, the rows are divided into batches of equal size with each batch having at least 100,000 rows (except the last batch which might have more rows).
    • If N is less than 100,000, the rows are inserted in one batch after notifying the user that insertion happens with degradation of performance.

    If this argument is not None, the rows are inserted in batches of size given in the argument, irrespective of the recommended batch size.

    The last batch will have rows less than the batch size specified, if the number of rows is not an integral multiples of the argument batch_size.

  • save_errors: Specifies whether to persist the error and warning information in Vantage or not.
    • If set to True, ERR_1 and ERR_2 tables are persisted. The fully qualified names of ERR_1, ERR_2 and warning table are returned in a dictionary containing keys named as ERR_1_table, ERR_2_table, and warnings_table, respectively.
    • When save_errors is set to True and err_tbl_name is also provided, err_tbl_name takes precedence and error information is persisted into a single table using pandas Dataframe rather than in ERR_1 and ERR_2 tables.
    • When save_errors is set to False, errors and warnings information is not persisted as tables, but it is returned as pandas Dataframes in a dictionary containing keys named as errors_dataframe and warnings_dataframe, respectively.

    The default value is False.

  • open_sessions: Specifies the number of Teradata data transfer sessions to be opened for fastload operation.

    If this argument is not provided, the default value is the smaller of 8 or the number of AMPs available.

    See the FastLoad section of https://pypi.org/project/teradatasql/ for additional information about number of Teradata data transfer sessions opened during fastload.

  • err_tbl_1_suffix: Specifies the suffix for error table 1 created by a fastload job.

    The default value is _ERR_1.

  • err_tbl_2_suffix: Specifies the suffix for error table 2 created by a fastload job.

    The default value is _ERR_2.

  • err_tbl_name: Specifies the name for error table. This argument takes precedence over save_errors and saves error information in a single table, rather than ERR_1 and ERR_2 error tables.

    The default value is td_fl_<table_name>_err_<unique_id> where table_name is the name of the target/staging table and unique_id is the logon sequence number of the fastload job.

  • warn_tbl_name: Specifies the name for warning table.

    The default value is td_fl_<table_name>_warn_<unique_id> where table_name is the name of the target/staging table and unique_id is the logon sequence number of the fastload job.

  • err_staging_db: Specifies the name of the database to be used for creating staging table and error/warning tables.
    The current session user must have CREATE, DROP and INSERT table permissions on err_staging_db database.

Returns

FastLoad returns a dict containing the following attributes:
  • errors_dataframe: It is a Pandas DataFrame containing error messages thrown by fastload.

    DataFrame is empty if there are no errors or save_errors is set to True.

  • warnings_dataframe: It is a Pandas DataFrame containing warning messages thrown by fastload.

    DataFrame is empty if there are no warnings.

  • errors_table: Fully qualified name of the table containing errors..

    It is an empty string (''), if argument save_errors is set to False.

  • warnings_table: Fully qualified name of the table containing warnings.

    It is an empty string (''), if argument save_errors is set to False.

  • ERR_1_table: Fully qualified name of the ERR 1 table created by fastload job.

    It is an empty string (''), if argument save_errors is set to False.

  • ERR_2_table: Fully qualified name of the ERR 2 table created by fastload job.

    It is an empty string (''), if argument save_errors is set to False.

Minimum version requirements for fastload()

teradatasql version 16.20.00.48 or later is required for fastload() function to work properly. If you have a lower version installed, then teradatasql raises OperationalError and fastload() call ends with the following error:
[Teradata Database] [Error 3706] Syntax error: expected something between the beginning of the request and the word 'teradata_require_fastloadINSERT'.
pandas version 0.24 or later is required for fastload() API to work properly. If you have a lower version, fastload() API will fail with following error:
AttributeError: 'Index' object has no attribute 'to_list'
Install pandas >= 0.24 to solve this issue.

Example Setup

>>> from teradataml.dataframe.fastload import fastload
>>> from teradatasqlalchemy.types import *
>>> import pandas as pd
>>> df = {'emp_name': ['A1', 'A2', 'A3', 'A4'],
         'emp_sage': [100, 200, 300, 400],
         'emp_id': [133, 144, 155, 177],
         'marks': [99.99, 97.32, 94.67, 91.00]
          }
>>> pandas_df = pd.DataFrame(df)

Example 1: Save a Pandas DataFrame with default signature

>>> fastload(df = pandas_df, table_name = 'my_table')

Example 2: Save a Pandas DataFrame with primary_index

>>> pandas_df = pandas_df.set_index(['emp_id'])
>>> fastload(df = pandas_df, table_name = 'my_table_1', primary_index='emp_id')

Example 3: Save a Pandas DataFrame with index and primary_index

>>> fastload(df = pandas_df, table_name = 'my_table_2', index=True, primary_index='index_label')

Example 4: Save a Pandas DataFrame with types, appending to an existing table

>>> fastload(df = pandas_df, table_name = 'my_table_3', schema_name = 'alice', 
                             index = True, index_label = 'my_index_label', 
                             primary_index = ['emp_id'], if_exists = 'append', 
                             types = {'emp_name': VARCHAR, 'emp_sage':INTEGER, 
                             'emp_id': BIGINT, 'marks': DECIMAL})

Example 5: Save a Pandas DataFrame using levels in index of type MultiIndex

>>> pandas_df = pandas_df.set_index(['emp_id', 'emp_name'])
>>> fastload(df = pandas_df, table_name = 'my_table_4', schema_name = 'alice', 
                             index = True, index_label = ['index1', 'index2'], 
                             primary_index = ['index1'], if_exists = 'replace'

Example 6: Save a Pandas DataFrame by opening given number of Teradata data transfer sessions

This example saves a Pandas DataFrame by opening two Teradata data transfer sessions.

>>> fastload(df = pandas_df, table_name = 'my_table_5', open_sessions = 2)

Example 7: Save a Pandas Dataframe to a table in specified target database schema_name and errors and warnings to database specified with err_staging_db

This example saves errors to table named err_tbl_name and warnings to warn_tbl_name. This example assumes you are connected to a database different from schema_name and err_staging_db.

Create a pandas dataframe having one duplicate and one faulty row.

>>>> data_dict = {"C_ID": [301, 301, 302, 303, 304, 305, 306, 307, 308],
				 "C_timestamp": ['2014-01-06 09:01:25', '2014-01-06 09:01:25',
								 '2015-01-06 09:01:25.25.122200', '2017-01-06 09:01:25.11111',
								 '2013-01-06 09:01:25', '2019-03-06 10:15:28',
								 '2014-01-06 09:01:25.1098', '2014-03-06 10:01:02',
								 '2014-03-06 10:01:20.0000']}
>>> my_df = pd.DataFrame(data_dict)

Fastload data in non-default schema target_db and save errors and warnings in given tables.

>>> fastload(df=my_df, table_name='fastload_with_err_warn_tbl_stag_db',
			if_exists='replace', primary_index='C_ID', schema_name='target_db',
			types={'C_ID': INTEGER, 'C_timestamp': TIMESTAMP(6)},
			err_tbl_name='fld_errors', warn_tbl_name='fld_warnings',
			err_staging_db='stage_db')
Processed 9 rows in batch 1.
{'errors_dataframe':    batch_no                                      error_message
0         1   [Session 14527] [Teradata Database] [Error 26...,
'warnings_dataframe':         batch_no                                      error_message
0  batch_summary   [Session 14526] [Teradata SQL Driver] [Warnin...,
'errors_table': 'stage_db.fld_errors',
'warnings_table': 'stage_db.fld_warnings',
'ERR_1_table': '',
'ERR_2_table': ''}

Validate the loaded data table.

>>>  DataFrame(in_schema("target_db", "fastload_with_err_warn_tbl_stag_db"))
C_ID    C_timestamp
303     2017-01-06 09:01:25.111110
306     2014-01-06 09:01:25.109800
304     2013-01-06 09:01:25.000000
307     2014-03-06 10:01:02.000000
305     2019-03-06 10:15:28.000000
301     2014-01-06 09:01:25.000000
308     2014-03-06 10:01:20.000000

Validate the error and warning tables.

>>> DataFrame(in_schema("stage_db", "fld_errors"))
batch_no      error_message
1             [Session 14527] [Teradata Database] [Error 2673] FastLoad failed to insert 1 of 9 batched rows. Batched row 3 failed to insert because of Teradata Database error 2673 in "target_db"."fastload_with_err_warn_tbl_stag_db"."C_timestamp"
>>> DataFrame(in_schema("stage_db", "fld_warnings")
batch_no        error_message
batch_summary   [Session 14526] [Teradata SQL Driver] [Warning 518] Found 1 duplicate or faulty row(s) while ending FastLoad of database table "target_db"."fastload_with_err_warn_tbl_stag_db": expected a row count of 8, got a row count of 7

Example 8: Save a Pandas Dataframe to a table in specified target database schema_name and save errors in ERR_1 and ERR_2 tables having user defined suffixes provided in err_tbl_1_suffix and err_tbl_2_suffix

The source Pandas Dataframe is the same as Example 7.

>>> fastload(df = my_df, table_name = 'fastload_with_err_warn_tbl_stag_db',
			 schema_name = 'target_db',
if_exists = 'append',
			 types={'C_ID': INTEGER, 'C_timestamp': TIMESTAMP(6)},
			 err_staging_db='stage_db', save_errors=True,
			 err_tbl_1_suffix="_user_err_1", err_tbl_2_suffix="_user_err_2")
{'errors_dataframe': Empty DataFrame
 Columns: []
 Index: [],
 'warnings_dataframe':         batch_no
 0  batch_summary   [Session 14699] [Teradata SQL Driver] [Warnin...,
 'errors_table': '',
 'warnings_table': 'stage_db.td_fl_fastload_with_err_warn_tbl_stag_db_warn_1730',
 'ERR_1_table': 'stage_db.ml__fl_stag_1716272404181579_user_err_1',
 'ERR_2_table': 'stage_db.ml__fl_stag_1716272404181579_user_err_2'}

# Validate ERR_1 and ERR_2 tables.
>>> DataFrame(in_schema("stage_db", "ml__fl_stag_1716270574550744_user_err_1"))
ErrorCode	ErrorFieldName	DataParcel
2673	F_C_timestamp	b'12E...'

>>> DataFrame(in_schema("stage_db", "ml__fl_stag_1716270574550744_user_err_2"))
C_ID	C_timestamp