read_csv | Teradata Package for Python

Teradata® Package for Python User Guide

Deployment
VantageCloud
VantageCore
Edition
Enterprise
IntelliFlex
VMware
Product
Teradata Package for Python
Release Number
20.00
Published
March 2024
Language
English (United States)
Last Update
2024-04-09
dita:mapPath
nvi1706202040305.ditamap
dita:ditavalPath
plt1683835213376.ditaval
dita:id
rkb1531260709148
Product Category
Teradata Vantage

The read_csv() function loads data from a CSV file into Teradata Vantage. Use this function to quickly load large amounts of data into a table on Vantage using the FastLoadCSV protocol.

When loading data from a CSV file, optional arguments can be used to identify fields in the file.
  • sep specifies a single character string used to separate fields in a CSV file, with default value ',' (comma).
  • quotechar specifies a single character string used to quote fields in a CSV file, with default value '"' (double quote).
Considerations when using a CSV file:
  • Each record is on a separate line of the CSV file. Records are delimited by line breaks (CRLF).
  • The last record in the file may or may not have an ending line break.
  • The first line in the CSV file must be a header line. The header line lists the column names separated by the field separator (for example, col1,col2,col3).
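As a quick illustration, the layout described above (header line first, one record per line, CRLF line breaks) can be produced with Python's standard csv module; the file name sample.csv is just an example:

```python
import csv

# Rows for a small CSV file: header line first, then one record per line.
rows = [
    ["id", "fname", "lname", "marks"],  # header line: column names
    [101, "abc", "def", 200],
    [102, "lmn", "pqr", 105],
]

with open("sample.csv", "w", newline="") as f:
    # csv.writer defaults match read_csv's defaults: sep ',' and
    # quotechar '"'; records are terminated with CRLF.
    csv.writer(f).writerows(rows)

with open("sample.csv", newline="") as f:
    lines = f.read().splitlines()

print(lines[0])  # header line: id,fname,lname,marks
```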
Limitations when using a CSV file with FastLoad:
  • The read_csv function cannot load duplicate rows if the target is a MULTISET table with a primary index.
  • read_csv function does not support all Teradata Analytics Database data types.

    For example, target table having BLOB and CLOB data type columns cannot be loaded.

  • If some rows are incorrect due to constraint violations, data type conversion errors, and so on, the FastLoad protocol ignores those rows and inserts all valid rows.
  • Rows that fail to be inserted are categorized into errors and warnings by the FastLoad protocol, and these errors and warnings are stored into the respective error and warning tables by the FastLoad API.

Teradata recommends using the FastLoad protocol when the number of rows to load is at least 100,000. When the argument use_fastload is set to True, FastLoad opens multiple data transfer connections to the database. The number of data transfer sessions can be set with the argument open_sessions. If this argument is not set, the number of data transfer sessions opened by teradataml defaults to the smaller of 8 and the number of available AMPs in Vantage.
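The default described above amounts to taking the smaller of 8 and the AMP count; a minimal sketch (the AMP counts below are assumed example values, not queried from Vantage):

```python
def default_transfer_sessions(available_amps: int) -> int:
    # When open_sessions is not set, teradataml opens the smaller of
    # 8 and the number of available AMPs in Vantage.
    return min(8, available_amps)

print(default_transfer_sessions(4))   # → 4 (fewer AMPs than 8)
print(default_transfer_sessions(32))  # → 8 (capped at 8 sessions)
```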

See the CSV BATCH INSERTS section and FastLoad section of https://pypi.org/project/teradatasql/ for more information.

Required Arguments:
  • filepath: Specifies the CSV filepath including name of the file to load the data from.
  • table_name: Specifies the table name to load the data into.
  • types: Specifies the data types for requested columns to be saved in Vantage.

    The default value is None.

    This argument is optional when if_exists='append' and a non-PTI table already exists; it is required otherwise.
    This argument accepts a dictionary of column names and their required teradatasqlalchemy types as key-value pairs.
    It must be an OrderedDict if the CSV file does not contain a header line.
Optional Arguments:
  • sep: Specifies a single character string used to separate fields in a CSV file, with default value ',' (comma).
  • quotechar: Specifies a single character string used to quote fields in a CSV file, with default value '"' (double quote).
    • sep and quotechar cannot be a line feed ('\n') or a carriage return ('\r').
    • sep and quotechar cannot be the same.
    • The length of sep and quotechar must be 1.
  • schema_name: Specifies the name of the SQL schema in Vantage to write to.

    The default value is None, which means use the default Vantage schema.

  • if_exists: Specifies the action to take when table already exists in the database.
    Permitted values are:
    • 'fail': If table exists, raise TeradataMlException;
    • 'replace': If table exists, drop it, recreate it, and insert data;
    • 'append': If table exists, append data to the existing table.

    The default value is 'replace'.

  • primary_index: Specifies the column(s) to use as primary index while creating tables in Vantage.

    The default value is None, in which case No Primary Index (NOPI) tables are created.

    For example:
    • primary_index = 'my_primary_index'
    • primary_index = ['my_primary_index1', 'my_primary_index2', 'my_primary_index3']
  • set_table: Specifies a flag to determine whether to create a SET or a MULTISET table.
    • When True, a SET table is created.
    • When False, a MULTISET table is created. This is the default value.
    • Specifying set_table=True also requires specifying primary_index.
    • Creating SET table (set_table=True) may result in loss of duplicate rows, if CSV contains duplicate rows.
    • This argument has no effect if the table already exists and if_exists='append'.
  • temporary: Specifies whether to create a table as volatile.

    The default value is False.

    When set to True:
    • FastloadCSV protocol is not used for loading the data.
    • schema_name is ignored.
  • primary_time_index_name: Specifies a name for the Primary Time Index (PTI) when the table is to be created as PTI table.
    This argument is not required or used when the table to be created is not a PTI table. It is ignored if specified without the timecode_column.
  • timecode_column: Specifies the column in the DataFrame that reflects the form of the timestamp data in the time series.

    This column will be the TD_TIMECODE column in the table created. It should be of SQL type TIMESTAMP(n), TIMESTAMP(n) WITH TIMEZONE, or DATE, corresponding to Python types datetime.datetime or datetime.date.

    • This argument is required when the DataFrame must be saved as a PTI table.
    • This argument is not required or used when the table to be created is not a PTI table.
    • When this argument is specified, an attempt to create a PTI table will be made.
    • If this argument is specified, primary_index is ignored.
  • timezero_date: Specifies the earliest time series data that the PTI will accept; a date that precedes the earliest date in the time series data.

    Value specified must be of the following format: DATE 'YYYY-MM-DD'. Default value is DATE '1970-01-01'.

    • This argument is used when the DataFrame must be saved as a PTI table.
    • This argument is not required or used when the table to be created is not a PTI table.
    • It is ignored if specified without the timecode_column.
  • timebucket_duration: Specifies a duration that serves to break up the time continuum in the time series data into discrete groups or buckets.

    Value specified must use the formal form time_unit(n), where n is a positive integer, and time_unit can be any of the following: CAL_YEARS, CAL_MONTHS, CAL_DAYS, WEEKS, DAYS, HOURS, MINUTES, SECONDS, MILLISECONDS, or MICROSECONDS.

    • This argument is required if columns_list is not specified or is None.
    • It is used when the DataFrame must be saved as a PTI table.
    • This argument is not required or used when the table to be created is not a PTI table.
    • It is ignored if specified without the timecode_column.
  • columns_list: Specifies a list of one or more PTI table column names.
    • This argument is required if timebucket_duration is not specified.
    • It is used when the CSV data must be saved as a PTI table.
    • This argument is not required or used when the table to be created is not a PTI table.
    • It is ignored if specified without the timecode_column.
  • sequence_column: Specifies the column of type Integer containing the unique identifier for time series data reading when they are not unique in time.
    • When specified, implies SEQUENCED, meaning more than one reading from the same sensor may have the same timestamp. This column will be the TD_SEQNO column in the table created.
    • When not specified, implies NONSEQUENCED, meaning there is only one sensor reading per timestamp. This is the default.
    • This argument is used when the CSV data must be saved as a PTI table.
    • This argument is not required or used when the table to be created is not a PTI table.
    • It is ignored if specified without the timecode_column.
  • seq_max: Specifies the maximum number of data rows that can have the same timestamp. It can be used when sequence_column is specified.

    Permitted values range from 1 to 2147483647, with default value 20000.

    • This argument is used when the CSV data must be saved as a PTI table.
    • This argument is not required or used when the table to be created is not a PTI table.
    • It is ignored if specified without the timecode_column argument.
  • save_errors: Specifies whether to persist the error and warning information in Vantage.
    • If set to False (the default):

      - Errors or warnings (if any) are not persisted into tables;

      - Error tables generated by FastloadCSV are not persisted.

    • If set to True:

      - The error and warning information is persisted into tables, and the names of the error and warning tables are returned;

      - The error tables generated by FastloadCSV are persisted, and their names are returned.

  • catch_errors_warnings: Specifies whether to catch errors and warnings (if any) raised by the FastLoad protocol while loading data into the Vantage table.

    When set to False (the default), the function does not catch any errors or warnings; otherwise, it catches errors and warnings, if any, and returns them as a dictionary along with the teradataml DataFrame.

    See the following Returns section for more details.

  • use_fastload: Specifies whether to use the FastloadCSV protocol. Default value is True.

    Teradata recommends using FastLoad when the number of rows to load is at least 100,000. To load fewer rows, set this argument to False. FastLoad opens multiple data transfer connections to the database.

    • When this argument is set to True, you can load the data into the following types of tables using the FastloadCSV protocol:

      - Set table

      - Multiset table

    • When this argument is set to False, you can load the data in following types of tables:

      - Set table

      - Multiset table

      - PTI table

      - Volatile table

  • open_sessions: Specifies the number of Teradata data transfer sessions to be opened for the FastLoad operation.

    If this argument is not provided, the default value is the smaller of 8 and the number of available AMPs.

    See the FastLoad section of https://pypi.org/project/teradatasql/ for additional information about number of Teradata data transfer sessions opened during fastload.
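Among the arguments above, sep and quotechar carry character constraints (single character, not a line break, not equal to each other) that can be sketched as a standalone check; validate_csv_chars is a hypothetical helper for illustration, not part of teradataml:

```python
def validate_csv_chars(sep: str, quotechar: str) -> None:
    # Hypothetical helper mirroring the documented constraints on
    # the sep and quotechar arguments of read_csv.
    for name, value in (("sep", sep), ("quotechar", quotechar)):
        if len(value) != 1:
            raise ValueError(f"length of {name} should be 1")
        if value in ("\n", "\r"):
            raise ValueError(f"{name} cannot be a line feed or carriage return")
    if sep == quotechar:
        raise ValueError("sep and quotechar cannot be the same")

validate_csv_chars(",", '"')  # the defaults pass without error
```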

Returns

Based on the input, the read_csv function returns different output:
  • If use_fastload is set to 'False', it returns a teradataml DataFrame.
  • If use_fastload is set to 'True' and catch_errors_warnings is set to 'False', it returns only a teradataml DataFrame.
  • If use_fastload is set to 'True' and catch_errors_warnings is set to 'True', it returns a tuple containing a teradataml DataFrame and a dict with the following attributes:
    • errors_dataframe: A Pandas DataFrame containing error messages raised by FastLoad.

      DataFrame is empty if there are no errors.

    • warnings_dataframe: A Pandas DataFrame containing warning messages raised by FastLoad.

      DataFrame is empty if there are no warnings.

    • errors_table: Name of the table containing errors.

      It is None if the argument save_errors is set to 'False'.

    • warnings_table: Name of the table containing warnings.

      It is None if the argument save_errors is set to 'False'.

    • fastloadcsv_error_tables: Names of the tables containing errors generated by FastLoadCSV.

      It is an empty list if the argument save_errors is set to 'False'.

Example Setup

>>> import csv
>>> from collections import OrderedDict
>>> from teradataml.dataframe.data_transfer import read_csv
>>> from teradatasqlalchemy.types import *
>>> csv_filename = "test_file.csv"
>>> table_name = "read_csv_table"
>>> record = [
        'id,fname,lname,marks',
        '101,abc,def,200',
        '102,lmn,pqr,105',
        '103,def,xyz,492',
        '101,abc,def,300'
    ]
>>> # Function to write list data into csv file.
>>> def write_data_into_csv(filename, record):
        with open(filename, 'w', newline='') as csvFile:
            csvWriter = csv.writer(csvFile, delimiter='\n')
            csvWriter.writerow(record)
>>>
>>> write_data_into_csv(csv_filename, record)

Example 1: Default execution with types passed as OrderedDict

This example loads the data from a CSV file into a table, with the argument types passed as an OrderedDict.


>>> types = OrderedDict(id=BIGINT, fname=VARCHAR, lname=VARCHAR, marks=FLOAT)
>>> read_csv('test_file.csv', 'my_first_table', types)

Example 2: Catch all errors and warnings, and store in the table

This example loads the data from a CSV file into a table using the FastloadCSV protocol, catches all errors and warnings, and stores them in tables.

>>> types = OrderedDict(id=BIGINT, fname=VARCHAR, lname=VARCHAR, marks=FLOAT)
>>> read_csv(filepath='test_file.csv',
             table_name='my_first_table1', 
             types=types,
             save_errors=True, 
             catch_errors_warnings=True)

Example 3: Load data into an existing table, replace the table

This example loads the data from a CSV file into a table using the FastloadCSV protocol. If the table exists, it is replaced.

>>> types = OrderedDict(id=BIGINT, fname=VARCHAR, lname=VARCHAR, marks=FLOAT)
>>> read_csv(filepath='test_file.csv',
             table_name='my_first_table',
             types=types, 
             if_exists='replace')

Example 4: Load data into an existing table, append the table

This example loads the data from a CSV file into a table using the FastloadCSV protocol. If the table exists in the specified schema, data is appended to it.

>>> # Create new table in Vantage.
>>> read_csv(filepath='test_file.csv',
             table_name='my_first_table',
             types=types, 
             if_exists='fail')
>>> # Write different data into csv file.
>>> record = [
        'id,fname,lname,marks',
        '501,orl,mal,740'
    ]
>>> write_data_into_csv(csv_filename, record)
>>>
>>> # Append the existing table.
>>> read_csv(filepath='test_file.csv',
             table_name='my_first_table',
             if_exists='append')

Example 5: Load data into a SET table, catch all errors and warnings

This example loads the data from a CSV file into a SET table using the FastloadCSV protocol, catching all errors and warnings and storing them in tables.

>>> # Write duplicate rows in csv file.
>>> record = [
            'id,fname,lname,marks',
            '101,abc,def,200',
            '102,lmn,pqr,105',
            '103,def,xyz,492',
            '101,abc,def,200'
        ]
>>> write_data_into_csv(csv_filename, record)
>>> # Create SET table.
>>> read_csv(filepath='test_file.csv',
             table_name='my_first_table',
             types=types, 
             if_exists='replace',
             set_table=True, 
             primary_index='id',
             save_errors=True, 
             catch_errors_warnings=True)

Example 6: Load data into a table without FastLoad protocol

This example loads the data from a CSV file with DATE and TIMESTAMP columns into a table without the FastLoad protocol. If the table exists in the specified schema, data is appended to it.

>>> # Write date and timestamp in csv file.
>>> record = [
            'id,fname,lname,marks,admission_date,admission_time',
            '101,abc,def,200,2009-07-29,2009-07-29 20:17:59',
            '102,lmn,pqr,105,2009-07-29,2009-07-29 20:17:59',
            '103,def,xyz,492,2009-07-29,2009-07-29 20:17:59',
            '101,abc,def,200,2009-07-29,2009-07-29 20:17:59'
        ]
>>> write_data_into_csv(csv_filename, record)
>>> types = OrderedDict(id=BIGINT, fname=VARCHAR, lname=VARCHAR, marks=FLOAT, admission_date=DATE, admission_time=TIMESTAMP)
>>> read_csv(filepath='test_file.csv',
             table_name='my_first_table',
             types=types, 
             if_exists='replace',
             use_fastload=False)
>>> # Append the existing table.
>>> read_csv(filepath='test_file.csv',
             table_name='my_first_table',
             if_exists='append',
             use_fastload=False)

Example 7: Load data into a temporary table without FastLoad protocol

This example loads the data from a CSV file into a temporary table without the FastLoad protocol. If the table exists in the specified schema, data is appended to it.

>>> types = OrderedDict(id=BIGINT, fname=VARCHAR, lname=VARCHAR, marks=FLOAT, admission_date=DATE, admission_time=TIMESTAMP)
>>> read_csv(filepath='test_file.csv',
             table_name='my_first_table',
             types=types, 
             if_exists='replace',
             temporary=True)
>>> # Append the existing table.
>>> read_csv(filepath='test_file.csv',
             table_name='my_first_table',
             if_exists='append',
             temporary=True)

Example 8: Load data into a PTI table

This example loads data from a CSV file with TIMESTAMP columns into a PTI table. If the specified table exists, data is appended to it; otherwise, a new table is created.

>>> types = OrderedDict(partition_id=INTEGER, adid=INTEGER, productid=INTEGER, event=VARCHAR, clicktime=TIMESTAMP)
>>> read_csv(filepath='test_file.csv',
             table_name='my_first_read_csv_pti_table',
             types=types, 
             if_exists='append',
             timecode_column='clicktime',
             columns_list='event',
             use_fastload=False)

Example 9: Load data into a SET PTI table

This example loads data from a CSV file with TIMESTAMP columns into a SET PTI table. If the specified table exists, data is appended to it; otherwise, a new table is created.

>>> types = OrderedDict(partition_id=INTEGER, adid=INTEGER, productid=INTEGER, event=VARCHAR, clicktime=TIMESTAMP)
>>> read_csv(filepath='test_file.csv',
             table_name='my_first_read_csv_pti_table',
             types=types, 
             if_exists='append',
             timecode_column='clicktime',
             columns_list='event',
             set_table=True)

Example 10: Load data into a temporary PTI table

This example loads data from a CSV file with TIMESTAMP columns into a temporary PTI table. If the specified table exists, data is appended to it; otherwise, a new table is created.

>>> types = OrderedDict(partition_id=INTEGER, adid=INTEGER, productid=INTEGER, event=VARCHAR, clicktime=TIMESTAMP)
>>> read_csv(filepath='test_file.csv',
             table_name='my_first_read_csv_pti_table',
             types=types, 
             if_exists='append',
             timecode_column='clicktime',
             columns_list='event',
             temporary=True)

Example 11: Load data into a table by opening given number of Teradata data transfer sessions

This example loads the data from a CSV file into a Vantage table by opening two Teradata data transfer sessions.

>>> types = OrderedDict(id=BIGINT, fname=VARCHAR, lname=VARCHAR, marks=FLOAT)
>>> read_csv(filepath='test_file.csv',  
             table_name='my_first_table_with_open_sessions',
             types=types, 
             open_sessions=2)

Example 12: Load data into table and set primary index

This example loads data from a CSV file into a Vantage table and sets the primary index provided through the primary_index argument.

>>> types = OrderedDict(id=BIGINT, fname=VARCHAR, lname=VARCHAR, marks=FLOAT)
>>> read_csv(filepath='test_file.csv',  
             table_name='my_first_table_with_primary_index',
             types=types, 
             primary_index = ['fname'])