Teradata® Package for Python Function Reference on VantageCloud Lake
- Deployment
- VantageCloud
- Edition
- Lake
- Product
- Teradata Package for Python
- Release Number
- 20.00.00.03
- Published
- December 2024
- ft:locale
- en-US
- ft:lastEdition
- 2024-12-19
- dita:id
- TeradataPython_FxRef_Lake_2000
- Product Category
- Teradata Vantage
- teradataml.dataframe.data_transfer.read_csv = read_csv(filepath, table_name, types=None, sep=',', quotechar='"', schema_name=None, if_exists='replace', primary_index=None, set_table=False, temporary=False, primary_time_index_name=None, timecode_column=None, timebucket_duration=None, timezero_date=None, columns_list=None, sequence_column=None, seq_max=None, catch_errors_warnings=False, save_errors=False, use_fastload=True, open_sessions=None)
- The read_csv() API loads data from a CSV file into Teradata Vantage.
The function can be used to quickly load large amounts of data into a table on Vantage
using the FastloadCSV protocol.
Considerations when using a CSV file:
* Each record is on a separate line of the CSV file. Records are delimited
by line breaks (CRLF). The last record in the file may or may not have an
ending line break.
* The first line in the CSV must be a header line. The header line lists
the column names separated by the field separator (e.g. col1,col2,col3).
* Using a CSV file with FastLoad has limitations as follows:
1. The read_csv API cannot load duplicate rows if the table is a
MULTISET table having a primary index.
2. The read_csv API does not support all Teradata Advanced SQL Engine data types.
For example, a target table having BLOB or CLOB data type columns cannot be
loaded.
3. If there are any incorrect rows, e.g., due to constraint violations or data type
conversion errors, the FastLoad protocol ignores those rows and inserts
all valid rows.
4. Rows that fail to get inserted are categorized into errors
and warnings by the FastLoad protocol, and these errors and warnings are stored
in respective error and warning tables by the FastLoad API.
5. Teradata recommends using the FastLoad protocol when the number of rows to be loaded
is at least 100,000. FastLoad opens multiple data transfer connections to the
database.
For additional information about the FastLoadCSV protocol through the teradatasql driver,
refer to the CSV BATCH INSERTS section of the driver documentation:
https://pypi.org/project/teradatasql/#CSVBatchInserts
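To illustrate the layout described above, the sketch below (Python standard library only; the file name and columns are illustrative, matching the examples later on this page) writes a CSV that follows these rules: a header line first, then one record per line, with fields separated by the default "sep" and quoted with the default "quotechar":

```python
import csv

# Write a small CSV in the layout read_csv() expects:
# header line first, then one record per line.
with open("test_file.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=",", quotechar='"')
    writer.writerow(["id", "fname", "lname", "marks"])   # header line
    writer.writerow([1, "Alice", "Smith", 95.5])
    writer.writerow([2, 'Bob "Bobby"', "Jones", 88.0])   # embedded quote gets escaped

with open("test_file.csv") as f:
    print(f.readline().rstrip())   # -> id,fname,lname,marks
```

The csv module quotes only the fields that need it; the embedded quote in the last record is doubled inside a quoted field, per standard CSV conventions.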
PARAMETERS:
filepath:
Required Argument.
Specifies the CSV filepath including name of the file to load the data from.
Types: String
table_name:
Required Argument.
Specifies the table name to load the data into.
Types: String
types:
Optional Argument when "if_exists" is 'append' and a non-PTI table already exists, required otherwise.
Specifies the data types for requested columns to be saved in Vantage.
Keys of this dictionary should be the name of the columns and values should be
teradatasqlalchemy.types.
Default Value: None
Note:
* If specified when "if_exists" is set to 'append' and the table exists, this argument is ignored.
Types: OrderedDict
sep:
Optional Argument.
Specifies a single character string used to separate fields in a CSV file.
Default Value: ","
Notes:
* "sep" cannot be line feed ('\n') or carriage return ('\r').
* "sep" should not be same as "quotechar".
* Length of "sep" argument should be 1.
Types: String
quotechar:
Optional Argument.
Specifies a single character string used to quote fields in a CSV file.
Default Value: '"'
Notes:
* "quotechar" cannot be line feed ('\n') or carriage return ('\r').
* "quotechar" should not be same as "sep".
* Length of "quotechar" argument should be 1.
Types: String
schema_name:
Optional Argument.
Specifies the name of the database/schema in Vantage to write to.
Default Value: None (Uses default database/schema).
Types: String
if_exists:
Optional Argument.
Specifies the action to take when table already exists in Vantage.
Permitted Values: 'fail', 'replace', 'append'
- fail: If table exists, raise TeradataMlException.
- replace: If table exists, drop it, recreate it, and insert data.
- append: If table exists, append to the existing table.
Default Value: replace
Types: String
primary_index:
Optional Argument.
Specifies which column(s) to use as primary index while creating table
in Vantage. When set to None, No Primary Index (NoPI) tables are created.
Default Value: None
Types: String or list of strings
Example:
primary_index = 'my_primary_index'
primary_index = ['my_primary_index1', 'my_primary_index2', 'my_primary_index3']
set_table:
Optional Argument.
Specifies a flag to determine whether to create a SET or a MULTISET table.
When set to True, a SET table is created, otherwise a MULTISET table is created.
Default Value: False
Notes:
1. Specifying set_table=True also requires specifying primary_index.
2. Creating SET table (set_table=True) results in
a. loss of duplicate rows, if the CSV contains any duplicates.
3. This argument has no effect if the table already exists and if_exists='append'.
Types: Boolean
temporary:
Optional Argument.
Specifies whether to create table as volatile.
Default Value: False
Notes:
When set to True
1. FastloadCSV protocol is not used for loading the data.
2. "schema_name" is ignored.
Types: Boolean
primary_time_index_name:
Optional Argument.
Specifies the name for the Primary Time Index (PTI) when the table
is to be created as PTI table.
Note:
This argument is not required or used when the table to be created
is not a PTI table. It will be ignored if specified without the "timecode_column".
Types: String
timecode_column:
Optional Argument.
Required when the CSV data must be saved as a PTI table.
Specifies the column in the csv that reflects the form
of the timestamp data in the time series.
This column will be the TD_TIMECODE column in the table created.
It should be of SQL type TIMESTAMP(n), TIMESTAMP(n) WITH TIMEZONE, or DATE,
corresponding to Python types datetime.datetime or datetime.date.
Note:
When "timecode_column" argument is specified, an attempt to create a PTI table
will be made. This argument is not required when the table to be created
is not a PTI table. If this argument is specified, "primary_index" will be ignored.
Types: String
timezero_date:
Optional Argument.
Used when the CSV data must be saved as a PTI table.
Specifies the earliest time series data that the PTI table will accept,
a date that precedes the earliest date in the time series data.
Value specified must be of the following format: DATE 'YYYY-MM-DD'
Default Value: DATE '1970-01-01'.
Note:
This argument is not required or used when the table to be created
is not a PTI table. It will be ignored if specified without the "timecode_column".
Types: String
timebucket_duration:
Optional Argument.
Required if "columns_list" is not specified or is None.
Used when the CSV data must be saved as a PTI table.
Specifies a duration that serves to break up the time continuum in
the time series data into discrete groups or buckets.
Specified using the formal form time_unit(n), where n is a positive
integer, and time_unit can be any of the following:
CAL_YEARS, CAL_MONTHS, CAL_DAYS, WEEKS, DAYS, HOURS, MINUTES,
SECONDS, MILLISECONDS, or MICROSECONDS.
Note:
This argument is not required or used when the table to be created
is not a PTI table. It will be ignored if specified without the "timecode_column".
Types: String
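Because the time_unit(n) form is easy to mistype, a small client-side check can reject malformed values before a call reaches Vantage. This helper is not part of teradataml; it is a hypothetical sketch built from the units listed above:

```python
import re

# Units accepted for "timebucket_duration", per the description above.
_TIME_UNITS = ("CAL_YEARS", "CAL_MONTHS", "CAL_DAYS", "WEEKS", "DAYS",
               "HOURS", "MINUTES", "SECONDS", "MILLISECONDS", "MICROSECONDS")

def is_valid_timebucket_duration(value):
    """Return True if value matches the formal form time_unit(n), n > 0."""
    match = re.fullmatch(r"(%s)\((\d+)\)" % "|".join(_TIME_UNITS), value)
    return bool(match) and int(match.group(2)) > 0

print(is_valid_timebucket_duration("MINUTES(10)"))  # -> True
print(is_valid_timebucket_duration("MINUTES(0)"))   # -> False
```

The check is deliberately strict (uppercase unit names, positive integer n); it validates only the textual form, not whether the duration makes sense for the data.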
columns_list:
Optional Argument.
Used when the CSV data must be saved as a PTI table.
Required if "timebucket_duration" is not specified.
A list of one or more PTI table column names.
Note:
This argument is not required or used when the table to be created
is not a PTI table. It will be ignored if specified without the "timecode_column".
Types: String or list of Strings
sequence_column:
Optional Argument.
Used when the CSV data must be saved as a PTI table.
Specifies the column of type Integer containing the unique identifier for
time series data readings when they are not unique in time.
* When specified, implies SEQUENCED, meaning more than one reading from the same
sensor may have the same timestamp.
This column will be the TD_SEQNO column in the table created.
* When not specified, implies NONSEQUENCED, meaning there is only one sensor reading
per timestamp.
This is the default.
Note:
This argument is not required or used when the table to be created
is not a PTI table. It will be ignored if specified without the "timecode_column".
Types: String
seq_max:
Optional Argument.
Used when the CSV data must be saved as a PTI table.
Specifies the maximum number of data rows that can have the
same timestamp. Can be used when "sequence_column" is specified.
Permitted range: 1 - 2147483647.
Default Value: 20000.
Note:
This argument is not required or used when the table to be created
is not a PTI table. It will be ignored if specified without the "timecode_column".
Types: Integer
save_errors:
Optional Argument.
Specifies whether to persist the errors/warnings (if any) information in Vantage
or not.
If "save_errors" is set to False:
1. Errors or warnings (if any) are not persisted into tables.
2. Error tables generated by FastloadCSV are not persisted.
If "save_errors" is set to True:
1. The errors or warnings information is persisted into tables and the names
of the error and warning tables are returned. Otherwise, the function
returns None for the names of the tables.
2. The error tables generated by FastloadCSV are persisted and the names of
the error tables are returned. Otherwise, the function returns an empty
list for the names of the tables.
Default Value: False
Types: Boolean
catch_errors_warnings:
Optional Argument.
Specifies whether to catch errors/warnings (if any) raised by the FastLoad
protocol while loading data into the Vantage table.
When set to False, the function does not catch any errors or warnings;
otherwise, it catches errors and warnings, if any, and returns them
as a dictionary along with the teradataml DataFrame.
See the 'RETURNS' section for more details.
Default Value: False
Types: Boolean
use_fastload:
Optional Argument.
Specifies whether to use the FastloadCSV protocol or not.
Default Value: True
Notes:
1. Teradata recommends using FastLoad when the number of rows to be loaded
is at least 100,000. To load fewer rows, set this argument to False.
FastLoad opens multiple data transfer connections to the database.
2. When "use_fastload" is set to True, one can load the data into the
following types of tables using the FastloadCSV protocol:
a. Set table
b. Multiset table
3. When "use_fastload" is set to False, one can load the data in following
types of tables:
a. Set table
b. Multiset table
c. PTI table
d. Volatile table
Types: Boolean
open_sessions:
Optional Argument.
Specifies the number of Teradata data transfer sessions to be opened for the FastLoad operation.
Note: If the "open_sessions" argument is not provided, the default value is the smaller
of 8 and the number of available AMPs.
For additional information about number of Teradata data-transfer
sessions opened during fastload, please refer to:
https://pypi.org/project/teradatasql/#FastLoad
Default Value: None
Types: Integer
RETURNS:
When "use_fastload" is set to False, returns a teradataml DataFrame.
When "use_fastload" is set to True, read_csv() returns the following:
When "catch_errors_warnings" is set to False, returns only a teradataml DataFrame.
When "catch_errors_warnings" is set to True, read_csv() returns a tuple containing:
a. teradataml DataFrame.
b. a dict containing the following attributes:
a. errors_dataframe: It is a Pandas DataFrame containing error messages
thrown by fastload. DataFrame is empty if there are no errors.
b. warnings_dataframe: It is a Pandas DataFrame containing warning messages
thrown by fastload. DataFrame is empty if there are no warnings.
c. errors_table: Name of the table containing errors. It is None, if
argument save_errors is False.
d. warnings_table: Name of the table containing warnings. It is None, if
argument save_errors is False.
e. fastloadcsv_error_tables: Names of the tables containing errors generated
by FastloadCSV. It is an empty list, if argument "save_errors" is False.
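When "catch_errors_warnings" is True, the returned dict can be inspected to decide how a load went. The sketch below is runnable without a Vantage connection by substituting plain lists for the pandas DataFrames; in a live session the dict would come from the read_csv() call shown in the comment, and len() works the same way on a pandas DataFrame (number of rows):

```python
# In a live session this dict would come from:
#   df, info = read_csv('test_file.csv', 'my_table', types,
#                       catch_errors_warnings=True, save_errors=True)
# Stand-in values are used here so the logic is runnable without a connection.
info = {
    "errors_dataframe": [],            # empty -> no errors (pandas DataFrame in practice)
    "warnings_dataframe": ["row 7 skipped"],
    "errors_table": None,              # None when save_errors=False
    "warnings_table": None,
    "fastloadcsv_error_tables": [],    # empty list when save_errors=False
}

def load_status(info):
    """Classify a read_csv() errors/warnings dict as 'clean', 'warnings', or 'errors'."""
    if len(info["errors_dataframe"]) > 0:
        return "errors"
    if len(info["warnings_dataframe"]) > 0:
        return "warnings"
    return "clean"

print(load_status(info))  # -> warnings
```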
RAISES:
TeradataMlException
EXAMPLES:
>>> from teradataml.dataframe.data_transfer import read_csv
>>> from teradatasqlalchemy.types import *
>>> from collections import OrderedDict
# Example 1: Default execution with types argument is passed as OrderedDict.
>>> types = OrderedDict(id=BIGINT, fname=VARCHAR, lname=VARCHAR, marks=FLOAT)
>>> read_csv('test_file.csv', 'my_first_table', types)
# Example 2: Load the data from CSV file into a table using fastload CSV protocol,
# while doing so catch all errors and warnings as well as store those in the table.
>>> types = OrderedDict(id=BIGINT, fname=VARCHAR, lname=VARCHAR, marks=FLOAT)
>>> read_csv(filepath='test_file.csv',
table_name='my_first_table1', types=types,
save_errors=True, catch_errors_warnings=True)
# Example 3: Load the data from CSV file into a table using fastload CSV protocol.
# If table exists, then replace the same. Catch all errors and warnings as well as
# store those in the table.
>>> types = OrderedDict(id=BIGINT, fname=VARCHAR, lname=VARCHAR, marks=FLOAT)
>>> read_csv(filepath='test_file.csv',
table_name='my_first_table',
types=types, if_exists='replace',
save_errors=True, catch_errors_warnings=True)
# Example 4: Load the data from CSV file into a table using fastload CSV protocol.
# If table exists in the specified schema, then append to the same. Catch all
# errors and warnings as well as store those in the table.
>>> types = OrderedDict(id=BIGINT, fname=VARCHAR, lname=VARCHAR, marks=FLOAT)
>>> read_csv(filepath='test_file.csv',
table_name='my_first_table',
types=types, if_exists='fail',
save_errors=True, catch_errors_warnings=True)
>>> read_csv(filepath='test_file.csv',
table_name='my_first_table',
if_exists='append',
save_errors=True, catch_errors_warnings=True)
# Example 5: Load the data from CSV file into a SET table using fastload CSV protocol.
# Catch all errors and warnings as well as store those in the table.
>>> types = OrderedDict(id=BIGINT, fname=VARCHAR, lname=VARCHAR, marks=FLOAT)
>>> read_csv(filepath='test_file.csv',
table_name='my_first_table',
types=types, if_exists='replace',
set_table=True, primary_index='id',
save_errors=True, catch_errors_warnings=True)
# Example 6: Load the data from CSV file into a temporary table without fastloadCSV protocol.
# If table exists, then append to the same.
>>> types = OrderedDict(id=BIGINT, fname=VARCHAR, lname=VARCHAR, marks=FLOAT)
>>> read_csv(filepath='test_file.csv',
table_name='my_first_table',
types=types, if_exists='replace',
temporary=True)
>>> read_csv(filepath='test_file.csv',
table_name='my_first_table',
if_exists='append',
temporary=True)
# Example 7: Load the data from CSV file with DATE and TIMESTAMP columns into
# a table without Fastload protocol. If table exists in specified
# schema, then append to the table.
>>> types = OrderedDict(id=BIGINT, fname=VARCHAR, lname=VARCHAR, marks=FLOAT,
admission_date=DATE, admission_time=TIMESTAMP)
>>> read_csv(filepath='test_file.csv',
table_name='my_first_table',
types=types, if_exists='fail',
use_fastload=False)
>>> read_csv(filepath='test_file.csv',
table_name='my_first_table',
if_exists='append',
use_fastload=False)
# Example 8: Load the data from CSV file with TIMESTAMP columns into
# a PTI table. If specified table exists then append to the table,
# otherwise create a new table.
>>> types = OrderedDict(partition_id=INTEGER, adid=INTEGER, productid=INTEGER,
event=VARCHAR, clicktime=TIMESTAMP)
>>> read_csv(filepath='test_file.csv',
table_name='my_first_read_csv_pti_table',
types=types, if_exists='append',
timecode_column='clicktime',
columns_list='event',
use_fastload=False)
# Example 9: Load the data from CSV file with TIMESTAMP columns into
# a SET PTI table. If specified table exists then append to the table,
# otherwise create a new table.
>>> types = OrderedDict(partition_id=INTEGER, adid=INTEGER, productid=INTEGER,
event=VARCHAR, clicktime=TIMESTAMP)
>>> read_csv(filepath='test_file.csv',
table_name='my_first_read_csv_pti_table',
types=types, if_exists='append',
timecode_column='clicktime',
columns_list='event',
set_table=True)
# Example 10: Load the data from CSV file with TIMESTAMP columns into
# a temporary PTI table. If specified table exists then append to the table,
# otherwise create a new table.
>>> types = OrderedDict(partition_id=INTEGER, adid=INTEGER, productid=INTEGER,
event=VARCHAR, clicktime=TIMESTAMP)
>>> read_csv(filepath='test_file.csv',
table_name='my_first_read_csv_pti_table',
types=types, if_exists='append',
timecode_column='clicktime',
columns_list='event',
temporary=True)
# Example 11: Load the data from CSV file into Vantage table by opening specified
# number of Teradata data transfer sessions.
>>> types = OrderedDict(id=BIGINT, fname=VARCHAR, lname=VARCHAR, marks=FLOAT)
>>> read_csv(filepath='test_file.csv', table_name='my_first_table_with_open_sessions',
types=types, open_sessions=2)
# Example 12: Load the data from CSV file into Vantage table and set primary index provided
# through primary_index argument.
>>> types = OrderedDict(id=BIGINT, fname=VARCHAR, lname=VARCHAR, marks=FLOAT)
>>> read_csv(filepath='test_file.csv', table_name='my_first_table_with_primary_index',
types=types, primary_index = ['fname'])