Setup
- Teradata recommends using the same Python version in both the database server and the environment where teradataml runs.
- The function requires the dill package, with the same version installed on both the database server and the local environment.
- Teradata recommends using similar versions of Python libraries between the client machine and Analytics Database machine.
- The function being applied to the row or set of rows using map_row() or map_partition() must be defined in the current Python session. Any modules/packages being used by it must be available to use with Script Table Operator on the database servers. If the function is being imported from some package or module, that too must be available on the database server.
- Teradata recommends replacing empty values in character columns with a known placeholder value before using map_row() and map_partition(). This improves readability and avoids confusing NULL values with empty strings.
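The version recommendations above can be checked on the client side with a short script. This is only a sketch: the helper name local_versions and the list of packages are illustrative assumptions, and the values it reports still have to be compared by hand against what is installed on the database server.

```python
import sys
from importlib import metadata


def local_versions(packages=("dill", "pandas", "numpy")):
    """Collect the local Python version and the versions of selected packages.

    The package list is an assumption; adjust it to the libraries
    the user function actually imports.
    """
    versions = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed locally
    return versions


# Print a small report to compare against the database server.
for name, version in local_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

Running the same report in the Python environment on the database server, where that is possible, gives two lists that can be compared line by line.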
Execute Mode
Specifies the mode of execution for the user defined function.
- IN-DB: Executes the function on data in the teradataml DataFrame in Analytics Database and returns a teradataml DataFrame. This is the default execution mode.
- LOCAL: Executes the function locally on sample data from the teradataml DataFrame and returns a pandas DataFrame.
Input of Python Function
The user function receives one of the following as input:
- A pandas Series object corresponding to a row in the DataFrame, when the method called is map_row().
- An iterator (TextFileReader object) that reads the data from the partition of rows in chunks (pandas DataFrames), when the method called is map_partition().
As a result, the user function has access to the data to process in a familiar format. Design the functions to read from the Series object or iterator and manipulate the data accordingly.
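The two input shapes described above can be tried out locally with plain pandas before involving the database. The following is a minimal sketch, not the library's own mechanism: the sample data, the column names, and the function names double_amount and partition_total are hypothetical, and pandas.read_csv() with chunksize is used only as a local stand-in that yields a TextFileReader.

```python
import io

import pandas as pd

# Hypothetical sample data; column names are illustrative only.
CSV_TEXT = "id,amount\n1,10.0\n2,20.0\n3,30.0\n"


def double_amount(row):
    """Row-style function: receives one row as a pandas Series."""
    row = row.copy()
    row["amount"] = row["amount"] * 2
    return row


def partition_total(chunks):
    """Partition-style function: receives an iterator of DataFrame chunks."""
    total = 0.0
    n_rows = 0
    for chunk in chunks:  # each chunk is a pandas DataFrame
        total += chunk["amount"].sum()
        n_rows += len(chunk)
    return pd.Series({"n_rows": n_rows, "total": total})


# Local stand-ins for the inputs described above.
frame = pd.read_csv(io.StringIO(CSV_TEXT))
first_row = frame.iloc[0]  # a pandas Series for one row
reader = pd.read_csv(io.StringIO(CSV_TEXT), chunksize=2)  # a TextFileReader

doubled = double_amount(first_row)
summary = partition_total(reader)
print(doubled)
print(summary)
```

Writing the partition function as a loop over chunks keeps it independent of how many rows arrive in each chunk.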
Output of Python Function
The user function can return an object of one of the following types:
- pandas DataFrame having the same number of columns as expected in the output.
- pandas Series representing a row in the output of the method and having the same number of columns as expected in the output.
- numpy ndarray
  - One-dimensional: represents a row in the output, having the same number of columns as expected in the output.
  - Two-dimensional: represents a dataset (like a pandas DataFrame) having the same number of columns as expected in the output.
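The column-count rule for each supported return type can be expressed as a small local check. This is a sketch under stated assumptions: column_count is a hypothetical helper written for illustration, not part of teradataml, and the two-column sample values are made up.

```python
import numpy as np
import pandas as pd


def column_count(result):
    """Return how many output columns a returned object represents."""
    if isinstance(result, pd.DataFrame):
        return result.shape[1]  # one column per DataFrame column
    if isinstance(result, pd.Series):
        return len(result)  # a Series is a single output row
    if isinstance(result, np.ndarray):
        if result.ndim == 1:
            return result.shape[0]  # one-dimensional: a single row
        if result.ndim == 2:
            return result.shape[1]  # two-dimensional: a dataset
    raise TypeError(f"unsupported return type: {type(result).__name__}")


# Hypothetical results, each representing two output columns.
results = {
    "DataFrame": pd.DataFrame({"id": [1, 2], "amount": [10.0, 20.0]}),
    "Series": pd.Series([1, 10.0]),
    "1-D ndarray": np.array([1, 10.0]),
    "2-D ndarray": np.array([[1, 10.0], [2, 20.0]]),
}

for label, result in results.items():
    print(label, column_count(result))
```

A check like this can be run against a function's return value during local testing to catch a mismatch with the expected number of output columns early.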
The object returned by the user function is printed to the standard output as delimited lines (rows), using the specified delimiter and quotechar.
If the user function prints its output directly to the standard output (instead of returning an object of a supported type), then the function itself must format the printed output using the delimiter and quotechar, if and when specified.
The data printed to the standard output is then converted to and saved in a table in Analytics Database.
The table is deleted as a part of garbage collection as soon as a remove_context() call is issued. To persist these results, the DataFrame.to_sql() method can be used.
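For a user function that prints its own output rather than returning an object, the formatting requirement described above could be met with Python's standard csv module. This is a minimal sketch: print_rows and its stream parameter are hypothetical names introduced for illustration, and the default delimiter and quotechar shown here are assumptions that should be replaced with the values actually specified in the call.

```python
import csv
import io
import sys


def print_rows(rows, delimiter=",", quotechar='"', stream=None):
    """Write rows as delimited lines, quoting fields only when needed."""
    stream = sys.stdout if stream is None else stream
    writer = csv.writer(
        stream,
        delimiter=delimiter,
        quotechar=quotechar,
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\n",
    )
    for row in rows:
        writer.writerow(row)


# Writes to the standard output; the second row is quoted
# because its text field contains the delimiter.
print_rows([[1, "plain"], [2, "has,comma"]])

# The same rows captured in memory, which is convenient for local testing.
buffer = io.StringIO()
print_rows([[1, "plain"], [2, "has,comma"]], stream=buffer)
```

Letting csv.writer handle quoting avoids hand-written string joins that break when a value contains the delimiter or the quote character.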