Use the map_row() and map_partition() methods to apply a Python function to each row or group of rows in a DataFrame, and return the result to the user as a teradataml DataFrame.
Python’s built-in map function applies a function to each item in a list. The map_row() and map_partition() methods are similar, applying a function to each row or to each group (partition) of rows of a DataFrame, respectively.
To run Python code in-database, the Advanced SQL Engine offers the Script Table Operator (STO). Both the map_row() and map_partition() methods leverage STO through the Script object in teradataml. Running Python code through STO involves the following steps:
- Write the Python code to process the rows in a script.
- Make sure that the script parses the input data from the input stream so that the values in the individual columns can be accessed.
- Install this script in Vantage.
- Run the STO query to invoke this script on the required dataset on Advanced SQL Engine.
- (Optional) Remove the installed script.
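The heart of that workflow is a script that reads delimited rows from standard input and prints delimited output rows. A minimal sketch of such a row-processing script follows; the two-column layout (an id and a numeric value) and the tab delimiter are illustrative assumptions, not a fixed schema:

```python
import io

def process(stream, delimiter="\t"):
    # Read delimited rows from a stream (sys.stdin under STO), split each
    # line into column values, and print one formatted output row per
    # input row using the same delimiter.
    for line in stream:
        if not line.strip():
            continue
        row_id, value = line.rstrip("\n").split(delimiter)
        print(row_id, float(value) * 2, sep=delimiter)

# In Vantage the stream would be sys.stdin; here a StringIO stands in for it.
process(io.StringIO("1\t2.5\n2\t4.0\n"))
```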
- The Python function being applied to the row or group of rows using these two methods must be defined in the current Python session.
Any modules or packages the function uses must be available to the Script Table Operator on the Vantage servers. Likewise, if the function itself is imported from a package or module, that package or module must also be available on the Vantage servers.
- The versions of any modules or packages used by the applied function must be the same as, or compatible with, the versions of those packages installed on the Vantage servers.
- Both methods use dill to serialize the function to be applied into a script, and dill is sensitive to version differences.
Users must make sure that the version of dill on the client machine is the same as the version installed on the Vantage servers.
Functions, Inputs, and Outputs
Both methods return a teradataml DataFrame when run on the Advanced SQL Engine (exec_mode = 'IN-DB').
Inputs:
Both methods accept user-defined functions, written either as regular Python functions or as lambda expressions. The input the function receives depends on the method:
- When the method called is map_row(): the function receives a pandas Series object corresponding to a row in the DataFrame.
That is, a row of the pandas DataFrame corresponding to the teradataml DataFrame the function is applied to.
- When the method called is map_partition(): the function receives an iterator (a TextFileReader object) that reads the partition of rows in chunks, where each chunk is a pandas DataFrame.
That is, an iterator over the partition of rows from the teradataml DataFrame, represented as pandas DataFrames.
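The two input shapes can be illustrated locally with pandas. The DataFrame, column names, and functions below are illustrative, with pd.read_csv(..., chunksize=...) standing in for the TextFileReader that the function receives under map_partition():

```python
import io
import pandas as pd

# map_row(): the user function receives one row as a pandas Series.
def double_value(row: pd.Series) -> pd.Series:
    # 'id' and 'value' are illustrative column names.
    return pd.Series([row["id"], row["value"] * 2])

# map_partition(): the user function receives an iterator that yields the
# partition's rows in chunks, each chunk a pandas DataFrame.
def partition_mean(reader) -> pd.Series:
    part = pd.concat(chunk for chunk in reader)
    return pd.Series([part["id"].iloc[0], part["value"].mean()])

# Local illustration of the two input shapes:
df = pd.DataFrame({"id": [1, 1, 2], "value": [10.0, 20.0, 30.0]})
row_result = double_value(df.iloc[0])            # Series in, Series out

csv_data = io.StringIO(df.to_csv(index=False))
reader = pd.read_csv(csv_data, chunksize=2)      # a TextFileReader
part_result = partition_mean(reader)
```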
Outputs:
The functions can either print the output directly to the standard output (just like STO) or return objects of supported types that can be printed to the standard output correctly.
- pandas DataFrame: having the same number of columns as expected in the output.
- pandas Series: represents a row in the output, having the same number of columns as expected in the output.
- numpy ndarray:
- One-dimensional array: represents a row in the output, having the same number of columns as expected in the output.
- Two-dimensional array: represents a dataset (like a pandas DataFrame); every inner array must have the same number of columns as expected in the output.
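As a sketch, each of the supported return types might look as follows for an output schema of two columns; the column names and data are illustrative:

```python
import numpy as np
import pandas as pd

def as_dataframe(part: pd.DataFrame) -> pd.DataFrame:
    # Whole dataset: one DataFrame with the expected columns.
    return part[["id", "value"]]

def as_series(row: pd.Series) -> pd.Series:
    # One output row as a Series.
    return pd.Series([row["id"], row["value"]])

def as_ndarray_1d(row: pd.Series) -> np.ndarray:
    # One output row as a one-dimensional array.
    return np.array([row["id"], row["value"]])

def as_ndarray_2d(part: pd.DataFrame) -> np.ndarray:
    # Whole dataset as a two-dimensional array:
    # each inner array is one output row.
    return part[["id", "value"]].to_numpy()

data = pd.DataFrame({"id": [1, 2], "value": [3.0, 4.0]})
```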
If the user function prints the output directly to the standard output (instead of returning an object of a supported type), then it must use the specified delimiter and quotechar, if any, to format the printed output.
The data printed to the standard output is then converted to and saved in a table on the Advanced SQL Engine. The table is deleted as part of garbage collection when remove_context() is called. To persist the results, use the DataFrame.to_sql() method.
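When printing directly, Python's standard csv module is one way to honor the delimiter and quotechar. A sketch, with the delimiter choice and sample data as illustrative assumptions:

```python
import csv
import sys

def emit_rows(rows, delimiter=",", quotechar='"'):
    # Print output rows using the delimiter and quotechar expected by the
    # method call, so the printed data parses back into the correct columns.
    writer = csv.writer(sys.stdout, delimiter=delimiter, quotechar=quotechar,
                        quoting=csv.QUOTE_MINIMAL)
    for row in rows:
        writer.writerow(row)

# A value containing the delimiter is quoted automatically.
emit_rows([[1, "plain"], [2, "needs, quoting"]])
```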
Testing Mode
Users can test scripts in the local client environment by setting the execution mode to local (exec_mode = 'local').
Both map_row() and map_partition() return a pandas DataFrame when exec_mode is set to 'local'.
The sample data used to test the scripts should contain at most the number of rows specified by num_rows.
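Conceptually, local mode behaves like the following client-side simulation; this is a sketch of the semantics, not teradataml's actual implementation, and the data, function, and num_rows value are illustrative:

```python
import pandas as pd

def simulate_local_map_row(sample_df: pd.DataFrame, func, num_rows: int) -> pd.DataFrame:
    # Sketch of exec_mode='local': apply func row by row to at most
    # num_rows rows of the sample data and return a pandas DataFrame.
    subset = sample_df.head(num_rows)
    return subset.apply(func, axis=1)

sample = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})
result = simulate_local_map_row(
    sample,
    lambda row: pd.Series([row["id"], row["value"] * 2]),
    num_rows=2,
)
```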