Use the map_row() and map_partition() methods to apply a Python function to each row or group of rows in a DataFrame, and return the result to the user as a teradataml DataFrame.
Python’s built-in map function applies a function to each item in a list. The map_row() and map_partition() methods are similar, applying a function to each row or to each group (partition) of rows of a DataFrame, respectively.
To run Python code in-database, the Advanced SQL Engine offers the Script Table Operator (STO). Both the map_row() and map_partition() methods leverage STO through the Script object in teradataml. Running Python code through STO involves the following steps:
- Write the Python code to process the rows in a script.
- Make sure that the script parses the input data from the input stream so that the values in the individual columns can be accessed.
- Install this script in Vantage.
- Run the STO query to invoke this script on the required dataset on Advanced SQL Engine.
- (Optional) Remove the installed script.
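The heart of that workflow is a script that reads delimited rows from standard input and prints delimited output rows. A minimal sketch of such a row-processing script follows; the two-column layout (an id and a numeric value) and the tab delimiter are illustrative assumptions, not a fixed schema:

```python
import io

def process(stream, delimiter="\t"):
    # Read delimited rows from a stream (sys.stdin under STO), split each
    # line into column values, and print one formatted output row per
    # input row using the same delimiter.
    for line in stream:
        if not line.strip():
            continue
        row_id, value = line.rstrip("\n").split(delimiter)
        print(row_id, float(value) * 2, sep=delimiter)

# In Vantage the stream would be sys.stdin; here a StringIO stands in for it.
process(io.StringIO("1\t2.5\n2\t4.0\n"))
```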
- The Python function being applied to the row or group of rows using these two methods must be defined in the current Python session.
Any modules or packages the function uses must be available to the Script Table Operator on the Vantage servers. Likewise, if the function itself is imported from a package or module, that package or module must also be available on the Vantage servers.
- The versions of any modules or packages used by the applied function must be the same as, or compatible with, the versions of those packages installed on the Vantage servers.
- Both methods use dill to serialize the function to be applied into a script, and dill is sensitive to version differences.
Users must make sure that the version of dill on the client machine is the same as the version installed on the Vantage servers.
Functions, Inputs, and Outputs
Both methods return a teradataml DataFrame when run on the Advanced SQL Engine (exec_mode = 'IN-DB').
Inputs:
Both methods accept user-defined functions, written either as regular Python functions or as lambda expressions. The input the function receives depends on the method:
- When the method called is map_row(): the function receives a pandas Series object corresponding to a row in the DataFrame.
That is, a row of the pandas DataFrame corresponding to the teradataml DataFrame the function is applied to.
- When the method called is map_partition(): the function receives an iterator (a TextFileReader object) that reads the partition of rows in chunks, where each chunk is a pandas DataFrame.
That is, an iterator over the partition of rows from the teradataml DataFrame, represented as pandas DataFrames.
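The two input shapes can be illustrated locally with pandas. The DataFrame, column names, and functions below are illustrative, with pd.read_csv(..., chunksize=...) standing in for the TextFileReader that the function receives under map_partition():

```python
import io
import pandas as pd

# map_row(): the user function receives one row as a pandas Series.
def double_value(row: pd.Series) -> pd.Series:
    # 'id' and 'value' are illustrative column names.
    return pd.Series([row["id"], row["value"] * 2])

# map_partition(): the user function receives an iterator that yields the
# partition's rows in chunks, each chunk a pandas DataFrame.
def partition_mean(reader) -> pd.Series:
    part = pd.concat(chunk for chunk in reader)
    return pd.Series([part["id"].iloc[0], part["value"].mean()])

# Local illustration of the two input shapes:
df = pd.DataFrame({"id": [1, 1, 2], "value": [10.0, 20.0, 30.0]})
row_result = double_value(df.iloc[0])            # Series in, Series out

csv_data = io.StringIO(df.to_csv(index=False))
reader = pd.read_csv(csv_data, chunksize=2)      # a TextFileReader
part_result = partition_mean(reader)
```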
Outputs:
The functions can either print the output directly to the standard output (just like STO) or return objects of supported types that can be printed to the standard output correctly.
- pandas DataFrame: having the same number of columns as expected in the output.
- pandas Series: represents a row in the output, having the same number of columns as expected in the output.
- numpy ndarray:
- One-dimensional array: represents a row in the output, having the same number of columns as expected in the output.
- Two-dimensional array: represents a dataset (like a pandas DataFrame); every inner array must have the same number of columns as expected in the output.
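As a sketch, each of the supported return types might look as follows for an output schema of two columns; the column names and data are illustrative:

```python
import numpy as np
import pandas as pd

def as_dataframe(part: pd.DataFrame) -> pd.DataFrame:
    # Whole dataset: one DataFrame with the expected columns.
    return part[["id", "value"]]

def as_series(row: pd.Series) -> pd.Series:
    # One output row as a Series.
    return pd.Series([row["id"], row["value"]])

def as_ndarray_1d(row: pd.Series) -> np.ndarray:
    # One output row as a one-dimensional array.
    return np.array([row["id"], row["value"]])

def as_ndarray_2d(part: pd.DataFrame) -> np.ndarray:
    # Whole dataset as a two-dimensional array:
    # each inner array is one output row.
    return part[["id", "value"]].to_numpy()

data = pd.DataFrame({"id": [1, 2], "value": [3.0, 4.0]})
```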
If the user function prints the output directly to the standard output (instead of returning an object of a supported type), then it must use the specified delimiter and quotechar, if any, to format the printed output.
The data printed to the standard output is then converted to and saved in a table on the Advanced SQL Engine. The table is deleted as part of garbage collection when remove_context() is called. To persist the results, use the DataFrame.to_sql() method.
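When printing directly, Python's standard csv module is one way to honor the delimiter and quotechar. A sketch, with the delimiter choice and sample data as illustrative assumptions:

```python
import csv
import sys

def emit_rows(rows, delimiter=",", quotechar='"'):
    # Print output rows using the delimiter and quotechar expected by the
    # method call, so the printed data parses back into the correct columns.
    writer = csv.writer(sys.stdout, delimiter=delimiter, quotechar=quotechar,
                        quoting=csv.QUOTE_MINIMAL)
    for row in rows:
        writer.writerow(row)

# A value containing the delimiter is quoted automatically.
emit_rows([[1, "plain"], [2, "needs, quoting"]])
```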
Testing Mode
Users can test scripts in the local client environment by setting the execution mode to local (exec_mode = 'local').
Both map_row() and map_partition() return a pandas DataFrame when exec_mode is set to 'local'.
The sample data used to test the scripts should contain at most the number of rows specified by num_rows.
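Conceptually, local mode behaves like the following client-side simulation; this is a sketch of the semantics, not teradataml's actual implementation, and the data, function, and num_rows value are illustrative:

```python
import pandas as pd

def simulate_local_map_row(sample_df: pd.DataFrame, func, num_rows: int) -> pd.DataFrame:
    # Sketch of exec_mode='local': apply func row by row to at most
    # num_rows rows of the sample data and return a pandas DataFrame.
    subset = sample_df.head(num_rows)
    return subset.apply(func, axis=1)

sample = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})
result = simulate_local_map_row(
    sample,
    lambda row: pd.Series([row["id"], row["value"] * 2]),
    num_rows=2,
)
```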