Use the map_partition() method to apply a function to a group or partition of rows in a teradataml DataFrame and return a teradataml DataFrame.
- user_function: Specifies the user defined function to apply to each group or partition of rows in the teradataml DataFrame.
This can be either a lambda function, a regular Python function, or an object of functools.partial.
A non-lambda function can be passed only when the user defined function does not accept any arguments other than the mandatory input - the iterator on the partition of rows.
functools.partial and lambda functions can also be used for the same purpose, in these cases:
- For a lambda function, when positional arguments, keyword arguments, or both must be passed.
- For functools.partial, when only keyword arguments must be passed.
See the "Functions, Inputs and Outputs" section in map_row() and map_partition() Methods for details about the input and output of this argument.
- exec_mode: Specifies the mode of execution for the user defined function. Permitted values:
- IN-DB: Execute the function on data in the teradataml DataFrame in Vantage.
This is the default value.
- LOCAL: Execute the function locally on sample data (at most num_rows rows) from the teradataml DataFrame.
- chunk_size: Specifies the number of rows to be read in a chunk in each iteration using an iterator to apply the user defined function to each row in the chunk.
Varying the value passed to this argument affects the performance and the memory utilization. Default value is 1000.
- num_rows: Specifies the maximum number of sample rows from the teradataml DataFrame to which the user defined function is applied when exec_mode is 'LOCAL'.
- data_partition_column: Specifies the Partition By columns for the teradataml DataFrame.
Values to this argument can be provided as a list, if multiple columns are used for partition.
- data_hash_column: Specifies the column to be used for hashing.
The rows in the teradataml DataFrame are redistributed to AMPs based on the hash value of the column specified. The user_function then runs once on each AMP.
If there is no data_partition_column, then the entire result set, delivered by the function, constitutes a single group or partition.
- data_partition_column cannot be specified along with data_hash_column.
- data_partition_column cannot be specified when is_local_order is set to 'True'.
- is_local_order must be set to 'True' when data_order_column is used with data_hash_column.
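The three restrictions above can be summarized as a small check. This is a minimal sketch only: check_partition_args is a hypothetical helper, not part of teradataml, and it assumes that an unset column is None and that is_local_order defaults to False.

```python
def check_partition_args(data_partition_column=None,
                         data_hash_column=None,
                         data_order_column=None,
                         is_local_order=False):
    """Hypothetical helper: return a list of rule violations (empty if none)."""
    problems = []
    # Rule 1: partition and hash columns are mutually exclusive.
    if data_partition_column is not None and data_hash_column is not None:
        problems.append(
            "data_partition_column cannot be specified along with data_hash_column")
    # Rule 2: partitioning is not allowed with local ordering.
    if data_partition_column is not None and is_local_order:
        problems.append(
            "data_partition_column cannot be specified when is_local_order is True")
    # Rule 3: ordering together with hashing requires local ordering.
    if (data_order_column is not None and data_hash_column is not None
            and not is_local_order):
        problems.append(
            "is_local_order must be True when data_order_column is used "
            "with data_hash_column")
    return problems


print(check_partition_args(data_partition_column="admitted"))
print(check_partition_args(data_partition_column="admitted",
                           data_hash_column="id"))
```

The first call reports no violations; the second reports the conflict between data_partition_column and data_hash_column.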
The method also accepts the same arguments that Script accepts, with these exceptions: returns is optional, the method does not accept data, and it accepts exactly one of data_hash_column and data_partition_column. When returns is not provided, the method assumes that the function's output has columns with the same names and types as the input teradataml DataFrame.
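The following sketch illustrates how extra arguments can be bound to a user defined function with functools.partial (keyword arguments only) or a lambda (positional and keyword arguments), as described for user_function. It runs without Vantage. The names scaled_avg and sample_rows are hypothetical, and the iterator is a plain-Python stand-in made of lists of dicts, whereas map_partition() passes a TextFileReader object.

```python
from functools import partial


def scaled_avg(rows, scale=1.0, column="gpa"):
    """Average `column` over all chunks in `rows`, multiplied by `scale`."""
    total, count = 0.0, 0
    for chunk in rows:        # each chunk is a list of row dicts in this stand-in
        for row in chunk:
            total += row[column]
            count += 1
    return scale * total / count if count else None


def sample_rows():
    """Stand-in for the partition iterator: two chunks of rows."""
    return iter([[{"gpa": 3.0}, {"gpa": 4.0}], [{"gpa": 2.0}]])


# functools.partial: bind keyword arguments only.
via_partial = partial(scaled_avg, scale=2.0)

# lambda: bind positional arguments, keyword arguments, or both.
via_lambda = lambda rows: scaled_avg(rows, 2.0, column="gpa")

print(via_partial(sample_rows()))
print(via_lambda(sample_rows()))
```

Both callables take only the iterator, which is the shape user_function expects; either could then be passed in place of a regular function.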
Example Prerequisite
The examples use the 'admissions_train' dataset and calculate the average 'gpa' per partition based on the value in the 'admitted' column.
- Load the example data.
>>> load_example_data("dataframe", "admissions_train")
- Create a DataFrame.
>>> df = DataFrame('admissions_train')
>>> print(df)
   masters   gpa     stats programming  admitted
id
5       no  3.44    Novice      Novice         0
34     yes  3.85  Advanced    Beginner         0
13      no  4.00  Advanced      Novice         1
40     yes  3.95    Novice    Beginner         0
22     yes  3.46    Novice    Beginner         0
19     yes  1.98  Advanced    Advanced         0
36      no  3.00  Advanced      Novice         0
15     yes  4.00  Advanced    Advanced         1
7      yes  2.33    Novice      Novice         1
17      no  3.83  Advanced    Advanced         1
Example 1: Create a user defined function to calculate the average 'gpa', by reading data in chunks
In this example, the function accepts a TextFileReader object to iterate on data in chunks. The return type of the function is a numpy ndarray.
- Load the module.
>>> from numpy import asarray
- Create a user defined function.
>>> def grouped_gpa_avg_iter(rows):
...     admitted = None
...     row_count = 0
...     gpa = 0
...     for chunk in rows:
...         for _, row in chunk.iterrows():
...             row_count += 1
...             gpa += row['gpa']
...             if admitted is None:
...                 admitted = row['admitted']
...     if row_count > 0:
...         return asarray([admitted, gpa/row_count])
- Apply the user defined function to the DataFrame.
>>> from collections import OrderedDict
>>> from teradatasqlalchemy.types import INTEGER, FLOAT
>>> avg_gpa_by_admitted = df.map_partition(grouped_gpa_avg_iter, returns = OrderedDict([('admitted', INTEGER()), ('avg_gpa', FLOAT())]), data_partition_column = 'admitted')
- Print the result.
>>> print(avg_gpa_by_admitted)
           avg_gpa
admitted
1         3.533462
0         3.557143
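The chunked accumulation used in Example 1 can be tried locally without Vantage. This is a rough sketch under stated assumptions: iter_chunks and grouped_gpa_avg_local are hypothetical names, the chunks are plain lists of (admitted, gpa) tuples rather than the pandas chunks a TextFileReader yields, and the partition holds only the admitted == 1 rows visible in the printed sample, so the average differs from the full-table result.

```python
from itertools import islice


def iter_chunks(rows, chunk_size=1000):
    """Yield lists of at most chunk_size rows (stand-in for a chunked reader)."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def grouped_gpa_avg_local(chunks):
    """Same accumulation as grouped_gpa_avg_iter, over (admitted, gpa) tuples."""
    admitted, row_count, gpa = None, 0, 0.0
    for chunk in chunks:
        for row_admitted, row_gpa in chunk:
            row_count += 1
            gpa += row_gpa
            if admitted is None:
                admitted = row_admitted
    if row_count > 0:
        return [admitted, gpa / row_count]


# (admitted, gpa) for the admitted == 1 rows shown in the printed sample.
partition = [(1, 4.00), (1, 4.00), (1, 2.33), (1, 3.83)]

result = grouped_gpa_avg_local(iter_chunks(partition, chunk_size=3))
print(result)
```

Only one chunk is held in memory at a time, which is the trade-off that chunk_size controls: smaller chunks use less memory per iteration but need more iterations.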
Example 2: Create a user defined function to calculate the average 'gpa', by reading data into a pandas DataFrame
In this example, the function accepts a TextFileReader object and reads all the data at once into a pandas DataFrame by calling read(). The return type of the function is a pandas Series.
- Create a user defined function.
>>> def grouped_gpa_avg(rows):
...     pdf = rows.read()
...     if pdf.shape[0] > 0:
...         return pdf[['admitted', 'gpa']].mean()
- Apply the user defined function to the DataFrame.
>>> avg_gpa_pdf = df.map_partition(grouped_gpa_avg, returns = OrderedDict([('admitted', INTEGER()),('avg_gpa', FLOAT())]), data_partition_column = 'admitted')
- Print the result.
>>> print(avg_gpa_pdf)
           avg_gpa
admitted
1         3.533462
0         3.557143
Example 3: Use a lambda function to achieve the same result
In this example, the lambda function accepts an iterator (TextFileReader object) and returns the result, which is of type pandas Series.
- Apply the user defined function with a lambda notation.
>>> avg_gpa_pdf_lambda = df.map_partition(lambda rows: grouped_gpa_avg(rows), returns = OrderedDict([('admitted', INTEGER()), ('avg_gpa', FLOAT())]), data_partition_column = 'admitted')
- Print the result.
>>> print(avg_gpa_pdf_lambda)
           avg_gpa
admitted
0         3.557143
1         3.533462
Example 4: Use a function that returns the input data
In this example, the function accepts an iterator (TextFileReader object) and returns the result which is of type Pandas DataFrame.
- Create a user defined function.
>>> def echo(rows):
...     pdf = rows.read()
...     if pdf is not None:
...         return pdf
- Apply the user defined function.
>>> echo_out = df.map_partition(echo, data_partition_column = 'admitted')
- Print the result.
>>> print(echo_out)
   masters   gpa     stats programming  admitted
id
15     yes  4.00  Advanced    Advanced         1
7      yes  2.33    Novice      Novice         1
22     yes  3.46    Novice    Beginner         0
17      no  3.83  Advanced    Advanced         1
13      no  4.00  Advanced      Novice         1
38     yes  2.65  Advanced    Beginner         1
26     yes  3.57  Advanced    Advanced         1
5       no  3.44    Novice      Novice         0
34     yes  3.85  Advanced    Beginner         0
40     yes  3.95    Novice    Beginner         0