map_partition() Method | Teradata Package for Python - 17.00

Teradata® Package for Python User Guide

Product
Teradata Package for Python
Release Number
17.00
Release Date
November 2021
Content Type
User Guide
Publication ID
B700-4006-070K
Language
English (United States)

Use the map_partition() method to apply a function to a group or partition of rows in a teradataml DataFrame and return a teradataml DataFrame.

Required arguments:
  • user_function: Specifies the user defined function to apply to each group or partition of rows in the teradataml DataFrame.

    This can be either a lambda function, a regular Python function, or an object of functools.partial.

    A non-lambda function can be passed only when the user defined function does not accept any arguments other than the mandatory input: the iterator over the partition of rows.

    Use a lambda function or functools.partial when additional arguments must be passed:
    • A lambda function can bind positional arguments, keyword arguments, or both.
    • functools.partial can bind keyword arguments only.

    See the "Functions, Inputs and Outputs" section in map_row() and map_partition() Methods for details about the input and output of this argument.
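
As a plain-Python sketch of these calling conventions (independent of teradataml; the function and argument names here are illustrative), wrapping a multi-argument function with a lambda or with functools.partial reduces it to a function of the iterator alone, which is the shape map_partition() expects:

```python
from functools import partial

# Hypothetical user defined function that needs an extra keyword
# argument besides the mandatory rows iterator.
def scale_gpa(rows, factor=1.0):
    return [row * factor for row in rows]

# A lambda can bind positional arguments, keyword arguments, or both ...
scaled_by_lambda = lambda rows: scale_gpa(rows, factor=2.0)

# ... while functools.partial can bind keyword arguments only.
scaled_by_partial = partial(scale_gpa, factor=2.0)

# Both wrappers now accept just the iterator.
print(scaled_by_lambda(iter([1.0, 2.0])))   # [2.0, 4.0]
print(scaled_by_partial(iter([1.0, 2.0])))  # [2.0, 4.0]
```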

Optional arguments:
  • exec_mode: Specifies the mode of execution for the user defined function.
    Permitted values:
    • IN-DB: Execute the function on data in the teradataml DataFrame in Vantage.

      This is the default value.

    • LOCAL: Execute the function locally on sample data (at most num_rows rows) from the teradataml DataFrame.
  • chunk_size: Specifies the number of rows to be read in a chunk in each iteration using an iterator to apply the user defined function to each row in the chunk.

    Varying the value passed to this argument affects the performance and the memory utilization. Default value is 1000.

  • num_rows: Specifies the maximum number of sample rows to use from the teradataml DataFrame to apply the user defined function to when exec_mode is 'LOCAL'.
  • data_partition_column: Specifies the Partition By columns for the teradataml DataFrame.

    Values to this argument can be provided as a list, if multiple columns are used for partition.

  • data_hash_column: Specifies the column to be used for hashing.

    The rows in the teradataml DataFrame are redistributed to AMPs based on the hash value of the specified column. The user_function then runs once on each AMP.

    If data_partition_column is not specified, the entire result set delivered by the function constitutes a single group or partition.

    Notes:
    • data_hash_column cannot be specified along with data_partition_column.
    • data_partition_column cannot be specified when is_local_order is set to 'True'.
    • is_local_order must be set to 'True' when data_order_column is used with data_hash_column.
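
To illustrate locally what the user defined function receives under a given chunk_size, the chunked iterator can be simulated with pandas' chunked CSV reader (a local sketch only; in Vantage the iterator is constructed by teradataml, and the sample data here is made up):

```python
import io
import pandas as pd

# Simulated partition data: three rows of the 'admitted'/'gpa' columns.
csv_data = "admitted,gpa\n0,3.44\n0,3.85\n0,3.95\n"

# chunksize=2 means each iteration yields a pandas DataFrame of at
# most 2 rows, mirroring the effect of the chunk_size argument.
reader = pd.read_csv(io.StringIO(csv_data), chunksize=2)

for chunk in reader:
    print(len(chunk), list(chunk['gpa']))
# 2 [3.44, 3.85]
# 1 [3.95]
```

A larger chunk_size reads fewer, bigger chunks (less iteration overhead, more memory per chunk); a smaller value does the opposite.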

The method also accepts the same arguments that Script accepts, with these exceptions: returns is optional, the method does not accept data, and exactly one of data_hash_column and data_partition_column must be specified. When returns is not provided, the method assumes that the function's output has columns with the same names and types as the input teradataml DataFrame.

Example Prerequisite

The examples use the 'admissions_train' dataset and calculate the average 'gpa' per partition, based on the value in the 'admitted' column.

  • Load the example data.
    >>> load_example_data("dataframe", "admissions_train")
  • Create a DataFrame.
    >>> df = DataFrame('admissions_train')
    >>> print(df)
       masters   gpa     stats programming  admitted
    id                                             
    5       no  3.44    Novice      Novice         0
    34     yes  3.85  Advanced    Beginner         0
    13      no  4.00  Advanced      Novice         1
    40     yes  3.95    Novice    Beginner         0
    22     yes  3.46    Novice    Beginner         0
    19     yes  1.98  Advanced    Advanced         0
    36      no  3.00  Advanced      Novice         0
    15     yes  4.00  Advanced    Advanced         1
    7      yes  2.33    Novice      Novice         1
    17      no  3.83  Advanced    Advanced         1

Example 1: Create a user defined function to calculate the average 'gpa', by reading data in chunks

In this example, the function accepts a TextFileReader object to iterate on data in chunks. The return type of the function is a numpy ndarray.

  1. Load the module.
    >>> from numpy import asarray
  2. Create a user defined function.
    >>> def grouped_gpa_avg_iter(rows):
            admitted = None
            row_count = 0
            gpa = 0
     
            for chunk in rows:
                for _, row in chunk.iterrows():
                    row_count += 1
                    gpa += row['gpa']
                    if admitted is None:
                        admitted = row['admitted']
     
            if row_count > 0:
                return asarray([admitted, gpa/row_count])
  3. Apply the user defined function to the DataFrame.
    >>> from collections import OrderedDict
    >>> from teradatasqlalchemy.types import INTEGER, FLOAT
    
    >>> avg_gpa_by_admitted = df.map_partition(grouped_gpa_avg_iter,
                                               returns = OrderedDict([('admitted', INTEGER()),
                                                                      ('avg_gpa', FLOAT())]),
                                               data_partition_column = 'admitted')
  4. Print the result.
    >>> print(avg_gpa_by_admitted)
               avg_gpa
    admitted        
    1         3.533462
    0         3.557143
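
Before running the function in Vantage, it can be exercised locally by feeding it a chunked reader built from sample data (a local sketch; in-DB the iterator is supplied by teradataml, and the two sample rows here are made up):

```python
import io
from numpy import asarray
import pandas as pd

# Same user defined function as in Example 1.
def grouped_gpa_avg_iter(rows):
    admitted = None
    row_count = 0
    gpa = 0
    for chunk in rows:
        for _, row in chunk.iterrows():
            row_count += 1
            gpa += row['gpa']
            if admitted is None:
                admitted = row['admitted']
    if row_count > 0:
        return asarray([admitted, gpa/row_count])

# Simulate one partition (admitted == 0) as a chunked reader.
csv_data = "admitted,gpa\n0,3.0\n0,4.0\n"
rows = pd.read_csv(io.StringIO(csv_data), chunksize=1)
result = grouped_gpa_avg_iter(rows)
# result[0] is the partition key (0), result[1] the average gpa (3.5).
```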

Example 2: Create a user defined function to calculate the average 'gpa', by reading data into a pandas DataFrame

In this example, the data is read all at once into a pandas DataFrame using the read() method of the TextFileReader object passed to the function. The return type of the function is a pandas Series.

  1. Create a user defined function.
    >>> def grouped_gpa_avg(rows):
           pdf = rows.read()
           if pdf.shape[0] > 0:
               return pdf[['admitted', 'gpa']].mean()
  2. Apply the user defined function to the DataFrame.
    >>> avg_gpa_pdf = df.map_partition(grouped_gpa_avg,
                                       returns = OrderedDict([('admitted', INTEGER()),('avg_gpa', FLOAT())]),
                                       data_partition_column = 'admitted')
  3. Print the result.
    >>> print(avg_gpa_pdf)
               avg_gpa
    admitted         
    1         3.533462
    0         3.557143
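
This read()-based variant can likewise be checked locally, since pandas' chunked reader also exposes a read() method that returns the remaining rows as a single DataFrame (a local sketch; the two sample rows are made up):

```python
import io
import pandas as pd

# Same user defined function as in Example 2.
def grouped_gpa_avg(rows):
    pdf = rows.read()
    if pdf.shape[0] > 0:
        return pdf[['admitted', 'gpa']].mean()

# Simulate one partition (admitted == 1) as a chunked reader.
csv_data = "admitted,gpa\n1,2.0\n1,4.0\n"
rows = pd.read_csv(io.StringIO(csv_data), chunksize=1)
result = grouped_gpa_avg(rows)
# result is a pandas Series: admitted == 1.0, gpa == 3.0 (the average).
```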

Example 3: Use a lambda function to achieve the same result

In this example, the function accepts an iterator (TextFileReader object) and returns a result of type pandas Series.

  1. Apply the user defined function with a lambda notation.
    >>> avg_gpa_pdf_lambda = df.map_partition(lambda rows: grouped_gpa_avg(rows),
                                             returns = OrderedDict([('admitted', INTEGER()),
                                                                    ('avg_gpa', FLOAT())]),
                                             data_partition_column = 'admitted')
  2. Print the result.
    >>> print(avg_gpa_pdf_lambda)
               avg_gpa
    admitted         
    0         3.557143
    1         3.533462

Example 4: Use a function that returns the input data

In this example, the function accepts an iterator (TextFileReader object) and returns a result of type pandas DataFrame.

  1. Create a user defined function.
    >>> def echo(rows):
            pdf = rows.read()
            if pdf is not None:
                return pdf
  2. Apply the user defined function.
    >>> echo_out = df.map_partition(echo, data_partition_column = 'admitted')
  3. Print the result.
    >>> print(echo_out)
       masters   gpa     stats programming  admitted
    id                                             
    15     yes  4.00  Advanced    Advanced         1
    7      yes  2.33    Novice      Novice         1
    22     yes  3.46    Novice    Beginner         0
    17      no  3.83  Advanced    Advanced         1
    13      no  4.00  Advanced      Novice         1
    38     yes  2.65  Advanced    Beginner         1
    26     yes  3.57  Advanced    Advanced         1
    5       no  3.44    Novice      Novice         0
    34     yes  3.85  Advanced    Beginner         0
    40     yes  3.95    Novice    Beginner         0