map_partition() Method | Teradata Python Package - map_partition() Method - Teradata Vantage

Teradata® VantageCloud Lake

Deployment
VantageCloud
Edition
Lake
Product
Teradata Vantage
Published
January 2023
Language
English (United States)
Last Update
2024-04-03

Use the map_partition() method to apply a function to a group or partition of rows in a teradataml DataFrame and return a teradataml DataFrame.

Required arguments:
  • user_function: Specifies the user defined function to apply to each group or partition of rows in the teradataml DataFrame.

    This can be either a lambda function, a regular Python function, or an object of functools.partial.

    A non-lambda function can be passed only when the user defined function does not accept any arguments other than the mandatory input - the iterator on the partition of rows.

    Use a lambda function or functools.partial when the user defined function needs additional arguments:
    • Use a lambda function to pass positional arguments, keyword arguments, or both.
    • Use functools.partial to pass keyword arguments only.

    See the "Functions, Inputs and Outputs" section in map_row() and map_partition() Methods for details about the input and output of this argument.
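The difference between the three ways of passing user_function can be sketched in plain Python, without a Vantage connection. In this sketch, scaled_total and apply_to_partition are hypothetical names used only for illustration, and a simple list stands in for the partition iterator. It shows why functools.partial is limited to keyword arguments: a positional argument bound by partial would occupy the first parameter, which must remain free for the iterator.

```python
from functools import partial


def scaled_total(rows, scale=1.0, offset=0.0):
    """Hypothetical user function; 'rows' is the mandatory iterator argument."""
    return sum(rows) * scale + offset


# A lambda can forward positional arguments, keyword arguments, or both.
via_lambda = lambda rows: scaled_total(rows, 2.0, offset=1.0)

# partial binds keyword arguments only, so the iterator stays the first positional argument.
via_partial = partial(scaled_total, scale=2.0, offset=1.0)


def apply_to_partition(user_function, partition):
    """Stand-in for the caller (an assumption): it passes exactly one argument, the iterator."""
    return user_function(iter(partition))


plain_result = apply_to_partition(scaled_total, [1, 2, 3])
lambda_result = apply_to_partition(via_lambda, [1, 2, 3])
partial_result = apply_to_partition(via_partial, [1, 2, 3])

print(plain_result)    # 6.0
print(lambda_result)   # 13.0
print(partial_result)  # 13.0
```

The plain function works only because its extra parameters have defaults; when different values are needed, the lambda or partial wrapper supplies them.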

Optional arguments:
  • exec_mode: Specifies the mode of execution for the user defined function.
    Permitted values:
    • IN-DB: Execute the function on data in the teradataml DataFrame in Vantage.

      This is the default value.

    • LOCAL: Execute the function locally on sample data (at most num_rows rows) from the teradataml DataFrame.
  • chunk_size: Specifies the number of rows to be read in a chunk in each iteration using an iterator to apply the user defined function to each row in the chunk.

    Varying the value passed to this argument affects the performance and the memory utilization. Default value is 1000.

  • num_rows: Specifies the maximum number of sample rows to use from the teradataml DataFrame to apply the user defined function to when exec_mode is 'LOCAL'.
  • data_partition_column: Specifies the Partition By columns for the teradataml DataFrame.

    Values to this argument can be provided as a list, if multiple columns are used for partition.

  • data_hash_column: Specifies the column to be used for hashing.

    The rows in the teradataml DataFrame are redistributed to AMPs based on the hash value of the column specified. The user_function then runs once on each AMP.

    If there is no data_partition_column, then the entire result set, delivered by the function, constitutes a single group or partition.

    • returns: Specifies the output column definition corresponding to the output of user_function.

      When not specified, the function assumes that the names and types of the output columns are the same as those of the input.

      Do not use Teradata reserved keywords as column names unless the column names of the output DataFrame exactly match those of the input DataFrame. To list the reserved keywords, or to check whether a string is a reserved keyword, use the list_td_reserved_keywords() function.
    • delimiter: Specifies a delimiter to use when reading columns from a row and writing result columns. The default value is '\t'.
      • This argument cannot be the same as quotechar argument.
      • This argument cannot be newline character '\n'.
    • quotechar: Specifies a character that forces all input and output of the user function to be quoted using this specified character.

      Using this argument enables the Analytics Database to distinguish between NULL fields and empty strings. A string with length zero is quoted, while NULL fields are not.

      If this character is found in the data, it will be escaped by a second quote character.

      • This argument cannot be the same as delimiter argument.
      • This argument cannot be newline character '\n'.
    • auth: Specifies an authorization to use when running the user_function.
    • charset: Specifies the character encoding for data.

      Permitted values are 'utf-16' and 'latin'.

    • data_order_column: Specifies the Order By columns for the teradataml DataFrame.

      Values to this argument can be provided as a list, if multiple columns are used for ordering.

      This argument is used in both cases: "is_local_order = True" and "is_local_order = False".

      is_local_order must be set to 'True' when data_order_column is used with data_hash_column.
    • is_local_order: Specifies a boolean value to determine whether the input data is to be ordered locally or not.

      When this argument is set to 'False' (default), data_order_column specifies the order in which the values in a group, or partition, are sorted.

      When this argument is set to 'True', qualified rows on each AMP are ordered in preparation to be input to a table function.

      This argument is ignored, if data_order_column is None.

      • This argument cannot be specified along with data_partition_column.
      • When this argument is set to True, specify data_order_column, and the columns specified in data_order_column are used for local ordering.
    • nulls_first: Specifies a boolean value to determine whether NULLS are listed first or last during ordering.

      NULLS are listed first when this argument is set to 'True', and last when set to 'False'.

      This argument is ignored, if data_order_column is None.

    • sort_ascending: Specifies a boolean value to determine if the result set is to be sorted on the data_order_column column in ascending or descending order.

      The sorting is ascending when this argument is set to 'True', and descending when set to 'False'.

      This argument is ignored, if data_order_column is None.

  • data_partition_column cannot be specified along with data_hash_column.
  • data_partition_column cannot be specified when is_local_order is set to 'True'.
  • is_local_order must be set to 'True' when data_order_column is used with data_hash_column.
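The three restrictions above can be checked before calling the method. The following is a minimal sketch only: check_partition_args is a hypothetical helper, not part of teradataml, and it encodes just the three rules listed here.

```python
def check_partition_args(data_partition_column=None, data_hash_column=None,
                         data_order_column=None, is_local_order=False):
    """Return the list of violated rules (hypothetical helper, not a teradataml API)."""
    problems = []
    if data_partition_column is not None and data_hash_column is not None:
        problems.append("data_partition_column cannot be specified along with data_hash_column")
    if data_partition_column is not None and is_local_order:
        problems.append("data_partition_column cannot be specified when is_local_order is True")
    if (data_order_column is not None and data_hash_column is not None
            and not is_local_order):
        problems.append("is_local_order must be True when data_order_column "
                        "is used with data_hash_column")
    return problems


# A valid combination: partition by one column and order within each partition.
ok = check_partition_args(data_partition_column="admitted", data_order_column="gpa")

# An invalid combination: hashing with ordering but without local order.
bad = check_partition_args(data_hash_column="id", data_order_column="gpa")

print(ok)
print(bad)
```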
This function returns:
  • teradataml DataFrame if exec_mode is "IN-DB"
  • Pandas DataFrame if exec_mode is "LOCAL".

The method also accepts the same arguments that Script accepts, with these differences: returns is optional, the method does not accept data, and the method accepts only one of data_hash_column and data_partition_column. When returns is not provided, the method assumes that the function's output has columns with the same names and types as the input teradataml DataFrame.
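Among these arguments, delimiter and quotechar control how each row is serialized for the user function. The quoting rules described for quotechar (empty strings are quoted, NULL fields are not, and an embedded quote character is doubled) can be illustrated with a small sketch. The encode_row function is hypothetical and written only for illustration; the exact format that the Analytics Database exchanges with the function is an assumption here.

```python
def encode_row(fields, delimiter="\t", quotechar='"'):
    """Hypothetical encoder illustrating the quoting rules; not teradataml code."""
    if delimiter == quotechar:
        raise ValueError("delimiter and quotechar cannot be the same")
    if "\n" in (delimiter, quotechar):
        raise ValueError("newline cannot be used as delimiter or quotechar")
    encoded = []
    for field in fields:
        if field is None:
            # NULL field: left unquoted, so it stays distinguishable from ''.
            encoded.append("")
        else:
            # Escape an embedded quote character by doubling it, then quote the field.
            text = str(field).replace(quotechar, quotechar * 2)
            encoded.append(f"{quotechar}{text}{quotechar}")
    return delimiter.join(encoded)


line = encode_row(["a", "", None, 'say "hi"'])
print(repr(line))
```

In the encoded line, the empty string appears as a pair of quote characters while the NULL field appears as nothing between two delimiters.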

Example Prerequisite

The examples use the 'admissions_train' dataset and calculate the average 'gpa' per partition, based on the value in the 'admitted' column.

  • Load the example data.
    >>> from teradataml import DataFrame, load_example_data
    >>> load_example_data("dataframe", "admissions_train")
  • Create a DataFrame.
    >>> df = DataFrame('admissions_train')
    >>> print(df)
       masters   gpa     stats programming  admitted
    id                                             
    5       no  3.44    Novice      Novice         0
    34     yes  3.85  Advanced    Beginner         0
    13      no  4.00  Advanced      Novice         1
    40     yes  3.95    Novice    Beginner         0
    22     yes  3.46    Novice    Beginner         0
    19     yes  1.98  Advanced    Advanced         0
    36      no  3.00  Advanced      Novice         0
    15     yes  4.00  Advanced    Advanced         1
    7      yes  2.33    Novice      Novice         1
    17      no  3.83  Advanced    Advanced         1

Example 1: Create a user defined function to calculate the average 'gpa', by reading data in chunks

In this example, the function accepts a TextFileReader object to iterate on data in chunks. The return type of the function is a numpy ndarray.

  1. Load the module.
    >>> from numpy import asarray
  2. Create a user defined function.
    >>> def grouped_gpa_avg_iter(rows):
            admitted = None
            row_count = 0
            gpa = 0
     
            for chunk in rows:
                for _, row in chunk.iterrows():
                    row_count += 1
                    gpa += row['gpa']
                    if admitted is None:
                        admitted = row['admitted']
     
            if row_count > 0:
                return asarray([admitted, gpa/row_count])
  3. Apply the user defined function to the DataFrame.
    >>> from collections import OrderedDict
    >>> from teradatasqlalchemy.types import INTEGER, FLOAT
    
    >>> avg_gpa_by_admitted = df.map_partition(grouped_gpa_avg_iter,
                                               returns = OrderedDict([('admitted', INTEGER()),
                                                                      ('avg_gpa', FLOAT())]),
                                               data_partition_column = 'admitted')
  4. Print the result.
    >>> print(avg_gpa_by_admitted)
               avg_gpa
    admitted        
    1         3.533462
    0         3.557143
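To try the function from Example 1 without a Vantage connection, one option is to mimic each partition locally with a pandas TextFileReader, which pandas.read_csv returns when chunksize is set. This is only a sketch: run_locally and the inline CSV text are hypothetical, the ten rows are taken from the sample output shown in the prerequisite section (not the full table), and the way the real method transfers data is not reproduced.

```python
import io

import numpy as np
import pandas as pd

# Ten rows copied from the sample output above (only the columns the function uses).
SAMPLE_CSV = """id,gpa,admitted
5,3.44,0
34,3.85,0
13,4.00,1
40,3.95,0
22,3.46,0
19,1.98,0
36,3.00,0
15,4.00,1
7,2.33,1
17,3.83,1
"""


def grouped_gpa_avg_iter(rows):
    """Same logic as Example 1: iterate over chunks and average 'gpa'."""
    admitted = None
    row_count = 0
    gpa = 0.0
    for chunk in rows:
        for _, row in chunk.iterrows():
            row_count += 1
            gpa += row["gpa"]
            if admitted is None:
                admitted = row["admitted"]
    if row_count > 0:
        return np.asarray([admitted, gpa / row_count])


def run_locally(csv_text, partition_column, chunk_size=2):
    """Hypothetical local stand-in: one TextFileReader per partition value."""
    full = pd.read_csv(io.StringIO(csv_text))
    results = {}
    for key, group in full.groupby(partition_column):
        partition_text = group.to_csv(index=False)
        reader = pd.read_csv(io.StringIO(partition_text), chunksize=chunk_size)
        results[int(key)] = grouped_gpa_avg_iter(reader)
    return results


local_results = run_locally(SAMPLE_CSV, "admitted")
for key in sorted(local_results):
    print(key, local_results[key])
```

Because only ten sample rows are used, the averages differ from the values printed for the full table.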

Example 2: Create a user defined function to calculate the average 'gpa', by reading data into a pandas DataFrame

In this example, the function accepts a TextFileReader object and reads all of the data in the partition at once into a Pandas DataFrame. The return type of the function is a Pandas Series.

  1. Create a user defined function.
    >>> def grouped_gpa_avg(rows):
           pdf = rows.read()
           if pdf.shape[0] > 0:
               return pdf[['admitted', 'gpa']].mean()
  2. Apply the user defined function to the DataFrame.
    >>> avg_gpa_pdf = df.map_partition(grouped_gpa_avg,
                                       returns = OrderedDict([('admitted', INTEGER()),('avg_gpa', FLOAT())]),
                                       data_partition_column = 'admitted')
  3. Print the result.
    >>> print(avg_gpa_pdf)
               avg_gpa
    admitted         
    1         3.533462
    0         3.557143
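The shape of the value returned by the Example 2 function can be inspected locally. The sketch below assumes that a pandas TextFileReader built from inline CSV text is a reasonable stand-in for the iterator the method passes to the function; the four rows are the admitted = 1 rows from the sample output in the prerequisite section.

```python
import io

import pandas as pd

# Rows with admitted == 1, copied from the sample output shown earlier.
PARTITION_CSV = """id,gpa,admitted
13,4.00,1
15,4.00,1
7,2.33,1
17,3.83,1
"""


def grouped_gpa_avg(rows):
    """Same logic as Example 2: read the whole partition, then average."""
    pdf = rows.read()
    if pdf.shape[0] > 0:
        return pdf[["admitted", "gpa"]].mean()


# Setting chunksize makes read_csv return a TextFileReader instead of a DataFrame.
reader = pd.read_csv(io.StringIO(PARTITION_CSV), chunksize=2)
result = grouped_gpa_avg(reader)

print(type(result).__name__)
print(list(result.index))
```

The result is a Series with one entry per selected column, in the order 'admitted' then 'gpa', which lines up with the two columns declared in returns.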

Example 3: Use a lambda function to achieve the same result

In this example, the lambda function accepts an iterator (TextFileReader object) and returns a result of type Pandas Series.

  1. Apply the user defined function with a lambda notation.
    >>> avg_gpa_pdf_lambda = df.map_partition(lambda rows: grouped_gpa_avg(rows),
                                             returns = OrderedDict([('admitted', INTEGER()),
                                                                    ('avg_gpa', FLOAT())]),
                                             data_partition_column = 'admitted')
  2. Print the result.
    >>> print(avg_gpa_pdf_lambda)
               avg_gpa
    admitted         
    0         3.557143
    1         3.533462

Example 4: Use a function that returns the input data

In this example, the function accepts an iterator (TextFileReader object) and returns the result which is of type Pandas DataFrame.

  1. Create a user defined function.
    >>> def echo(rows):
            pdf = rows.read()
            if pdf is not None:
                return pdf
  2. Apply the user defined function.
    >>> echo_out = df.map_partition(echo, data_partition_column = 'admitted')
  3. Print the result.
    >>> print(echo_out)
       masters   gpa     stats programming  admitted
    id                                             
    15     yes  4.00  Advanced    Advanced         1
    7      yes  2.33    Novice      Novice         1
    22     yes  3.46    Novice    Beginner         0
    17      no  3.83  Advanced    Advanced         1
    13      no  4.00  Advanced      Novice         1
    38     yes  2.65  Advanced    Beginner         1
    26     yes  3.57  Advanced    Advanced         1
    5       no  3.44    Novice      Novice         0
    34     yes  3.85  Advanced    Beginner         0
    40     yes  3.95    Novice    Beginner         0