map_row() Method | Teradata Python Package

Teradata® VantageCloud Lake

Deployment: VantageCloud
Edition: Lake
Product: Teradata Vantage
Published: January 2023
Language: English (United States)
Last Update: 2024-04-03

Use the map_row() method to apply a function to every row in a teradataml DataFrame and return a teradataml DataFrame.

Required arguments:
  • user_function: Specifies the user defined function to apply to each row in the teradataml DataFrame.

    This can be a lambda function, a regular Python function, or a functools.partial object.

    A regular (non-lambda) function can be passed directly only when the user defined function does not accept any arguments other than the mandatory input row.

    To pass additional arguments to the user defined function, use a lambda function or functools.partial:
    • Use a lambda function to pass positional arguments, keyword arguments, or both.
    • Use functools.partial to pass keyword arguments only.

    See the "Functions, Inputs and Outputs" section in map_row() and map_partition() Methods for details about the input and output of this argument.
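
The three ways of supplying user_function can be sketched in plain Python. This is a minimal illustration only: a dict stands in for the input row (an assumption made so the snippet runs without a Vantage connection), and scale_gpa is a hypothetical function. In actual use, each callable would be passed to map_row() rather than called directly.

```python
from functools import partial

# Hypothetical user defined function: the first parameter is the input row,
# 'p' is an additional argument.
def scale_gpa(row, p=20):
    row = dict(row)  # copy so the caller's row is left unchanged
    row['gpa'] = row['gpa'] + row['gpa'] * p / 100
    return row

row = {'id': 13, 'gpa': 4.0}  # stand-in for one input row

# 1. Regular function: only the row is supplied, so 'p' keeps its default.
by_function = scale_gpa(row)

# 2. Lambda: can forward positional arguments, keyword arguments, or both.
as_lambda = lambda r: scale_gpa(r, 40)
by_lambda = as_lambda(row)

# 3. functools.partial: the extra argument is bound by keyword, which leaves
#    the first positional slot free for the row.
as_partial = partial(scale_gpa, p=50)
by_partial = as_partial(row)

print(by_function['gpa'], by_lambda['gpa'], by_partial['gpa'])
```

In plain Python, a positional binding such as partial(scale_gpa, 40) would fill the row parameter itself, which is one way to see why functools.partial is paired with keyword arguments only.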

Optional arguments:
  • exec_mode: Specifies the mode of execution for the user defined function.
    Permitted values:
    • IN-DB: Execute the function on data in the teradataml DataFrame in Vantage.

      This is the default value.

    • LOCAL: Execute the function locally on sample data (at most num_rows rows) from the teradataml DataFrame.
  • chunk_size: Specifies the number of rows to be read in a chunk in each iteration using an iterator to apply the user defined function to each row in the chunk.

    Varying the value passed to this argument affects the performance and the memory utilization. Default value is 1000.

  • num_rows: Specifies the maximum number of sample rows from the teradataml DataFrame to which the user defined function is applied when exec_mode is 'LOCAL'.
  • returns: Specifies the output column definition corresponding to the output of user_function.

    When not specified, the function assumes that the names and types of the output columns are the same as those of the input.

    Do not use Teradata reserved keywords as column names unless the column names of the output DataFrame exactly match those of the input DataFrame. To list the reserved keywords, or to check whether a string is a reserved keyword, use the list_td_reserved_keywords() function.
  • delimiter: Specifies a delimiter to use when reading columns from a row and writing result columns. The default value is '\t'.
    • This argument cannot be the same as quotechar argument.
    • This argument cannot be newline character '\n'.
  • quotechar: Specifies a character that forces all input and output of the user function to be quoted using this specified character.

    Using this argument enables the Analytics Database to distinguish between NULL fields and empty strings. A string with length zero is quoted, while NULL fields are not.

    If this character is found in the data, it will be escaped by a second quote character.

    • This argument cannot be the same as delimiter argument.
    • This argument cannot be newline character '\n'.
  • auth: Specifies an authorization to use when running the user_function.
  • charset: Specifies the character encoding for data.

    Permitted values are 'utf-16' and 'latin'.

  • data_order_column: Specifies the Order By columns for the teradataml DataFrame.

    Values to this argument can be provided as a list, if multiple columns are used for ordering.

    This argument is used in both cases: "is_local_order = True" and "is_local_order = False".

    is_local_order must be set to 'True' when data_order_column is used with data_hash_column.
  • is_local_order: Specifies a boolean value to determine whether the input data is to be ordered locally or not.

    When this argument is set to 'False' (default), data_order_column specifies the order in which the values in a group, or partition, are sorted.

    When this argument is set to 'True', qualified rows on each AMP are ordered in preparation to be input to a table function.

    This argument is ignored, if data_order_column is None.

    • This argument cannot be specified along with data_partition_column.
    • When this argument is set to True, data_order_column must be specified, and the columns specified in data_order_column are used for local ordering.
  • nulls_first: Specifies a boolean value to determine whether NULLS are listed first or last during ordering.

    NULLS are listed first when this argument is set to 'True', and last when set to 'False'.

    This argument is ignored, if data_order_column is None.

  • sort_ascending: Specifies a boolean value to determine if the result set is to be sorted on the data_order_column column in ascending or descending order.

    The sorting is ascending when this argument is set to 'True', and descending when set to 'False'.

    This argument is ignored, if data_order_column is None.
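
The delimiter and quotechar rules listed above can be illustrated with a short sketch. The encode_row helper below is hypothetical and is not part of teradataml; it only mimics the documented rules (NULL fields are left unquoted, zero-length strings are quoted, and an embedded quote character is escaped by doubling it), with None standing in for NULL.

```python
def encode_row(fields, delimiter='\t', quotechar='"'):
    """Join fields into one line, following the quoting rules described above.

    Hypothetical helper for illustration only; None represents a NULL field.
    """
    if delimiter == quotechar:
        raise ValueError("delimiter and quotechar must differ")
    if '\n' in (delimiter, quotechar):
        raise ValueError("newline is not allowed as delimiter or quotechar")

    encoded = []
    for field in fields:
        if field is None:
            # NULL: emitted as an unquoted, empty field.
            encoded.append('')
        else:
            # Non-NULL: escape embedded quote characters by doubling them,
            # then wrap the value in the quote character.
            text = str(field).replace(quotechar, quotechar * 2)
            encoded.append(quotechar + text + quotechar)
    return delimiter.join(encoded)

line = encode_row([5, '', None, 'say "hi"'], delimiter=',')
print(line)  # prints: "5","",,"say ""hi"""
```

The second field (an empty string) and the third field (NULL) remain distinguishable in the encoded line, which is the purpose of quotechar as described above.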

This function returns:
  • teradataml DataFrame, if exec_mode is 'IN-DB'.
  • Pandas DataFrame, if exec_mode is 'LOCAL'.

The method also accepts the same arguments that Script accepts, except that returns is optional and the method does not accept data, data_hash_column, and data_partition_column. When returns is not provided, the method assumes that the function's output has columns with the same names and types as the input teradataml DataFrame.

Example Prerequisite

The examples use the 'admissions_train' dataset and increase the 'gpa' value in each row by a given percentage.

  • Load the example data.
    >>> load_example_data("dataframe", "admissions_train")
  • Create a DataFrame.
    >>> df = DataFrame('admissions_train')
    >>> print(df)
       masters   gpa     stats programming  admitted
    id                                             
    5       no  3.44    Novice      Novice         0
    34     yes  3.85  Advanced    Beginner         0
    13      no  4.00  Advanced      Novice         1
    40     yes  3.95    Novice    Beginner         0
    22     yes  3.46    Novice    Beginner         0
    19     yes  1.98  Advanced    Advanced         0
    36      no  3.00  Advanced      Novice         0
    15     yes  4.00  Advanced    Advanced         1
    7      yes  2.33    Novice      Novice         1
    17      no  3.83  Advanced    Advanced         1

Example 1: Create a user defined function to increase the 'gpa' by the percentage provided

In this example, both the input to and the output from the function are Pandas Series objects.

  1. Create a user defined function.
    >>> def increase_gpa(row, p=20):
    ...     row['gpa'] = row['gpa'] + row['gpa'] * p/100
    ...     return row
  2. Apply the user defined function to the DataFrame.

    Since the user defined function outputs the same columns with the same types as the input, you can skip passing the returns argument.

    >>> increase_gpa_20 = df.map_row(increase_gpa)
  3. Print the result.
    >>> print(increase_gpa_20)
       masters    gpa     stats programming  admitted
    id                                             
    13      no  4.800  Advanced      Novice         1
    36      no  3.600  Advanced      Novice         0
    15     yes  4.800  Advanced    Advanced         1
    40     yes  4.740    Novice    Beginner         0
    22     yes  4.152    Novice    Beginner         0
    38     yes  3.180  Advanced    Beginner         1
    26     yes  4.284  Advanced    Advanced         1
    5       no  4.128    Novice      Novice         0
    7      yes  2.796    Novice      Novice         1
    19     yes  2.376  Advanced    Advanced         0
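
The 'gpa' values printed above follow directly from the formula in increase_gpa. As a quick check that needs no Vantage connection, the sketch below applies the same arithmetic to three of the input values shown in the prerequisite section. A plain dict stands in for the Pandas Series row, which is an assumption made for illustration.

```python
def increase_gpa(row, p=20):
    # Same arithmetic as in step 1: add p percent of 'gpa' to 'gpa'.
    row['gpa'] = row['gpa'] + row['gpa'] * p / 100
    return row

# 'gpa' values for ids 5, 13 and 19 from the input DataFrame.
inputs = {5: 3.44, 13: 4.00, 19: 1.98}

for row_id, gpa in inputs.items():
    result = increase_gpa({'id': row_id, 'gpa': gpa})
    # Round to three decimals to match the precision of the printed output.
    print(row_id, round(result['gpa'], 3))
```

The rounded results (4.128, 4.8 and 2.376) agree with the rows for ids 5, 13 and 19 in the output of step 3.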

Example 2: Use the same user defined function with a lambda notation to pass the percentage 'p = 40'

  1. Apply the user defined function to the DataFrame with a lambda notation.
    >>> increase_gpa_40 = df.map_row(lambda row: increase_gpa(row, p = 40))
  2. Print the result.
    >>> print(increase_gpa_40)
       masters    gpa     stats programming  admitted
    id                                              
    5       no  4.816    Novice      Novice         0
    34     yes  5.390  Advanced    Beginner         0
    13      no  5.600  Advanced      Novice         1
    40     yes  5.530    Novice    Beginner         0
    22     yes  4.844    Novice    Beginner         0
    19     yes  2.772  Advanced    Advanced         0
    36      no  4.200  Advanced      Novice         0
    15     yes  5.600  Advanced    Advanced         1
    7      yes  3.262    Novice      Novice         1
    17      no  5.362  Advanced    Advanced         1

Example 3: Use the same user defined function with functools.partial to pass the percentage 'p = 50'

  1. Load the necessary module.
    >>> from functools import partial
  2. Apply the user defined function to the DataFrame with functools.partial.
    >>> increase_gpa_50 = df.map_row(partial(increase_gpa, p = 50))
  3. Print the result.
    >>> print(increase_gpa_50)
       masters    gpa     stats programming  admitted
    id                                              
    5       no  5.160    Novice      Novice         0
    34     yes  5.775  Advanced    Beginner         0
    13      no  6.000  Advanced      Novice         1
    40     yes  5.925    Novice    Beginner         0
    22     yes  5.190    Novice    Beginner         0
    19     yes  2.970  Advanced    Advanced         0
    36      no  4.500  Advanced      Novice         0
    15     yes  6.000  Advanced    Advanced         1
    7      yes  3.495    Novice      Novice         1
    17      no  5.745  Advanced    Advanced         1

Example 4: Use a lambda function to increase the 'gpa' by 100 percent and return a numpy ndarray

  1. Load the necessary module.
    >>> from numpy import asarray
  2. Create a lambda function.
    >>> increase_gpa_lambda = lambda row, p=20: asarray([row['id'], row['masters'], row['gpa'] + row['gpa'] * p/100, row['stats'], row['programming'], row['admitted']])
  3. Apply the lambda function to the DataFrame.
    >>> increase_gpa_100 = df.map_row(lambda row: increase_gpa_lambda(row, p=100))
  4. Print the result.
    >>> print(increase_gpa_100)
       masters   gpa     stats programming  admitted
    id                                             
    5       no  6.88    Novice      Novice         0
    34     yes  7.70  Advanced    Beginner         0
    13      no  8.00  Advanced      Novice         1
    40     yes  7.90    Novice    Beginner         0
    22     yes  6.92    Novice    Beginner         0
    19     yes  3.96  Advanced    Advanced         0
    36      no  6.00  Advanced      Novice         0
    15     yes  8.00  Advanced    Advanced         1
    7      yes  4.66    Novice      Novice         1
    17      no  7.66  Advanced    Advanced         1