Use the map_row() method to apply a function to every row in a teradataml DataFrame and return a teradataml DataFrame.
- user_function: Specifies the user defined function to apply to each row in the teradataml DataFrame.
This can be either a lambda function, a regular Python function, or an object of functools.partial.
A non-lambda function can be passed only when the user defined function does not accept any arguments other than the mandatory input - the input row.
A user can also pass additional arguments to the function using functools.partial or a lambda function:
- Use a lambda function when positional arguments, keyword arguments, or both need to be passed.
- Use functools.partial when only keyword arguments need to be passed.
See the "Functions, Inputs and Outputs" section in map_row() and map_partition() Methods for details about the input and output of this argument.
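The three ways of passing a function can be illustrated in plain Python. This is a standalone sketch: `increase_gpa` is a hypothetical per-row function, and a plain dict stands in for the row object that map_row() actually passes.

```python
from functools import partial

# Hypothetical per-row function: the first argument is the mandatory
# input row (a dict here, standing in for the real row object), and
# 'p' is an extra keyword argument.
def increase_gpa(row, p=20):
    row['gpa'] = row['gpa'] + row['gpa'] * p / 100
    return row

# 1. Plain function: allowed only because 'p' has a default value,
#    so the function accepts the row alone.
plain = increase_gpa({'gpa': 3.0})

# 2. Lambda: wraps the call to pass positional or keyword arguments.
f_lambda = lambda row: increase_gpa(row, p=40)
wrapped = f_lambda({'gpa': 3.0})

# 3. functools.partial: binds keyword arguments only.
f_partial = partial(increase_gpa, p=50)
bound = f_partial({'gpa': 3.0})
```

Each of the three callables takes a single row argument, which is the shape map_row() expects.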
- exec_mode: Specifies the mode of execution for the user defined function. Permitted values:
- IN-DB: Execute the function on data in the teradataml DataFrame in Vantage.
This is the default value.
- LOCAL: Execute the function locally on sample data (at most num_rows rows) from the teradataml DataFrame.
- chunk_size: Specifies the number of rows to be read in a chunk in each iteration using an iterator to apply the user defined function to each row in the chunk.
Varying the value passed to this argument affects the performance and the memory utilization. Default value is 1000.
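The chunked-reading behavior can be sketched in plain Python. This is a simplified illustration of the concept, not the actual implementation inside Vantage; `map_rows_in_chunks` is a hypothetical helper.

```python
from itertools import islice

def map_rows_in_chunks(rows, user_function, chunk_size=1000):
    # Read at most 'chunk_size' rows per iteration and apply the
    # user defined function to each row in the chunk. A larger
    # chunk_size means fewer iterations but more memory per chunk.
    it = iter(rows)
    results = []
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        results.extend(user_function(row) for row in chunk)
    return results

# Five rows processed two at a time: three iterations in total.
doubled = map_rows_in_chunks(range(5), lambda r: r * 2, chunk_size=2)
```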
- num_rows: Specifies the maximum number of sample rows to use from the teradataml DataFrame to apply the user defined function to when exec_mode is 'LOCAL'.
- returns: Specifies the output column definition corresponding to the output of user_function.
When not specified, the function assumes that the names and types of the output columns are the same as those of the input.
Do not use Teradata reserved keywords as column names unless the column names of the output DataFrame exactly match those of the input DataFrame. You can list the reserved keywords, or check whether a string is a reserved keyword, using the list_td_reserved_keywords() function.
- delimiter: Specifies the delimiter to use when reading columns from a row and writing result columns. The default value is '\t'.
- This argument cannot be the same as quotechar argument.
- This argument cannot be newline character '\n'.
- quotechar: Specifies a character that forces all input and output of the user function to be quoted using this specified character.
Using this argument enables the Analytics Database to distinguish between NULL fields and empty strings. A string with length zero is quoted, while NULL fields are not.
If this character is found in the data, it will be escaped by a second quote character.
- This argument cannot be the same as delimiter argument.
- This argument cannot be newline character '\n'.
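The NULL-versus-empty-string distinction can be illustrated with a small Python sketch. This is an analogy only; the actual quoting is performed by the Analytics Database, and `serialize` is a hypothetical helper.

```python
def serialize(fields, delimiter='\t', quotechar='"'):
    # Serialize one row of output fields. An empty string is quoted
    # (so it survives as a zero-length value), while None (standing
    # in for NULL) is written as a bare empty field.
    out = []
    for f in fields:
        if f is None:
            out.append('')              # NULL: not quoted
        elif f == '':
            out.append(quotechar * 2)   # empty string: quoted
        else:
            out.append(str(f))
    return delimiter.join(out)

# The middle field is an empty string; the last field is NULL.
line = serialize(['abc', '', None])
```

On the read side, the quoted field is unambiguously an empty string, while the unquoted empty field is NULL.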
- auth: Specifies an authorization to use when running the user_function.
- charset: Specifies the character encoding for data.
Permitted values are 'utf-16' and 'latin'.
- data_order_column: Specifies the Order By columns for the teradataml DataFrame.
Values to this argument can be provided as a list, if multiple columns are used for ordering.
This argument is used in both cases: "is_local_order = True" and "is_local_order = False".
is_local_order must be set to 'True' when data_order_column is used with data_hash_column.
- is_local_order: Specifies a boolean value to determine whether the input data is to be ordered locally or not.
When this argument is set to 'False' (default), data_order_column specifies the order in which the values in a group, or partition, are sorted.
When this argument is set to 'True', qualified rows on each AMP are ordered in preparation to be input to a table function.
This argument is ignored, if data_order_column is None.
- This argument cannot be specified along with data_partition_column.
- When this argument is set to True, data_order_column must be specified, and the columns specified in data_order_column are used for local ordering.
- nulls_first: Specifies a boolean value to determine whether NULLS are listed first or last during ordering.
NULLS are listed first when this argument is set to 'True', and last when set to 'False'.
This argument is ignored, if data_order_column is None.
- sort_ascending: Specifies a boolean value to determine if the result set is to be sorted on the data_order_column column in ascending or descending order.
The sorting is ascending when this argument is set to 'True', and descending when set to 'False'.
This argument is ignored, if data_order_column is None.
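The combined effect of sort_ascending and nulls_first can be sketched in plain Python. This is an illustration only; the actual ordering happens in Vantage, and `order_rows` is a hypothetical helper.

```python
def order_rows(values, sort_ascending=True, nulls_first=True):
    # Separate the NULLs (None) so they can be placed first or last,
    # then sort the remaining values in the requested direction.
    nulls = [v for v in values if v is None]
    rest = sorted((v for v in values if v is not None),
                  reverse=not sort_ascending)
    return nulls + rest if nulls_first else rest + nulls

# Defaults: ascending order with NULLs listed first.
asc = order_rows([3, None, 1, 2])
# Descending order with NULLs listed last.
desc = order_rows([3, None, 1, 2], sort_ascending=False, nulls_first=False)
```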
The method returns:
- teradataml DataFrame if exec_mode is "IN-DB".
- Pandas DataFrame if exec_mode is "LOCAL".
The method also accepts the same arguments that Script accepts, except that returns is optional and the method does not accept the data, data_hash_column, and data_partition_column arguments. When returns is not provided, the method assumes that the function's output has columns with the same names and types as the input teradataml DataFrame.
Example Prerequisite
The examples use the 'admissions_train' dataset and apply a user defined function that increases the value in the 'gpa' column by a given percentage.
- Load the example data.
>>> load_example_data("dataframe", "admissions_train")
- Create a DataFrame.
>>> df = DataFrame('admissions_train')
>>> print(df)
   masters   gpa     stats programming  admitted
id
5       no  3.44    Novice      Novice         0
34     yes  3.85  Advanced    Beginner         0
13      no  4.00  Advanced      Novice         1
40     yes  3.95    Novice    Beginner         0
22     yes  3.46    Novice    Beginner         0
19     yes  1.98  Advanced    Advanced         0
36      no  3.00  Advanced      Novice         0
15     yes  4.00  Advanced    Advanced         1
7      yes  2.33    Novice      Novice         1
17      no  3.83  Advanced    Advanced         1
Example 1: Create a user defined function to increase the 'gpa' by the percentage provided
In this example, the input to and the output from the function is a Pandas Series object.
- Create a user defined function.
>>> def increase_gpa(row, p=20):
...     row['gpa'] = row['gpa'] + row['gpa'] * p/100
...     return row
- Apply the user defined function to the DataFrame.
Since the output of the user defined function expects the same columns with the same types, you can skip passing the returns argument.
>>> increase_gpa_20 = df.map_row(increase_gpa)
- Print the result.
>>> print(increase_gpa_20)
   masters    gpa     stats programming  admitted
id
13      no  4.800  Advanced      Novice         1
36      no  3.600  Advanced      Novice         0
15     yes  4.800  Advanced    Advanced         1
40     yes  4.740    Novice    Beginner         0
22     yes  4.152    Novice    Beginner         0
38     yes  3.180  Advanced    Beginner         1
26     yes  4.284  Advanced    Advanced         1
5       no  4.128    Novice      Novice         0
7      yes  2.796    Novice      Novice         1
19     yes  2.376  Advanced    Advanced         0
Example 2: Use the same user defined function with a lambda notation to pass the percentage 'p = 40'
- Apply the user defined function to the DataFrame with a lambda notation.
>>> increase_gpa_40 = df.map_row(lambda row: increase_gpa(row, p = 40))
- Print the result.
>>> print(increase_gpa_40)
   masters    gpa     stats programming  admitted
id
5       no  4.816    Novice      Novice         0
34     yes  5.390  Advanced    Beginner         0
13      no  5.600  Advanced      Novice         1
40     yes  5.530    Novice    Beginner         0
22     yes  4.844    Novice    Beginner         0
19     yes  2.772  Advanced    Advanced         0
36      no  4.200  Advanced      Novice         0
15     yes  5.600  Advanced    Advanced         1
7      yes  3.262    Novice      Novice         1
17      no  5.362  Advanced    Advanced         1
Example 3: Use the same user defined function with functools.partial to pass the percentage 'p = 50'
- Load the necessary module.
>>> from functools import partial
- Apply the user defined function to the DataFrame with functools.partial.
>>> increase_gpa_50 = df.map_row(partial(increase_gpa, p = 50))
- Print the result.
>>> print(increase_gpa_50)
   masters    gpa     stats programming  admitted
id
5       no  5.160    Novice      Novice         0
34     yes  5.775  Advanced    Beginner         0
13      no  6.000  Advanced      Novice         1
40     yes  5.925    Novice    Beginner         0
22     yes  5.190    Novice    Beginner         0
19     yes  2.970  Advanced    Advanced         0
36      no  4.500  Advanced      Novice         0
15     yes  6.000  Advanced    Advanced         1
7      yes  3.495    Novice      Novice         1
17      no  5.745  Advanced    Advanced         1
Example 4: Use a lambda function to increase the 'gpa' by 100 percent, and return numpy ndarray
- Load the necessary module.
>>> from numpy import asarray
- Create a lambda function.
>>> increase_gpa_lambda = lambda row, p=20: asarray([row['id'], row['masters'], row['gpa'] + row['gpa'] * p/100, row['stats'], row['programming'], row['admitted']])
- Apply the lambda function to the DataFrame.
>>> increase_gpa_100 = df.map_row(lambda row: increase_gpa_lambda(row, p=100))
- Print the result.
>>> print(increase_gpa_100)
   masters   gpa     stats programming  admitted
id
5       no  6.88    Novice      Novice         0
34     yes  7.70  Advanced    Beginner         0
13      no  8.00  Advanced      Novice         1
40     yes  7.90    Novice    Beginner         0
22     yes  6.92    Novice    Beginner         0
19     yes  3.96  Advanced    Advanced         0
36      no  6.00  Advanced      Novice         0
15     yes  8.00  Advanced    Advanced         1
7      yes  4.66    Novice      Novice         1
17      no  7.66  Advanced    Advanced         1