- The function requires the dill package, with the same version installed in both the remote and local environments.
- Teradata recommends using the same Python version in both the remote and local environments.
- user_function: Specifies the user defined function to apply to each row in the teradataml DataFrame.
- The function can be a lambda function, a regular Python function, or a functools.partial object.
- The first argument (positional) of the UDF must be a row in a pandas DataFrame corresponding to the teradataml DataFrame to which it is to be applied.
- A non-lambda function can be passed only when the UDF does not accept any arguments other than the mandatory input row. A user can also use functools.partial and lambda functions for the same, which are especially handy in the following scenarios:
- There is a need to pass positional and/or keyword arguments (lambda).
- There is a need to pass keyword arguments only (functools.partial).
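The distinction between the two calling styles can be sketched in plain Python. Here `scale_gpa` is a hypothetical stand-in for a row UDF, and a local pandas Series stands in for a row; this does not involve the remote environment at all.

```python
from functools import partial
import pandas as pd

# Hypothetical row UDF: the first positional argument is the row,
# extra parameters (here p) follow.
def scale_gpa(row, p=20):
    row['gpa'] = row['gpa'] + row['gpa'] * p / 100
    return row

row = pd.Series({'id': 1, 'gpa': 3.0})

# lambda: can forward positional and/or keyword arguments.
via_lambda = (lambda r: scale_gpa(r, p=40))(row.copy())

# functools.partial: binds keyword arguments ahead of time.
via_partial = partial(scale_gpa, p=40)(row.copy())

print(via_lambda['gpa'], via_partial['gpa'])  # both 4.2
```

Either form produces a callable whose only remaining positional argument is the row, which is what apply expects.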
- The return type of the UDF must be one of the following:
- numpy ndarray
When one-dimensional, it must have the same number of values as output columns.
When two-dimensional, every array contained in the outer array must have the same number of values as output columns.
- pandas Series
- pandas DataFrame
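The accepted return shapes can be illustrated locally with numpy and pandas. The three output columns below are assumed purely for illustration.

```python
import numpy as np
import pandas as pd

# Assume three output columns for illustration.
cols = ['id', 'gpa', 'admitted']

# 1-D ndarray: one value per output column.
out_1d = np.asarray([1, 3.46, 0])

# 2-D ndarray: each inner array has one value per output column.
out_2d = np.asarray([[1, 3.46, 0], [2, 3.00, 1]])

# pandas Series: one entry per output column.
out_series = pd.Series([1, 3.46, 0], index=cols)

# pandas DataFrame: columns match the output column definition.
out_df = pd.DataFrame([[1, 3.46, 0]], columns=cols)

print(out_1d.shape, out_2d.shape, out_series.shape, out_df.shape)
```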
- To use the apply method on the UDF, the dill and pandas packages must be installed in the remote user environment using the install_lib function of the UserEnv class.
- env_name: Specifies the name of the remote user environment or an object of class UserEnv.
- exec_mode: Specifies the mode of execution for the UDF.
Default value is 'REMOTE'.
- chunk_size: Specifies the number of rows read per chunk; in each iteration, an iterator reads one chunk and the UDF is applied to each row in it.
The value passed to this argument affects both the performance and the memory utilized by the function. Default value is 1000.
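The chunked-iteration trade-off can be sketched with a purely local pandas loop. The `apply_in_chunks` helper below is a hypothetical local analogy, not the remote mechanism itself: a larger chunk_size means fewer iterations but more memory held per iteration.

```python
import pandas as pd

def apply_in_chunks(df, func, chunk_size=1000):
    # Process the DataFrame chunk_size rows at a time: fewer, larger
    # chunks reduce iteration overhead but raise peak memory use.
    pieces = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        pieces.append(chunk.apply(func, axis=1))
    return pd.concat(pieces)

df = pd.DataFrame({'gpa': [3.0, 3.5, 4.0, 2.5, 3.8]})
result = apply_in_chunks(df, lambda row: row * 2, chunk_size=2)
print(result['gpa'].tolist())  # [6.0, 7.0, 8.0, 5.0, 7.6]
```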
- returns: Specifies the output column definition corresponding to the output of user_function.
When not specified, the function assumes that the names and types of the output columns are the same as those of the input.
- delimiter: Specifies a delimiter to use when reading columns from a row and writing result columns.
The default value is comma (',').
- This argument cannot be the same as the quotechar argument.
- This argument cannot be a newline character ('\n').
- quotechar: Specifies a character that forces all input and output of the user function to be quoted using this specified character.
Using this argument enables the Analytics Database to distinguish between NULL fields and empty strings. That is, a string with length zero is quoted, while NULL fields are not.
If this character is found in the data, it is escaped by a second quote character.
- This argument cannot be the same as the delimiter argument.
- This argument cannot be a newline character ('\n').
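The NULL-versus-empty-string distinction enabled by quotechar can be sketched with a small formatting helper. `format_row` below is purely illustrative (it is not part of teradataml): an empty string is emitted quoted, a None (NULL) field is left with nothing between the delimiters, and an embedded quote character is escaped by doubling it.

```python
def format_row(values, delimiter=',', quotechar='"'):
    # Hypothetical sketch of quotechar semantics, not the engine's code.
    out = []
    for v in values:
        if v is None:
            out.append('')  # NULL: nothing between delimiters
        else:
            # Escape an embedded quote character with a second one.
            s = str(v).replace(quotechar, quotechar * 2)
            out.append(f'{quotechar}{s}{quotechar}')
    return delimiter.join(out)

print(format_row(['abc', '', None]))  # "abc","",
print(format_row(['say "hi"']))       # "say ""hi"""
```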
- data_partition_column: Specifies the Partition By columns for the data.
Values can be provided as a list if multiple columns are used for partitioning.
If data_partition_column is not specified, the entire result set delivered by the function constitutes a single group or partition.
- data_partition_column cannot be specified along with data_hash_column.
- data_partition_column cannot be specified along with "is_local_order = True".
- data_hash_column: Specifies the column to be used for hashing.
The rows in the input data are redistributed to AMPs based on the hash value of the specified column.
If data_hash_column is not specified, the entire result set delivered by the function constitutes a single group or partition.
- data_hash_column cannot be specified along with data_partition_column.
- data_hash_column cannot be specified along with "is_local_order = False" and data_order_column.
- data_order_column: Specifies the Order By columns for the data.
Values can be provided as a list if multiple columns are used for ordering.
This argument can be used regardless of whether is_local_order is set to 'True' or 'False'.
- data_order_column cannot be specified along with data_hash_column.
- is_local_order: Specifies a boolean value to determine whether the input data is to be ordered locally or not.
data_order_column with is_local_order set to 'False' specifies the order in which the values in a group, or partition, are sorted.
Default value is 'False'. When set to 'True', qualified rows on each AMP are ordered in preparation to be input to a table function.
This argument is ignored if data_order_column is None.
When is_local_order is set to 'True', you must specify data_order_column, and the columns specified in data_order_column are used for local ordering.
- sort_ascending: Specifies a boolean value to determine whether the result set is sorted on the data_order_column column in ascending or descending order.
- When set to the default value 'True', data is sorted in ascending order.
- When set to 'False', data is sorted in descending order.
This argument is ignored if data_order_column is None.
- nulls_first: Specifies a boolean value to determine whether NULLs are listed first or last during ordering.
NULLs are listed first when this argument is set to the default value 'True', and listed last when set to 'False'.
This argument is ignored if data_order_column is None.
- style: Specifies how input is passed to, and output is generated by, the apply_command.
Default value is 'csv', and this is the only value supported for this argument.
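The partitioning and ordering arguments can be pictured with a local pandas analogy. The sketch below is only an illustration of the semantics, not what the Analytics Database does: data_partition_column behaves like grouping rows by a column, and data_order_column with sort_ascending=False behaves like sorting rows within each partition in descending order.

```python
import pandas as pd

df = pd.DataFrame({
    'masters': ['yes', 'yes', 'no', 'no'],
    'gpa': [3.46, 2.65, 3.00, 4.00],
})

# data_partition_column ~ group rows by 'masters';
# data_order_column='gpa' with sort_ascending=False ~ sort each
# partition on 'gpa' in descending order.
partitioned = {
    key: part.sort_values('gpa', ascending=False)
    for key, part in df.groupby('masters')
}
print(partitioned['yes']['gpa'].tolist())  # [3.46, 2.65]
print(partitioned['no']['gpa'].tolist())   # [4.0, 3.0]
```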
Examples

Setup
- Create a Python 3.7.9 environment.
>>> env = create_env('testenv', 'python_3.7.9', 'Test environment')
User environment testenv created.
- Install packages dill and pandas in remote user environment.
>>> env.install_lib(['pandas','dill'])
Request to install libraries initiated successfully in the remote user environment testenv. Check the status using status() with the claim id 'ef255030-1be2-4d4a-9d47-12cd4365a003'.
- Check the status on the installation.
>>> env.status('ef255030-1be2-4d4a-9d47-12cd4365a003')
                               Claim Id     File/Libs  Method Name     Stage             Timestamp Additional Details
0  ef255030-1be2-4d4a-9d47-12cd4365a003  pandas, dill  install_lib   Started  2022-08-04T04:27:56Z
1  ef255030-1be2-4d4a-9d47-12cd4365a003  pandas, dill  install_lib  Finished  2022-08-04T04:29:12Z
- Load example dataset 'admissions_train'.
>>> load_example_data("dataframe", "admissions_train")
>>> df = DataFrame('admissions_train')
>>> print(df)
   masters   gpa     stats programming  admitted
id
22     yes  3.46    Novice    Beginner         0
36      no  3.00  Advanced      Novice         0
15     yes  4.00  Advanced    Advanced         1
38     yes  2.65  Advanced    Beginner         1
5       no  3.44    Novice      Novice         0
17      no  3.83  Advanced    Advanced         1
34     yes  3.85  Advanced    Beginner         0
13      no  4.00  Advanced      Novice         1
26     yes  3.57  Advanced    Advanced         1
19     yes  1.98  Advanced    Advanced         0
The following examples use the 'admissions_train' dataset to increase the 'gpa' by a given percentage.
Example 1: Create a user defined function to increase the 'gpa' by a given percentage
- Create a user defined function.
>>> def increase_gpa(row, p=20):
...     row['gpa'] = row['gpa'] + row['gpa'] * p/100
...     return row
- Apply the user defined function to the DataFrame. Since the output of the user defined function has the same columns and types as the input, the 'returns' argument is skipped.
>>> increase_gpa_20 = df.apply(increase_gpa, env_name='testenv')
- Print the result.
>>> print(increase_gpa_20)
   masters    gpa     stats programming  admitted
id
22     yes  4.152    Novice    Beginner         0
36      no  3.600  Advanced      Novice         0
15     yes  4.800  Advanced    Advanced         1
38     yes  3.180  Advanced    Beginner         1
5       no  4.128    Novice      Novice         0
17      no  4.596  Advanced    Advanced         1
34     yes  4.620  Advanced    Beginner         0
13      no  4.800  Advanced      Novice         1
26     yes  4.284  Advanced    Advanced         1
19     yes  2.376  Advanced    Advanced         0
Example 2: Use the same user defined function with a lambda notation to pass the percentage, 'p = 40'
- Apply the user defined function to the DataFrame with a lambda notation.
>>> increase_gpa_40 = df.apply(lambda row: increase_gpa(row, p = 40), env_name='testenv')
- Print the result.
>>> print(increase_gpa_40)
   masters    gpa     stats programming  admitted
id
22     yes  4.844    Novice    Beginner         0
36      no  4.200  Advanced      Novice         0
15     yes  5.600  Advanced    Advanced         1
38     yes  3.710  Advanced    Beginner         1
5       no  4.816    Novice      Novice         0
17      no  5.362  Advanced    Advanced         1
34     yes  5.390  Advanced    Beginner         0
13      no  5.600  Advanced      Novice         1
26     yes  4.998  Advanced    Advanced         1
19     yes  2.772  Advanced    Advanced         0
Example 3: Use the same user defined function with functools.partial to pass the percentage, 'p = 50'
- Load required library.
>>> from functools import partial
- Apply the user defined function to the DataFrame with functools.partial.
>>> increase_gpa_50 = df.apply(partial(increase_gpa, p = 50), env_name='testenv')
- Print the result.
>>> print(increase_gpa_50)
   masters    gpa     stats programming  admitted
id
13      no  6.000  Advanced      Novice         1
26     yes  5.355  Advanced    Advanced         1
5       no  5.160    Novice      Novice         0
19     yes  2.970  Advanced    Advanced         0
15     yes  6.000  Advanced    Advanced         1
40     yes  5.925    Novice    Beginner         0
7      yes  3.495    Novice      Novice         1
22     yes  5.190    Novice    Beginner         0
36      no  4.500  Advanced      Novice         0
38     yes  3.975  Advanced    Beginner         1
Example 4: Use a lambda function to double 'gpa', and return numpy ndarray
- Load required library.
>>> from numpy import asarray
- Create a lambda function.
>>> inc_gpa_lambda = lambda row, p=20: asarray([row['id'], row['masters'],
...                                             row['gpa'] + row['gpa'] * p/100,
...                                             row['stats'], row['programming'],
...                                             row['admitted']])
- Apply the lambda function to the DataFrame.
>>> increase_gpa_100 = df.apply(lambda row: inc_gpa_lambda(row, p=100), env_name='testenv')
- Print the result.
>>> print(increase_gpa_100)
   masters   gpa     stats programming  admitted
id
13      no  8.00  Advanced      Novice         1
26     yes  7.14  Advanced    Advanced         1
5       no  6.88    Novice      Novice         0
19     yes  3.96  Advanced    Advanced         0
15     yes  8.00  Advanced    Advanced         1
40     yes  7.90    Novice    Beginner         0
7      yes  4.66    Novice      Novice         1
22     yes  6.92    Novice    Beginner         0
36      no  6.00  Advanced      Novice         0
38     yes  5.30  Advanced    Beginner         1