- The function requires the dill package, with the same version installed in both the remote and local environments.
- Teradata recommends using the same Python version in both the remote and local environments.
- user_function: Specifies the user defined function to apply to each row in the teradataml DataFrame.
- The function can be a lambda function, a regular Python function, or a functools.partial object.
- The first argument (positional) of the UDF must be a row in a pandas DataFrame corresponding to the teradataml DataFrame to which it is to be applied.
- A non-lambda function can be passed only when the UDF does not accept any arguments other than the mandatory input row. A user can also use functools.partial and lambda functions for the same, which are especially handy in the following scenarios:
- There is a need to pass positional and/or keyword arguments (lambda).
- There is a need to pass keyword arguments only (functools.partial).
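The distinction between the two calling styles can be sketched in plain Python. Here `scale_gpa` is a hypothetical stand-in for a row UDF, and a local pandas Series stands in for a row; this does not involve the remote environment at all.

```python
from functools import partial
import pandas as pd

# Hypothetical row UDF: the first positional argument is the row,
# extra parameters (here p) follow.
def scale_gpa(row, p=20):
    row['gpa'] = row['gpa'] + row['gpa'] * p / 100
    return row

row = pd.Series({'id': 1, 'gpa': 3.0})

# lambda: can forward positional and/or keyword arguments.
via_lambda = (lambda r: scale_gpa(r, p=40))(row.copy())

# functools.partial: binds keyword arguments ahead of time.
via_partial = partial(scale_gpa, p=40)(row.copy())

print(via_lambda['gpa'], via_partial['gpa'])  # both 4.2
```

Either form produces a callable whose only remaining positional argument is the row, which is what apply expects.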
- The return type of the UDF must be one of the following:
- numpy ndarray
When one-dimensional, it must have the same number of values as output columns.
When two-dimensional, every array contained in the outer array must have the same number of values as output columns.
- pandas Series
- pandas DataFrame
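The accepted return shapes can be illustrated locally with numpy and pandas. The three output columns below are assumed purely for illustration.

```python
import numpy as np
import pandas as pd

# Assume three output columns for illustration.
cols = ['id', 'gpa', 'admitted']

# 1-D ndarray: one value per output column.
out_1d = np.asarray([1, 3.46, 0])

# 2-D ndarray: each inner array has one value per output column.
out_2d = np.asarray([[1, 3.46, 0], [2, 3.00, 1]])

# pandas Series: one entry per output column.
out_series = pd.Series([1, 3.46, 0], index=cols)

# pandas DataFrame: columns match the output column definition.
out_df = pd.DataFrame([[1, 3.46, 0]], columns=cols)

print(out_1d.shape, out_2d.shape, out_series.shape, out_df.shape)
```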
- To use the apply method on the UDF, the dill and pandas packages must be installed in the remote user environment using the install_lib function of the UserEnv class.
- env_name: Specifies the name of the remote user environment or an object of class UserEnv.
- exec_mode: Specifies the mode of execution for the UDF.
Default value is 'REMOTE'.
- chunk_size: Specifies the number of rows read per chunk; in each iteration, an iterator reads one chunk and the UDF is applied to each row in it.
The value passed to this argument affects both the performance and the memory utilized by the function. Default value is 1000.
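The chunked-iteration trade-off can be sketched with a purely local pandas loop. The `apply_in_chunks` helper below is a hypothetical local analogy, not the remote mechanism itself: a larger chunk_size means fewer iterations but more memory held per iteration.

```python
import pandas as pd

def apply_in_chunks(df, func, chunk_size=1000):
    # Process the DataFrame chunk_size rows at a time: fewer, larger
    # chunks reduce iteration overhead but raise peak memory use.
    pieces = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        pieces.append(chunk.apply(func, axis=1))
    return pd.concat(pieces)

df = pd.DataFrame({'gpa': [3.0, 3.5, 4.0, 2.5, 3.8]})
result = apply_in_chunks(df, lambda row: row * 2, chunk_size=2)
print(result['gpa'].tolist())  # [6.0, 7.0, 8.0, 5.0, 7.6]
```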
- returns: Specifies the output column definition corresponding to the output of user_function.
When not specified, the function assumes that the names and types of the output columns are the same as those of the input.
- delimiter: Specifies a delimiter to use when reading columns from a row and writing result columns.
The default value is comma (',').
- This argument cannot be the same as the quotechar argument.
- This argument cannot be a newline character ('\n').
- quotechar: Specifies a character that forces all input and output of the user function to be quoted using this specified character.
Using this argument enables the Analytics Database to distinguish between NULL fields and empty strings. That is, a string with length zero is quoted, while NULL fields are not.
If this character is found in the data, it is escaped by a second quote character.
- This argument cannot be the same as the delimiter argument.
- This argument cannot be a newline character ('\n').
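The NULL-versus-empty-string distinction enabled by quotechar can be sketched with a small formatting helper. `format_row` below is purely illustrative (it is not part of teradataml): an empty string is emitted quoted, a None (NULL) field is left with nothing between the delimiters, and an embedded quote character is escaped by doubling it.

```python
def format_row(values, delimiter=',', quotechar='"'):
    # Hypothetical sketch of quotechar semantics, not the engine's code.
    out = []
    for v in values:
        if v is None:
            out.append('')  # NULL: nothing between delimiters
        else:
            # Escape an embedded quote character with a second one.
            s = str(v).replace(quotechar, quotechar * 2)
            out.append(f'{quotechar}{s}{quotechar}')
    return delimiter.join(out)

print(format_row(['abc', '', None]))  # "abc","",
print(format_row(['say "hi"']))       # "say ""hi"""
```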
- data_partition_column: Specifies the Partition By columns for the data.
Values can be provided as a list if multiple columns are used for partitioning.
If data_partition_column is not specified, the entire result set delivered by the function constitutes a single group or partition.
- data_partition_column cannot be specified along with data_hash_column.
- data_partition_column cannot be specified along with "is_local_order = True".
- data_hash_column: Specifies the column to be used for hashing.
The rows in the input data are redistributed to AMPs based on the hash value of the specified column.
If data_hash_column is not specified, the entire result set delivered by the function constitutes a single group or partition.
- data_hash_column cannot be specified along with data_partition_column.
- data_hash_column cannot be specified along with "is_local_order = False" and data_order_column.
- data_order_column: Specifies the Order By columns for the data.
Values can be provided as a list if multiple columns are used for ordering.
This argument can be used regardless of whether is_local_order is set to 'True' or 'False'.
- data_order_column cannot be specified along with data_hash_column.
- is_local_order: Specifies a boolean value to determine whether the input data is to be ordered locally or not.
data_order_column with is_local_order set to 'False' specifies the order in which the values in a group, or partition, are sorted.
Default value is 'False'. When set to 'True', qualified rows on each AMP are ordered in preparation to be input to a table function.
This argument is ignored if data_order_column is None.
When is_local_order is set to 'True', you must specify data_order_column, and the columns specified in data_order_column are used for local ordering.
- sort_ascending: Specifies a boolean value to determine whether the result set is sorted on the data_order_column column in ascending or descending order.
- When set to the default value 'True', data is sorted in ascending order.
- When set to 'False', data is sorted in descending order.
This argument is ignored if data_order_column is None.
- nulls_first: Specifies a boolean value to determine whether NULLs are listed first or last during ordering.
NULLs are listed first when this argument is set to the default value 'True', and listed last when set to 'False'.
This argument is ignored if data_order_column is None.
- style: Specifies how input is passed to, and output is generated by, the apply_command.
Default value is 'csv', and this is the only value supported for this argument.
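The partitioning and ordering arguments can be pictured with a local pandas analogy. The sketch below is only an illustration of the semantics, not what the Analytics Database does: data_partition_column behaves like grouping rows by a column, and data_order_column with sort_ascending=False behaves like sorting rows within each partition in descending order.

```python
import pandas as pd

df = pd.DataFrame({
    'masters': ['yes', 'yes', 'no', 'no'],
    'gpa': [3.46, 2.65, 3.00, 4.00],
})

# data_partition_column ~ group rows by 'masters';
# data_order_column='gpa' with sort_ascending=False ~ sort each
# partition on 'gpa' in descending order.
partitioned = {
    key: part.sort_values('gpa', ascending=False)
    for key, part in df.groupby('masters')
}
print(partitioned['yes']['gpa'].tolist())  # [3.46, 2.65]
print(partitioned['no']['gpa'].tolist())   # [4.0, 3.0]
```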
Examples

Setup
- Create a Python 3.7.9 environment.
>>> env = create_env('testenv', 'python_3.7.9', 'Test environment')
User environment testenv created.
- Install packages dill and pandas in remote user environment.
>>> env.install_lib(['pandas','dill'])
Request to install libraries initiated successfully in the remote user environment testenv. Check the status using status() with the claim id 'ef255030-1be2-4d4a-9d47-12cd4365a003'.
- Check the status on the installation.
>>> env.status('ef255030-1be2-4d4a-9d47-12cd4365a003')
                               Claim Id     File/Libs  Method Name     Stage             Timestamp Additional Details
0  ef255030-1be2-4d4a-9d47-12cd4365a003  pandas, dill  install_lib   Started  2022-08-04T04:27:56Z
1  ef255030-1be2-4d4a-9d47-12cd4365a003  pandas, dill  install_lib  Finished  2022-08-04T04:29:12Z
- Load example dataset 'admissions_train'.
>>> load_example_data("dataframe", "admissions_train")
>>> df = DataFrame('admissions_train')
>>> print(df)
   masters   gpa     stats programming  admitted
id
22     yes  3.46    Novice    Beginner         0
36      no  3.00  Advanced      Novice         0
15     yes  4.00  Advanced    Advanced         1
38     yes  2.65  Advanced    Beginner         1
5       no  3.44    Novice      Novice         0
17      no  3.83  Advanced    Advanced         1
34     yes  3.85  Advanced    Beginner         0
13      no  4.00  Advanced      Novice         1
26     yes  3.57  Advanced    Advanced         1
19     yes  1.98  Advanced    Advanced         0
The following examples use the 'admissions_train' dataset to increase the 'gpa' by a given percentage.
Example 1: Create a user defined function to increase the 'gpa' by a given percentage
- Create a user defined function.
>>> def increase_gpa(row, p=20):
...     row['gpa'] = row['gpa'] + row['gpa'] * p/100
...     return row
- Apply the user defined function to the DataFrame. Since the output of the user defined function has the same columns and types as the input, the 'returns' argument is skipped.
>>> increase_gpa_20 = df.apply(increase_gpa, env_name='testenv')
- Print the result.
>>> print(increase_gpa_20)
   masters    gpa     stats programming  admitted
id
22     yes  4.152    Novice    Beginner         0
36      no  3.600  Advanced      Novice         0
15     yes  4.800  Advanced    Advanced         1
38     yes  3.180  Advanced    Beginner         1
5       no  4.128    Novice      Novice         0
17      no  4.596  Advanced    Advanced         1
34     yes  4.620  Advanced    Beginner         0
13      no  4.800  Advanced      Novice         1
26     yes  4.284  Advanced    Advanced         1
19     yes  2.376  Advanced    Advanced         0
Example 2: Use the same user defined function with a lambda notation to pass the percentage, 'p = 40'
- Apply the user defined function to the DataFrame with a lambda notation.
>>> increase_gpa_40 = df.apply(lambda row: increase_gpa(row, p = 40), env_name='testenv')
- Print the result.
>>> print(increase_gpa_40)
   masters    gpa     stats programming  admitted
id
22     yes  4.844    Novice    Beginner         0
36      no  4.200  Advanced      Novice         0
15     yes  5.600  Advanced    Advanced         1
38     yes  3.710  Advanced    Beginner         1
5       no  4.816    Novice      Novice         0
17      no  5.362  Advanced    Advanced         1
34     yes  5.390  Advanced    Beginner         0
13      no  5.600  Advanced      Novice         1
26     yes  4.998  Advanced    Advanced         1
19     yes  2.772  Advanced    Advanced         0
Example 3: Use the same user defined function with functools.partial to pass the percentage, 'p = 50'
- Load required library.
>>> from functools import partial
- Apply the user defined function to the DataFrame with functools.partial.
>>> increase_gpa_50 = df.apply(partial(increase_gpa, p = 50), env_name='testenv')
- Print the result.
>>> print(increase_gpa_50)
   masters    gpa     stats programming  admitted
id
13      no  6.000  Advanced      Novice         1
26     yes  5.355  Advanced    Advanced         1
5       no  5.160    Novice      Novice         0
19     yes  2.970  Advanced    Advanced         0
15     yes  6.000  Advanced    Advanced         1
40     yes  5.925    Novice    Beginner         0
7      yes  3.495    Novice      Novice         1
22     yes  5.190    Novice    Beginner         0
36      no  4.500  Advanced      Novice         0
38     yes  3.975  Advanced    Beginner         1
Example 4: Use a lambda function to double 'gpa', and return numpy ndarray
- Load required library.
>>> from numpy import asarray
- Create a lambda function.
>>> inc_gpa_lambda = lambda row, p=20: asarray([row['id'], row['masters'],
...                                             row['gpa'] + row['gpa'] * p/100,
...                                             row['stats'], row['programming'],
...                                             row['admitted']])
- Apply the lambda function to the DataFrame.
>>> increase_gpa_100 = df.apply(lambda row: inc_gpa_lambda(row, p=100), env_name='testenv')
- Print the result.
>>> print(increase_gpa_100)
   masters   gpa     stats programming  admitted
id
13      no  8.00  Advanced      Novice         1
26     yes  7.14  Advanced    Advanced         1
5       no  6.88    Novice      Novice         0
19     yes  3.96  Advanced    Advanced         0
15     yes  8.00  Advanced    Advanced         1
40     yes  7.90    Novice    Beginner         0
7      yes  4.66    Novice      Novice         1
22     yes  6.92    Novice    Beginner         0
36      no  6.00  Advanced      Novice         0
38     yes  5.30  Advanced    Beginner         1