Teradata® Package for Python Function Reference - 20.00
- Deployment: VantageCloud, VantageCore
- Edition: Enterprise, IntelliFlex, VMware
- Product: Teradata Package for Python
- Product Category: Teradata Vantage
- Release Number: 20.00.00.03
- Published: December 2024
- teradataml.dataframe.dataframe.DataFrame.apply = apply(self, user_function, exec_mode='REMOTE', chunk_size=1000, num_rows=1000, **kwargs)
- DESCRIPTION:
Function to apply a user-defined function (UDF) to each row in the
teradataml DataFrame, leveraging the Apply Table Operator of the Open
Analytics Framework.
Notes:
1. The function requires the dill package, at the same version, in both
the remote and local environments.
2. Teradata recommends using the same Python version in both the remote
and local environments.
PARAMETERS:
user_function:
Required Argument.
Specifies the user defined function to apply to each row in
the teradataml DataFrame.
Types: function or functools.partial
Notes:
* This can be either a lambda function, a regular Python
function, or an object of functools.partial.
* The first (positional) argument to the UDF must be a row
in a Pandas DataFrame corresponding to the teradataml
DataFrame to which it is applied.
* A non-lambda function can be passed only when the UDF
does not accept any arguments other than
the mandatory input - the input row.
functools.partial and lambda functions can be used
instead, and are especially handy when:
* there is a need to pass positional and/or keyword
arguments (lambda).
* there is a need to pass keyword arguments only
(functools.partial).
* The return type of the UDF must be one
of the following:
* numpy ndarray
* when one-dimensional, having the same number of
values as output columns.
* when two-dimensional, every array contained in
the outer array having the same number of values
as output columns.
* pandas Series
* pandas DataFrame
* To use apply on the UDF, the packages dill and pandas must be
installed in the remote user environment using the install_lib
function of the UserEnv class. Refer to help(UserEnv.install_lib).
exec_mode:
Optional Argument.
Specifies the mode of execution for the UDF.
Default value: 'REMOTE'
Types: str
chunk_size:
Optional Argument.
Specifies the number of rows to be read in a chunk in each
iteration, using an iterator; the UDF is applied to each row
in the chunk.
Varying the value passed to this argument affects the
performance and the memory utilized by the function.
Default value: 1000
Types: int
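For example, a smaller chunk size can reduce memory use at the
cost of more iterations. A minimal sketch (the value shown and
the 'increase_gpa' UDF from the EXAMPLES section are
illustrative):
>>> res = df.apply(increase_gpa, env_name='testenv', chunk_size=100)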
num_rows:
For future use.
env_name:
Required Argument.
Specifies the name of the remote user environment or an object of
class UserEnv.
Types: str or object of class UserEnv.
returns:
Optional Argument.
Specifies the output column definition corresponding to the
output of "user_function".
When not specified, the function assumes that the names and
types of the output columns are the same as those of the input.
Types: Dictionary specifying column name to
teradatasqlalchemy type mapping.
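A minimal sketch of "returns", assuming a hypothetical UDF
'get_gpa' that emits only the 'id' and 'gpa' columns:
>>> from teradatasqlalchemy.types import INTEGER, FLOAT
>>> # 'get_gpa' is a hypothetical UDF returning two columns per row.
>>> res = df.apply(get_gpa, env_name='testenv',
...                returns={"id": INTEGER(), "gpa": FLOAT()})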
delimiter:
Optional Argument.
Specifies a delimiter to use when reading columns from a row and
writing result columns.
Default value: ','
Types: str with one character
Notes:
* This argument cannot be the same as the "quotechar" argument.
* This argument cannot be a newline character ('\n').
quotechar:
Optional Argument.
Specifies a character that forces all input to and output from
the user function to be quoted using this character.
Using this argument enables the Advanced SQL Engine to
distinguish between NULL fields and empty strings.
A string with length zero is quoted, while NULL fields are not.
If this character is found in the data, it will be escaped by a
second quote character.
Types: str with one character
Notes:
* This argument cannot be the same as the "delimiter" argument.
* This argument cannot be a newline character ('\n').
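A minimal sketch combining "delimiter" and "quotechar" (the
character choices are illustrative; 'increase_gpa' is the UDF
defined in the EXAMPLES section):
>>> res = df.apply(increase_gpa, env_name='testenv',
...                delimiter='|', quotechar='"')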
data_partition_column:
Optional Argument.
Specifies the Partition By columns for the teradataml DataFrame.
Values to this argument can be provided as a list if multiple
columns are used for partitioning. If "data_partition_column" is
not specified, then the entire result set delivered by the
function constitutes a single group or partition.
Default Value: ANY
Types: str OR list of Strings (str)
Notes:
1) "data_partition_column" cannot be specified along with "data_hash_column".
2) "data_partition_column" cannot be specified along with "is_local_order = True".
data_hash_column:
Optional Argument.
Specifies the column to be used for hashing.
The rows in the teradataml DataFrame are redistributed to AMPs
based on the hash value of the column specified.
If "data_hash_column" is not specified, then the entire result
set delivered by the function constitutes a single group or
partition.
Types: str
Notes:
1. "data_hash_column" cannot be specified along with "data_partition_column".
2. "data_hash_column" cannot be specified along with "is_local_order=False" and
"data_order_column".
data_order_column:
Optional Argument.
Specifies the Order By columns for the teradataml DataFrame.
Values to this argument can be provided as a list if multiple
columns are used for ordering.
This argument is used in both cases:
"is_local_order = True" and "is_local_order = False".
Types: str OR list of Strings (str)
Note:
"data_order_column" cannot be specified along with "data_hash_column".
is_local_order:
Optional Argument.
Specifies a boolean value to determine whether the input data is
to be ordered locally or not.
"data_order_column" with "is_local_order" set to 'False'
specifies the order in which the values in a group, or
partition, are sorted.
When this argument is set to 'True', qualified rows on each AMP
are ordered in preparation to be input to a table function.
This argument is ignored, if "data_order_column" is None.
Default value: False
Types: bool
Note:
When "is_local_order" is set to 'True', "data_order_column" should be
specified, and the columns specified in "data_order_column"
are used for local ordering.
nulls_first:
Optional Argument.
Specifies a boolean value to determine whether NULLS are listed
first or last during ordering.
This argument is ignored, if "data_order_column" is None.
NULLS are listed first when this argument is set to 'True', and
last when set to 'False'.
Default value: True
Types: bool
sort_ascending:
Optional Argument.
Specifies a boolean value to determine if the result set is to
be sorted on the "data_order_column" column in ascending or
descending order.
The sorting is ascending when this argument is set to 'True',
and descending when set to 'False'.
This argument is ignored, if "data_order_column" is None.
Default Value: True
Types: bool
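A minimal sketch combining the partitioning and ordering
arguments above (column choices are illustrative and refer to
the 'admissions_train' data used in the EXAMPLES section):
>>> res = df.apply(increase_gpa, env_name='testenv',
...                data_partition_column='masters',
...                data_order_column='gpa',
...                sort_ascending=False,
...                nulls_first=False)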
style:
Optional Argument.
Specifies how input is passed to and output is generated by
the 'apply_command'.
Note:
This argument supports only the value 'csv' for Apply.
Default value: "csv"
Types: str
RETURNS:
teradataml DataFrame.
RAISES:
TypeError, TeradataMlException.
EXAMPLES:
# Create a Python 3.7.9 environment with the given name and description in Vantage.
>>> env = create_env('testenv', 'python_3.7.9', 'Test environment')
User environment testenv created.
# Packages dill and pandas must be installed in remote user environment.
>>> env.install_lib(['pandas','dill'])
Request to install libraries initiated successfully in the remote user environment testenv. Check the status using status() with the claim id 'ef255030-1be2-4d4a-9d47-12cd4365a003'.
# Check the status of installation.
>>> env.status('ef255030-1be2-4d4a-9d47-12cd4365a003')
Claim Id File/Libs Method Name Stage Timestamp Additional Details
0 ef255030-1be2-4d4a-9d47-12cd4365a003 pandas, dill install_lib Started 2022-08-04T04:27:56Z
1 ef255030-1be2-4d4a-9d47-12cd4365a003 pandas, dill install_lib Finished 2022-08-04T04:29:12Z
>>>
# This example uses the 'admissions_train' dataset to increase
# the 'gpa' by a given percentage.
# Load the example data.
>>> load_example_data("dataframe", "admissions_train")
>>> df = DataFrame('admissions_train')
>>> print(df)
masters gpa stats programming admitted
id
22 yes 3.46 Novice Beginner 0
36 no 3.00 Advanced Novice 0
15 yes 4.00 Advanced Advanced 1
38 yes 2.65 Advanced Beginner 1
5 no 3.44 Novice Novice 0
17 no 3.83 Advanced Advanced 1
34 yes 3.85 Advanced Beginner 0
13 no 4.00 Advanced Novice 1
26 yes 3.57 Advanced Advanced 1
19 yes 1.98 Advanced Advanced 0
# Example 1:
# Create the user defined function to increase the 'gpa' by the
# percentage provided. Note that both the input to and the output
# from the function are Pandas Series objects.
>>> def increase_gpa(row, p=20):
... row['gpa'] = row['gpa'] + row['gpa'] * p/100
... return row
...
>>>
# Apply the user defined function to the DataFrame.
# Note that since the output of the user defined function
# expects the same columns with the same types, we can skip
# passing the 'returns' argument.
>>> increase_gpa_20 = df.apply(increase_gpa, env_name='testenv')
>>>
>>> # Print the result.
>>> print(increase_gpa_20)
masters gpa stats programming admitted
id
22 yes 4.152 Novice Beginner 0
36 no 3.600 Advanced Novice 0
15 yes 4.800 Advanced Advanced 1
38 yes 3.180 Advanced Beginner 1
5 no 4.128 Novice Novice 0
17 no 4.596 Advanced Advanced 1
34 yes 4.620 Advanced Beginner 0
13 no 4.800 Advanced Novice 1
26 yes 4.284 Advanced Advanced 1
19 yes 2.376 Advanced Advanced 0
# Example 2:
# Use the same user defined function with a lambda notation to
# pass the percentage, 'p = 40'.
>>> increase_gpa_40 = df.apply(lambda row: increase_gpa(row,
... p = 40),
... env_name='testenv')
>>>
>>> print(increase_gpa_40)
masters gpa stats programming admitted
id
22 yes 4.844 Novice Beginner 0
36 no 4.200 Advanced Novice 0
15 yes 5.600 Advanced Advanced 1
38 yes 3.710 Advanced Beginner 1
5 no 4.816 Novice Novice 0
17 no 5.362 Advanced Advanced 1
34 yes 5.390 Advanced Beginner 0
13 no 5.600 Advanced Novice 1
26 yes 4.998 Advanced Advanced 1
19 yes 2.772 Advanced Advanced 0
# Example 3:
# Use the same user defined function with functools.partial to
# pass the percentage, 'p = 50'.
>>> from functools import partial
>>> increase_gpa_50 = df.apply(partial(increase_gpa, p = 50),
... env_name='testenv')
>>>
>>> print(increase_gpa_50)
masters gpa stats programming admitted
id
13 no 6.000 Advanced Novice 1
26 yes 5.355 Advanced Advanced 1
5 no 5.160 Novice Novice 0
19 yes 2.970 Advanced Advanced 0
15 yes 6.000 Advanced Advanced 1
40 yes 5.925 Novice Beginner 0
7 yes 3.495 Novice Novice 1
22 yes 5.190 Novice Beginner 0
36 no 4.500 Advanced Novice 0
38 yes 3.975 Advanced Beginner 1
# Example 4:
# Use a lambda function to double the 'gpa', and
# return a numpy ndarray.
>>> from numpy import asarray
>>> inc_gpa_lambda = lambda row, p=20: asarray([row['id'],
... row['masters'],
... row['gpa'] + row['gpa'] * p/100,
... row['stats'],
... row['programming'],
... row['admitted']])
>>> increase_gpa_100 = df.apply(lambda row: inc_gpa_lambda(row,
... p=100),
... env_name='testenv')
>>>
>>> print(increase_gpa_100)
masters gpa stats programming admitted
id
13 no 8.00 Advanced Novice 1
26 yes 7.14 Advanced Advanced 1
5 no 6.88 Novice Novice 0
19 yes 3.96 Advanced Advanced 0
15 yes 8.00 Advanced Advanced 1
40 yes 7.90 Novice Beginner 0
7 yes 4.66 Novice Novice 1
22 yes 6.92 Novice Beginner 0
36 no 6.00 Advanced Novice 0
38 yes 5.30 Advanced Beginner 1
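# Optional cleanup: the remote user environment created for these
# examples can be removed with teradataml's remove_env function
# (shown as a sketch; not part of apply itself).
>>> remove_env('testenv')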