Teradata® Package for Python Function Reference - 20.00
- Product: Teradata Package for Python
- Release Number: 20.00.00.03
- Published: December 2024
teradataml.dataframe.dataframe.DataFrame.assign = assign(self, drop_columns=False, **kwargs)
DESCRIPTION:
Assign new columns to a teradataml DataFrame.
PARAMETERS:
drop_columns:
Optional Argument.
If True, drop columns that are not specified in assign.
Notes:
1. When DataFrame.assign() is run on DataFrame.groupby(), this argument
is ignored. In such cases, all columns are dropped and only new columns
and grouping columns are returned.
2. Argument is ignored for UDF functions.
Default Value: False
Types: bool
kwargs:
Specifies keyword-value pairs.
- keywords are the column names.
- values can be:
* Column arithmetic expressions.
* int/float/string literals.
* DataFrameColumn a.k.a. ColumnExpression Functions.
(See DataFrameColumn Functions in the Function Reference Guide for more
details)
* SQLAlchemy ClauseElements.
(See teradataml extension with SQLAlchemy in teradataml User Guide
and Function reference guide for more details)
* Function - udf, call_udf.
RETURNS:
teradataml DataFrame
A new DataFrame is returned with:
1. New columns in addition to all the existing columns if "drop_columns" is False.
2. Only new columns if "drop_columns" is True.
3. New columns in addition to group by columns, i.e., columns used for grouping,
if assign() is run on DataFrame.groupby().
NOTES:
1. The values in kwargs cannot be callable (functions).
2. The original DataFrame is not modified.
3. Since ``kwargs`` is a dictionary, the order of your
arguments may not be preserved. To make things predictable,
the columns are inserted in alphabetical order, after the existing columns
in the DataFrame. Assigning multiple columns within the same ``assign`` is
possible, but you cannot reference other columns created within the same
``assign`` call.
4. The maximum number of columns that a DataFrame can have is 2048.
5. With DataFrame.groupby(), it is advisable to use only aggregate functions
and literal values. Other functions, such as string functions, can also be
used, but the column used in such a function must be part of the group by columns.
See examples for teradataml extension with SQLAlchemy on using various
functions with DataFrame.assign().
6. UDF expressions can run on both Vantage Cloud Lake leveraging Apply Table Operator
of Open Analytics Framework and Enterprise leveraging Vantage's Script Table Operator.
7. One can pass both regular column expressions and UDF expressions to this API.
However, regular column expressions are computed first, followed by UDF expressions.
Hence, the columns are also maintained in that order.
See Example 18 to understand more.
8. While passing multiple UDF expressions, one cannot use the output of one
column as the input of another column in the same ``assign`` call.
9. If multiple UDF expressions are passed, the delimiter and quotechar specified in
the last UDF expression are used for processing.
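The alphabetical insertion rule in note 3 above can be sketched in plain Python (a local illustration only, no Vantage connection needed; the column names `zeta` and `alpha` are hypothetical): new columns land after the existing columns, sorted alphabetically by keyword name.

```python
# Illustrative only: mimics the column ordering described in note 3.
existing = ["masters", "gpa", "stats", "programming", "admitted"]
new_kwargs = {"zeta": 1, "alpha": 2}  # hypothetical assign() keywords

# New columns are appended after existing ones, alphabetically by name.
result_order = existing + sorted(new_kwargs)
print(result_order)
# ['masters', 'gpa', 'stats', 'programming', 'admitted', 'alpha', 'zeta']
```

This matches the output of Example 10 below, where the assigned columns appear as addc, divc, mulc, subc.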
RAISES:
1. ValueError - When a callable is passed as a value, or columns from different
DataFrames are passed as values in kwargs.
2. TeradataMlException - When the return DataFrame initialization fails, or
invalid argument types are passed.
EXAMPLES:
>>> load_example_data("dataframe", "admissions_train")
>>> df = DataFrame("admissions_train")
>>> c1 = df.gpa
>>> c2 = df.id
>>>
#
# Executing assign() with Arithmetic operations on columns.
# All below examples use columns "gpa" and "id" for
# arithmetic operations to create a new DataFrame including the new columns.
#
# Let's take a look at various operations that can be performed
# using assign() and arithmetic operations on columns.
# Example 1: Addition of two columns "gpa" and "id".
>>> df.assign(new_column = c1 + c2).sort("id")
masters gpa stats programming admitted new_column
id
1 yes 3.95 Beginner Beginner 0 4.95
2 yes 3.76 Beginner Beginner 0 5.76
3 no 3.70 Novice Beginner 1 6.70
4 yes 3.50 Beginner Novice 1 7.50
5 no 3.44 Novice Novice 0 8.44
6 yes 3.50 Beginner Advanced 1 9.50
7 yes 2.33 Novice Novice 1 9.33
8 no 3.60 Beginner Advanced 1 11.60
9 no 3.82 Advanced Advanced 1 12.82
10 no 3.71 Advanced Advanced 1 13.71
>>>
# Example 2: Multiplication of columns "gpa" and "id".
>>> df.assign(new_column = c1 * c2).sort("id")
masters gpa stats programming admitted new_column
id
1 yes 3.95 Beginner Beginner 0 3.95
2 yes 3.76 Beginner Beginner 0 7.52
3 no 3.70 Novice Beginner 1 11.10
4 yes 3.50 Beginner Novice 1 14.00
5 no 3.44 Novice Novice 0 17.20
6 yes 3.50 Beginner Advanced 1 21.00
7 yes 2.33 Novice Novice 1 16.31
8 no 3.60 Beginner Advanced 1 28.80
9 no 3.82 Advanced Advanced 1 34.38
10 no 3.71 Advanced Advanced 1 37.10
>>>
# Example 3: Division of columns. Divide "id" by "gpa".
>>> df.assign(new_column = c2 / c1).sort("id")
masters gpa stats programming admitted new_column
id
1 yes 3.95 Beginner Beginner 0 0.253165
2 yes 3.76 Beginner Beginner 0 0.531915
3 no 3.70 Novice Beginner 1 0.810811
4 yes 3.50 Beginner Novice 1 1.142857
5 no 3.44 Novice Novice 0 1.453488
6 yes 3.50 Beginner Advanced 1 1.714286
7 yes 2.33 Novice Novice 1 3.004292
8 no 3.60 Beginner Advanced 1 2.222222
9 no 3.82 Advanced Advanced 1 2.356021
10 no 3.71 Advanced Advanced 1 2.695418
>>>
# Example 4: Subtract values in column "id" from "gpa".
>>> df.assign(new_column = c1 - c2).sort("id")
masters gpa stats programming admitted new_column
id
1 yes 3.95 Beginner Beginner 0 2.95
2 yes 3.76 Beginner Beginner 0 1.76
3 no 3.70 Novice Beginner 1 0.70
4 yes 3.50 Beginner Novice 1 -0.50
5 no 3.44 Novice Novice 0 -1.56
6 yes 3.50 Beginner Advanced 1 -2.50
7 yes 2.33 Novice Novice 1 -4.67
8 no 3.60 Beginner Advanced 1 -4.40
9 no 3.82 Advanced Advanced 1 -5.18
10 no 3.71 Advanced Advanced 1 -6.29
>>>
# Example 5: Modulo division of values in column "id" and "gpa".
>>> df.assign(new_column = c2 % c1).sort("id")
masters gpa stats programming admitted new_column
id
1 yes 3.95 Beginner Beginner 0 1.00
2 yes 3.76 Beginner Beginner 0 2.00
3 no 3.70 Novice Beginner 1 3.00
4 yes 3.50 Beginner Novice 1 0.50
5 no 3.44 Novice Novice 0 1.56
6 yes 3.50 Beginner Advanced 1 2.50
7 yes 2.33 Novice Novice 1 0.01
8 no 3.60 Beginner Advanced 1 0.80
9 no 3.82 Advanced Advanced 1 1.36
10 no 3.71 Advanced Advanced 1 2.58
>>>
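The modulo results in Example 5 follow Python's float semantics, so a few rows can be sanity-checked locally with plain Python (no Vantage connection needed):

```python
# Sanity-check a few id % gpa values from Example 5 using plain floats.
pairs = [(1, 3.95), (4, 3.50), (7, 2.33)]
for i, gpa in pairs:
    print(i, round(i % gpa, 2))
# 1 1.0
# 4 0.5
# 7 0.01
```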
#
# Executing assign() with literal values.
#
# Example 6: Adding an integer literal value to the values of columns "gpa" and "id".
>>> df.assign(c3 = c1 + 1, c4 = c2 + 1).sort("id")
masters gpa stats programming admitted c3 c4
id
1 yes 3.95 Beginner Beginner 0 4.95 2
2 yes 3.76 Beginner Beginner 0 4.76 3
3 no 3.70 Novice Beginner 1 4.70 4
4 yes 3.50 Beginner Novice 1 4.50 5
5 no 3.44 Novice Novice 0 4.44 6
6 yes 3.50 Beginner Advanced 1 4.50 7
7 yes 2.33 Novice Novice 1 3.33 8
8 no 3.60 Beginner Advanced 1 4.60 9
9 no 3.82 Advanced Advanced 1 4.82 10
10 no 3.71 Advanced Advanced 1 4.71 11
>>>
# Example 7: Create a new column with an integer literal value.
>>> df.assign(c1 = 1).sort("id")
masters gpa stats programming admitted c1
id
1 yes 3.95 Beginner Beginner 0 1
2 yes 3.76 Beginner Beginner 0 1
3 no 3.70 Novice Beginner 1 1
4 yes 3.50 Beginner Novice 1 1
5 no 3.44 Novice Novice 0 1
6 yes 3.50 Beginner Advanced 1 1
7 yes 2.33 Novice Novice 1 1
8 no 3.60 Beginner Advanced 1 1
9 no 3.82 Advanced Advanced 1 1
10 no 3.71 Advanced Advanced 1 1
>>>
# Example 8: Create a new column with a string literal value.
>>> df.assign(c3 = 'string').sort("id")
masters gpa stats programming admitted c3
id
1 yes 3.95 Beginner Beginner 0 string
2 yes 3.76 Beginner Beginner 0 string
3 no 3.70 Novice Beginner 1 string
4 yes 3.50 Beginner Novice 1 string
5 no 3.44 Novice Novice 0 string
6 yes 3.50 Beginner Advanced 1 string
7 yes 2.33 Novice Novice 1 string
8 no 3.60 Beginner Advanced 1 string
9 no 3.82 Advanced Advanced 1 string
10 no 3.71 Advanced Advanced 1 string
>>>
# Example 9: Concatenation of strings, a string literal and value from
# "masters" column.
# '+' operator is overridden for string columns.
>>> df.assign(concatenated = "Completed? " + df.masters).sort("id")
masters gpa stats programming admitted concatenated
id
1 yes 3.95 Beginner Beginner 0 Completed? yes
2 yes 3.76 Beginner Beginner 0 Completed? yes
3 no 3.70 Novice Beginner 1 Completed? no
4 yes 3.50 Beginner Novice 1 Completed? yes
5 no 3.44 Novice Novice 0 Completed? no
6 yes 3.50 Beginner Advanced 1 Completed? yes
7 yes 2.33 Novice Novice 1 Completed? yes
8 no 3.60 Beginner Advanced 1 Completed? no
9 no 3.82 Advanced Advanced 1 Completed? no
10 no 3.71 Advanced Advanced 1 Completed? no
>>>
#
# Significance of "drop_columns" in assign().
# Setting drop_columns to True will only return assigned expressions.
#
# Example 10: Drop all columns and return only the newly assigned expressions.
>>> df.assign(drop_columns = True,
... addc = c1 + c2,
... subc = c1 - c2,
... mulc = c1 * c2,
... divc = c1/c2).sort("addc")
addc divc mulc subc
0 4.95 3.950000 3.95 2.95
1 5.76 1.880000 7.52 1.76
2 6.70 1.233333 11.10 0.70
3 7.50 0.875000 14.00 -0.50
4 8.44 0.688000 17.20 -1.56
5 9.33 0.332857 16.31 -4.67
6 9.50 0.583333 21.00 -2.50
7 11.60 0.450000 28.80 -4.40
8 12.82 0.424444 34.38 -5.18
9 13.71 0.371000 37.10 -6.29
>>>
# Example 11: Duplicate a column with a new name.
# In the example here, we are duplicating:
# 1. Column "id" to new column "c1".
# 2. Column "gpa" to new column "c2".
>>> df.assign(c1 = c2, c2 = c1).sort("id")
masters gpa stats programming admitted c1 c2
id
1 yes 3.95 Beginner Beginner 0 1 3.95
2 yes 3.76 Beginner Beginner 0 2 3.76
3 no 3.70 Novice Beginner 1 3 3.70
4 yes 3.50 Beginner Novice 1 4 3.50
5 no 3.44 Novice Novice 0 5 3.44
6 yes 3.50 Beginner Advanced 1 6 3.50
7 yes 2.33 Novice Novice 1 7 2.33
8 no 3.60 Beginner Advanced 1 8 3.60
9 no 3.82 Advanced Advanced 1 9 3.82
10 no 3.71 Advanced Advanced 1 10 3.71
>>>
# Example 12: Renaming columns.
# The Example 11 command can be modified a bit to rename columns, rather than
# duplicating them.
# Use "drop_columns=True" with the Example 11 command and select all the desired columns.
>>> df.assign(drop_columns=True, c1 = c2, c2 = c1,
... masters=df.masters,
... stats=df.stats,
... programming=df.programming,
... admitted=df.admitted).sort("c1")
masters stats programming admitted c1 c2
0 yes Beginner Beginner 0 1 3.95
1 yes Beginner Beginner 0 2 3.76
2 no Novice Beginner 1 3 3.70
3 yes Beginner Novice 1 4 3.50
4 no Novice Novice 0 5 3.44
5 yes Beginner Advanced 1 6 3.50
6 yes Novice Novice 1 7 2.33
7 no Beginner Advanced 1 8 3.60
8 no Advanced Advanced 1 9 3.82
9 no Advanced Advanced 1 10 3.71
>>>
#
# Executing Aggregate Functions with assign() and DataFrame.groupby().
#
# Here, we will be using 'func' from sqlalchemy to run some aggregate functions.
>>> from sqlalchemy import func
>>>
# Example 13: Calculate average "gpa" for values in the "stats" column.
>>> df.groupby("stats").assign(res=func.ave(df.gpa.expression))
stats res
0 Beginner 3.662000
1 Advanced 3.508750
2 Novice 3.559091
>>>
# Example 14: Calculate average, kurtosis value and sum of values in
# the "gpa" column with values grouped by values in the "stats" column.
# This is an alternate approach to DataFrame.agg(), and it allows the user
# to name the result columns.
>>> df.groupby("stats").assign(gpa_std_=func.ave(df.gpa.expression),
... gpa_kurtosis_=func.kurtosis(df.gpa.expression),
... gpa_sum_=func.sum(df.gpa.expression))
stats gpa_kurtosis_ gpa_std_ gpa_sum_
0 Beginner -0.452859 3.662000 18.31
1 Advanced 2.886226 3.508750 84.21
2 Novice 6.377775 3.559091 39.15
>>>
#
# Executing user defined function (UDF) with assign()
#
# Note: The UDF examples below operate on the "sales" dataset.
>>> load_example_data("dataframe", "sales")
>>> df = DataFrame("sales")
>>>
# Example 15: Create two user defined functions, 'to_upper' and 'sum':
# 'to_upper' converts the values in 'accounts' to upper case, and
# 'sum' adds the length of the string values in column 'accounts'
# to column 'Feb' and stores the result in an Integer type column.
>>> @udf
... def to_upper(s):
... if s is not None:
... return s.upper()
>>>
>>> from teradatasqlalchemy.types import INTEGER
>>> @udf(returns=INTEGER())
... def sum(x, y):
... return len(x)+y
>>>
# Assign both Column Expressions returned by user defined functions
# to the DataFrame.
>>> res = df.assign(upper_stats = to_upper('accounts'), len_sum = sum('accounts', 'Feb'))
>>> res
Feb Jan Mar Apr datetime upper_stats len_sum
accounts
Blue Inc 90.0 50.0 95.0 101.0 17/01/04 BLUE INC 98
Red Inc 200.0 150.0 140.0 NaN 17/01/04 RED INC 207
Yellow Inc 90.0 NaN NaN NaN 17/01/04 YELLOW INC 100
Jones LLC 200.0 150.0 140.0 180.0 17/01/04 JONES LLC 209
Orange Inc 210.0 NaN NaN 250.0 17/01/04 ORANGE INC 220
Alpha Co 210.0 200.0 215.0 250.0 17/01/04 ALPHA CO 218
>>>
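The row-wise logic the 'sum' UDF applies can be checked locally with plain Python (no Vantage connection needed; `len_plus` is a hypothetical stand-in for the UDF body):

```python
# Local check of the 'sum' UDF's row-wise logic:
# length of the account name plus the 'Feb' value.
def len_plus(account, feb):
    return len(account) + int(feb)

print(len_plus("Blue Inc", 90.0))   # 98, as in the 'len_sum' column above
print(len_plus("Alpha Co", 210.0))  # 218
```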
# Example 16: Create a user defined function to add 4 days to the 'datetime' column
# and store the result in a DATE type column.
>>> from teradatasqlalchemy.types import DATE
>>> import datetime
>>> @udf(returns=DATE())
... def add_date(x, y):
... return (datetime.datetime.strptime(x, "%y/%m/%d")+datetime.timedelta(y)).strftime("%y/%m/%d")
>>>
# Assign the Column Expression returned by user defined function
# to the DataFrame.
>>> res = df.assign(new_date = add_date('datetime', 4))
>>> res
Feb Jan Mar Apr datetime new_date
accounts
Alpha Co 210.0 200.0 215.0 250.0 17/01/04 17/01/08
Blue Inc 90.0 50.0 95.0 101.0 17/01/04 17/01/08
Jones LLC 200.0 150.0 140.0 180.0 17/01/04 17/01/08
Orange Inc 210.0 NaN NaN 250.0 17/01/04 17/01/08
Yellow Inc 90.0 NaN NaN NaN 17/01/04 17/01/08
Red Inc 200.0 150.0 140.0 NaN 17/01/04 17/01/08
>>>
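The date arithmetic inside add_date is plain Python, so it can be verified locally without a Vantage connection:

```python
import datetime

# Local check of the date arithmetic used by the add_date UDF above.
x, y = "17/01/04", 4
shifted = (datetime.datetime.strptime(x, "%y/%m/%d")
           + datetime.timedelta(y)).strftime("%y/%m/%d")
print(shifted)  # 17/01/08
```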
# Example 17: Create a user defined function 'to_upper' to convert
# the values in 'accounts' to upper case, and create a
# new column with a string literal value.
>>> @udf
... def to_upper(s):
... if s is not None:
... return s.upper()
>>>
# Assign both expressions to the DataFrame.
>>> res = df.assign(upper_stats = to_upper('accounts'), new_col = 'string')
>>> res
Feb Jan Mar Apr datetime new_col upper_stats
accounts
Alpha Co 210.0 200.0 215.0 250.0 17/01/04 string ALPHA CO
Blue Inc 90.0 50.0 95.0 101.0 17/01/04 string BLUE INC
Yellow Inc 90.0 NaN NaN NaN 17/01/04 string YELLOW INC
Jones LLC 200.0 150.0 140.0 180.0 17/01/04 string JONES LLC
Red Inc 200.0 150.0 140.0 NaN 17/01/04 string RED INC
Orange Inc 210.0 NaN NaN 250.0 17/01/04 string ORANGE INC
>>>
# Example 18: Create two user defined functions, 'to_upper' and 'sum',
# and create new columns with a string literal value and an
# arithmetic operation on column 'Feb'.
>>> @udf
... def to_upper(s):
... if s is not None:
... return s.upper()
>>>
>>> from teradatasqlalchemy.types import INTEGER
>>> @udf(returns=INTEGER())
... def sum(x, y):
... return len(x)+y
>>>
# Assign all expressions to the DataFrame.
>>> res = df.assign(upper_stats = to_upper('accounts'),new_col = 'abc',
... len_sum = sum('accounts', 'Feb'), col_sum = df.Feb+1)
>>> res
Feb Jan Mar Apr datetime col_sum new_col upper_stats len_sum
accounts
Blue Inc 90.0 50.0 95.0 101.0 17/01/04 91.0 abc BLUE INC 98
Alpha Co 210.0 200.0 215.0 250.0 17/01/04 211.0 abc ALPHA CO 218
Jones LLC 200.0 150.0 140.0 180.0 17/01/04 201.0 abc JONES LLC 209
Yellow Inc 90.0 NaN NaN NaN 17/01/04 91.0 abc YELLOW INC 100
Orange Inc 210.0 NaN NaN 250.0 17/01/04 211.0 abc ORANGE INC 220
Red Inc 200.0 150.0 140.0 NaN 17/01/04 201.0 abc RED INC 207
>>>
# Example 19: Convert the values in the 'accounts' column to upper case using a user
# defined function on Vantage Cloud Lake.
# Create a Python 3.10.5 environment with the given name and description in Vantage.
>>> env = create_env('test_udf', 'python_3.10.5', 'Test environment for UDF')
User environment 'test_udf' created.
>>>
# Create a user defined function 'to_upper' to get the values in upper case
# and pass the user env to run it on.
>>> from teradataml.dataframe.functions import udf
>>> @udf(env_name = env)
... def to_upper(s):
... if s is not None:
... return s.upper()
>>>
# Assign the Column Expression returned by user defined function
# to the DataFrame.
>>> df.assign(upper_stats = to_upper('accounts'))
Feb Jan Mar Apr datetime upper_stats
accounts
Alpha Co 210.0 200.0 215.0 250.0 17/01/04 ALPHA CO
Blue Inc 90.0 50.0 95.0 101.0 17/01/04 BLUE INC
Yellow Inc 90.0 NaN NaN NaN 17/01/04 YELLOW INC
Jones LLC 200.0 150.0 140.0 180.0 17/01/04 JONES LLC
Orange Inc 210.0 NaN NaN 250.0 17/01/04 ORANGE INC
Red Inc 200.0 150.0 140.0 NaN 17/01/04 RED INC
>>>
# Example 20: Register and call the user defined function to get the values in upper case.
>>> from teradataml.dataframe.functions import udf, register, call_udf
>>> @udf
... def to_upper(s):
... if s is not None:
... return s.upper()
>>>
# Register the created user defined function with name "upper".
>>> register("upper", to_upper)
>>>
# Call the user defined function registered with name "upper" and assign the
# ColumnExpression returned to the DataFrame.
>>> res = df.assign(upper_col = call_udf("upper", ('accounts',)))
>>> res
Feb Jan Mar Apr datetime upper_col
accounts
Alpha Co 210.0 200.0 215.0 250.0 17/01/04 ALPHA CO
Blue Inc 90.0 50.0 95.0 101.0 17/01/04 BLUE INC
Yellow Inc 90.0 NaN NaN NaN 17/01/04 YELLOW INC
Jones LLC 200.0 150.0 140.0 180.0 17/01/04 JONES LLC
Orange Inc 210.0 NaN NaN 250.0 17/01/04 ORANGE INC
Red Inc 200.0 150.0 140.0 NaN 17/01/04 RED INC
>>>
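A note on the None guard used in every to_upper UDF above: SQL NULLs reach the Python function as None, and calling .upper() on None would raise. The guard can be exercised locally with plain Python (no Vantage connection needed):

```python
# Same body as the to_upper UDFs above, minus the @udf decorator.
def to_upper(s):
    if s is not None:
        return s.upper()

print(to_upper("Red Inc"))  # RED INC
print(to_upper(None))       # None (NULL inputs pass through unchanged)
```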