Teradata Package for Python Function Reference | 17.10 - assign - Teradata Package for Python
Teradata® Package for Python Function Reference
- Product
- Teradata Package for Python
- Release Number
- 17.10
- Published
- April 2022
- Language
- English (United States)
- Last Update
- 2022-08-19
- Product Category
- Teradata Vantage
- teradataml.dataframe.dataframe.DataFrame.assign = assign(self, drop_columns=False, **kwargs)
- DESCRIPTION:
Assign new columns to a teradataml DataFrame.
PARAMETERS:
drop_columns:
Optional Argument.
If True, drop columns that are not specified in assign.
Note:
When DataFrame.assign() is run on DataFrame.groupby(), this argument
is ignored. In such cases, all columns are dropped and only new columns
and grouping columns are returned.
Default Value: False
Types: bool
kwargs:
Specifies keyword-value pairs.
- keywords are the column names.
- values can be:
* Column arithmetic expressions.
* int/float/string literals.
* DataFrameColumn a.k.a. ColumnExpression Functions.
(Seee DataFrameColumn Functions in Function reference guide for more
details)
* SQLAlchemy ClauseElements.
(See teradataml extension with SQLAlchemy in teradataml User Guide
and Function reference guide for more details)
RETURNS:
teradataml DataFrame
A new DataFrame is returned with:
1. New columns in addition to all the existing columns if "drop_columns" is False.
2. Only new columns if "drop_columns" is True.
3. New columns in addition to group by columns, i.e., columns used for grouping,
if assign() is run on DataFrame.groupby().
NOTES:
1. The values in kwargs cannot be callable (functions).
2. The original DataFrame is not modified.
3. Since ``kwargs`` is a dictionary, the order of your
arguments may not be preserved. To make things predictable,
the columns are inserted in alphabetical order, after the existing columns
in the DataFrame. Assigning multiple columns within the same ``assign`` is
possible, but you cannot reference other columns created within the same
``assign`` call.
4. The maximum number of columns that a DataFrame can have is 2048.
5. With DataFrame.groupby(), only aggregate functions and literal values
are advised to use. Other functions, such as string functions, can also be
used, but the column used in such function must be a part of group by columns.
See examples for teradataml extension with SQLAlchemy on using various
functions with DataFrame.assign().
RAISES:
1. ValueError - When a callable is passed as a value, or columns from different
DataFrames are passed as values in kwargs.
2. TeradataMlException - When the return DataFrame initialization fails, or
invalid argument types are passed.
EXAMPLES:
>>> load_example_data("dataframe", "admissions_train")
>>> df = DataFrame("admissions_train")
>>> c1 = df.gpa
>>> c2 = df.id
>>>
#
# Executing assign() with Arithmetic operations on columns.
# All below examples use columns "gpa" and "id" for
# arithmetic operations to create a new DataFrame including the new columns.
#
# Let's take a look at various operations that can be performed
# using assign() and arithmetic operations on columns.
# Example 1: Addition of two columns "gpa" and "id".
>>> df.assign(new_column = c1 + c2).sort("id")
masters gpa stats programming admitted new_column
id
1 yes 3.95 Beginner Beginner 0 4.95
2 yes 3.76 Beginner Beginner 0 5.76
3 no 3.70 Novice Beginner 1 6.70
4 yes 3.50 Beginner Novice 1 7.50
5 no 3.44 Novice Novice 0 8.44
6 yes 3.50 Beginner Advanced 1 9.50
7 yes 2.33 Novice Novice 1 9.33
8 no 3.60 Beginner Advanced 1 11.60
9 no 3.82 Advanced Advanced 1 12.82
10 no 3.71 Advanced Advanced 1 13.71
>>>
# Example 2: Multiplication of columns "gpa" and "id".
>>> df.assign(new_column = c1 * c2).sort("id")
masters gpa stats programming admitted new_column
id
1 yes 3.95 Beginner Beginner 0 3.95
2 yes 3.76 Beginner Beginner 0 7.52
3 no 3.70 Novice Beginner 1 11.10
4 yes 3.50 Beginner Novice 1 14.00
5 no 3.44 Novice Novice 0 17.20
6 yes 3.50 Beginner Advanced 1 21.00
7 yes 2.33 Novice Novice 1 16.31
8 no 3.60 Beginner Advanced 1 28.80
9 no 3.82 Advanced Advanced 1 34.38
10 no 3.71 Advanced Advanced 1 37.10
>>>
# Example 3: Division of columns. Divide "id" by "gpa".
>>> df.assign(new_column = c2 / c1).sort("id")
masters gpa stats programming admitted new_column
id
1 yes 3.95 Beginner Beginner 0 0.253165
2 yes 3.76 Beginner Beginner 0 0.531915
3 no 3.70 Novice Beginner 1 0.810811
4 yes 3.50 Beginner Novice 1 1.142857
5 no 3.44 Novice Novice 0 1.453488
6 yes 3.50 Beginner Advanced 1 1.714286
7 yes 2.33 Novice Novice 1 3.004292
8 no 3.60 Beginner Advanced 1 2.222222
9 no 3.82 Advanced Advanced 1 2.356021
10 no 3.71 Advanced Advanced 1 2.695418
>>>
# Example 4: Subtract values in column "id" from "gpa".
>>> df.assign(new_column = c1 - c2).sort("id")
masters gpa stats programming admitted new_column
id
1 yes 3.95 Beginner Beginner 0 2.95
2 yes 3.76 Beginner Beginner 0 1.76
3 no 3.70 Novice Beginner 1 0.70
4 yes 3.50 Beginner Novice 1 -0.50
5 no 3.44 Novice Novice 0 -1.56
6 yes 3.50 Beginner Advanced 1 -2.50
7 yes 2.33 Novice Novice 1 -4.67
8 no 3.60 Beginner Advanced 1 -4.40
9 no 3.82 Advanced Advanced 1 -5.18
10 no 3.71 Advanced Advanced 1 -6.29
# Example 5: Modulo division of values in column "id" and "gpa".
>>> df.assign(new_column = c2 % c1).sort("id")
masters gpa stats programming admitted new_column
id
1 yes 3.95 Beginner Beginner 0 1.00
2 yes 3.76 Beginner Beginner 0 2.00
3 no 3.70 Novice Beginner 1 3.00
4 yes 3.50 Beginner Novice 1 0.50
5 no 3.44 Novice Novice 0 1.56
6 yes 3.50 Beginner Advanced 1 2.50
7 yes 2.33 Novice Novice 1 0.01
8 no 3.60 Beginner Advanced 1 0.80
9 no 3.82 Advanced Advanced 1 1.36
10 no 3.71 Advanced Advanced 1 2.58
>>>
#
# Executing assign() with literal values.
#
# Example 6: Adding an integer literal value to the values of columns "gpa" and "id".
>>> df.assign(c3 = c1 + 1, c4 = c2 + 1).sort("id")
masters gpa stats programming admitted c3 c4
id
1 yes 3.95 Beginner Beginner 0 4.95 2
2 yes 3.76 Beginner Beginner 0 4.76 3
3 no 3.70 Novice Beginner 1 4.70 4
4 yes 3.50 Beginner Novice 1 4.50 5
5 no 3.44 Novice Novice 0 4.44 6
6 yes 3.50 Beginner Advanced 1 4.50 7
7 yes 2.33 Novice Novice 1 3.33 8
8 no 3.60 Beginner Advanced 1 4.60 9
9 no 3.82 Advanced Advanced 1 4.82 10
10 no 3.71 Advanced Advanced 1 4.71 11
>>>
# Example 7: Create a new column with an integer literal value.
>>> df.assign(c1 = 1).sort("id")
masters gpa stats programming admitted c1
id
1 yes 3.95 Beginner Beginner 0 1
2 yes 3.76 Beginner Beginner 0 1
3 no 3.70 Novice Beginner 1 1
4 yes 3.50 Beginner Novice 1 1
5 no 3.44 Novice Novice 0 1
6 yes 3.50 Beginner Advanced 1 1
7 yes 2.33 Novice Novice 1 1
8 no 3.60 Beginner Advanced 1 1
9 no 3.82 Advanced Advanced 1 1
10 no 3.71 Advanced Advanced 1 1
>>>
# Example 8: Create a new column with a string literal value.
>>> df.assign(c3 = 'string').sort("id")
masters gpa stats programming admitted c3
id
1 yes 3.95 Beginner Beginner 0 string
2 yes 3.76 Beginner Beginner 0 string
3 no 3.70 Novice Beginner 1 string
4 yes 3.50 Beginner Novice 1 string
5 no 3.44 Novice Novice 0 string
6 yes 3.50 Beginner Advanced 1 string
7 yes 2.33 Novice Novice 1 string
8 no 3.60 Beginner Advanced 1 string
9 no 3.82 Advanced Advanced 1 string
10 no 3.71 Advanced Advanced 1 string
>>>
# Example 9: Concatenation of strings, a string literal and value from
# "masters" column.
# '+' operator is overridden for string columns.
>>> df.assign(concatenated = "Completed? " + df.masters).sort("id")
masters gpa stats programming admitted concatenated
id
1 yes 3.95 Beginner Beginner 0 Completed? yes
2 yes 3.76 Beginner Beginner 0 Completed? yes
3 no 3.70 Novice Beginner 1 Completed? no
4 yes 3.50 Beginner Novice 1 Completed? yes
5 no 3.44 Novice Novice 0 Completed? no
6 yes 3.50 Beginner Advanced 1 Completed? yes
7 yes 2.33 Novice Novice 1 Completed? yes
8 no 3.60 Beginner Advanced 1 Completed? no
9 no 3.82 Advanced Advanced 1 Completed? no
10 no 3.71 Advanced Advanced 1 Completed? no
>>>
#
# Significance of "drop_columns" in assign().
# Setting drop_columns to True will only return assigned expressions.
#
# Example 10: Drop all column and return new assigned expressions.
>>> df.assign(drop_columns = True,
... addc = c1 + c2,
... subc = c1 - c2,
... mulc = c1 * c2,
... divc = c1/c2).sort("addc")
addc divc mulc subc
0 4.95 3.950000 3.95 2.95
1 5.76 1.880000 7.52 1.76
2 6.70 1.233333 11.10 0.70
3 7.50 0.875000 14.00 -0.50
4 8.44 0.688000 17.20 -1.56
5 9.33 0.332857 16.31 -4.67
6 9.50 0.583333 21.00 -2.50
7 11.60 0.450000 28.80 -4.40
8 12.82 0.424444 34.38 -5.18
9 13.71 0.371000 37.10 -6.29
>>>
# Example 11: Duplicate a column with a new name.
# In the example here, we are duplicating:
# 1. Column "id" to new column "c1".
# 2. Column "gpa" to new column "c2".
>>> df.assign(c1 = c2, c2 = c1).sort("id")
masters gpa stats programming admitted c1 c2
id
1 yes 3.95 Beginner Beginner 0 1 3.95
2 yes 3.76 Beginner Beginner 0 2 3.76
3 no 3.70 Novice Beginner 1 3 3.70
4 yes 3.50 Beginner Novice 1 4 3.50
5 no 3.44 Novice Novice 0 5 3.44
6 yes 3.50 Beginner Advanced 1 6 3.50
7 yes 2.33 Novice Novice 1 7 2.33
8 no 3.60 Beginner Advanced 1 8 3.60
9 no 3.82 Advanced Advanced 1 9 3.82
10 no 3.71 Advanced Advanced 1 10 3.71
>>>
# Example 12: Renaming columns.
# Example 6 command can be modified a bit to rename columns, rather than
# duplicating it.
# Use "drop_column=True" in example 6 command to select all the desired columns.
>>> df.assign(drop_columns=True, c1 = c2, c2 = c1,
... masters=df.masters,
... stats=df.stats,
... programming=df.programming,
... admitted=df.admitted).sort("c1")
masters stats programming admitted c1 c2
0 yes Beginner Beginner 0 1 3.95
1 yes Beginner Beginner 0 2 3.76
2 no Novice Beginner 1 3 3.70
3 yes Beginner Novice 1 4 3.50
4 no Novice Novice 0 5 3.44
5 yes Beginner Advanced 1 6 3.50
6 yes Novice Novice 1 7 2.33
7 no Beginner Advanced 1 8 3.60
8 no Advanced Advanced 1 9 3.82
9 no Advanced Advanced 1 10 3.71
>>>
#
# Executing Aggregate Functions with assign() and DataFrame.groupby().
#
# Here, we will be using 'func' from sqlalchemy to run some aggregate functions.
>>> from sqlalchemy import func
>>>
# Example 13: Calculate average "gpa" for values in the "stats" column.
>>> df.groupby("stats").assign(res=func.ave(df.gpa.expression))
stats res
0 Beginner 3.662000
1 Advanced 3.508750
2 Novice 3.559091
>>>
# Example 14: Calculate standard deviation, kurtosis value and sum of values in
# the "gpa" column with values grouped by values in the "stats" column.
# Alternate approach for DataFrame.agg(). This allows user to name the
# result columns.
>>> df.groupby("stats").assign(gpa_std_=func.ave(df.gpa.expression),
... gpa_kurtosis_=func.kurtosis(df.gpa.expression),
... gpa_sum_=func.sum(df.gpa.expression))
stats gpa_kurtosis_ gpa_std_ gpa_sum_
0 Beginner -0.452859 3.662000 18.31
1 Advanced 2.886226 3.508750 84.21
2 Novice 6.377775 3.559091 39.15
>>>