Teradata Package for Python Function Reference on VantageCloud Lake - OneHotEncoder - Teradata Package for Python - Look here for syntax, methods and examples for the functions included in the Teradata Package for Python.
- Deployment: VantageCloud
- Edition: Lake
- Product: Teradata Package for Python
- Release Number: 20.00.00.03
- Published: December 2024
- ft:locale: en-US
- ft:lastEdition: 2024-12-19
- dita:id: TeradataPython_FxRef_Lake_2000
- Product Category: Teradata Vantage
teradataml.analytics.Transformations.OneHotEncoder.__init__ = __init__(self, values, columns, style='dummy', reference_value=None, out_columns=None, datatype=None, fillna=None)
DESCRIPTION:
One hot encoding is useful when a categorical data element must be re-expressed
as one or more numeric data elements, creating a binary numeric field for
each categorical data value. One hot encoding supports character, numeric,
and date type columns.
One hot encoding is offered in two forms: dummy-coding and contrast-coding.
* In dummy-coding, a new column is produced for each listed value, with
a value of 0 or 1 depending on whether that value is assumed by the
original column. If a column assumes n values, new columns can be
created for all n values or for only n-1 values, because the nth
column is perfectly correlated with the first n-1 columns.
* In contrast-coding, given a list of values to code along with a
reference value, a new column is produced for each listed value, with
a value of 0 or 1 depending on whether that value is assumed by the
original column, or a value of -1 if the original value is equal to
the reference value.
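The two styles can be sketched in plain Python. This is only an illustration of the encoding rules described above, not how teradataml or the Vantage Analytic Library implements them; the helper names dummy_code and contrast_code are hypothetical. For a cell that equals both the listed value and the reference value, the sketch returns 1, which is an assumption drawn from the outputs shown in Examples 6 and 7.

```python
def dummy_code(cell, value):
    """Return 1 if the cell holds the listed value, else 0."""
    return 1 if cell == value else 0


def contrast_code(cell, value, reference_value):
    """Return 1 for the listed value, -1 for the reference value, else 0."""
    # Assumption: a match on the listed value takes precedence over
    # a match on the reference value.
    if cell == value:
        return 1
    if cell == reference_value:
        return -1
    return 0


programming = ["Novice", "Advanced", "Beginner"]

# Dummy-coding the value "Novice".
print([dummy_code(cell, "Novice") for cell in programming])
# prints [1, 0, 0]

# Contrast-coding the value "Novice" against the reference value "Advanced".
print([contrast_code(cell, "Novice", "Advanced") for cell in programming])
# prints [1, -1, 0]
```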
Note:
The OneHotEncoder object created here is passed to the "one_hot_encode"
argument of the Transform() function from the Vantage Analytic Library.
PARAMETERS:
values:
Required Argument.
Specifies the values to code and optionally the name of the
resulting output column.
Notes:
1. A date value can be passed either as a datetime.date object or as a
string. In the string form, the keyword 'DATE' must precede the date
value, and the date value itself must not be enclosed in single quotes.
For example,
values='DATE 1987-06-09'
values=date(1987, 6, 9)
2. Use a dict to pass values when an output column is to be named.
Each key of the dictionary must be a value to code, and the
corresponding value must be either a string naming the output column
or None if the output column is not to be named.
For example,
values = {"Male": "M", "Female": None}
In the example above,
- the output column holding the one hot encoded values for "Male"
is named 'M', and
- the output column for "Female" gets its default name, derived
from "Female", because None is passed as the value.
Types: bool, float, int, str, dict, datetime.date or list of booleans, floats, integers,
strings, datetime.date
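The accepted forms of "values" can be put side by side as plain Python objects. The sketch below does not call teradataml; as_date_literal is a hypothetical helper that only shows how a datetime.date lines up with the 'DATE yyyy-mm-dd' string form described in the notes above, on the assumption that both forms denote the same date value.

```python
from datetime import date


def as_date_literal(d):
    """Render a datetime.date in the 'DATE yyyy-mm-dd' string form."""
    return "DATE " + d.isoformat()


# A single value, a list of values, and a dict that names one output column.
single_value = "Male"
value_list = ["Male", "Female"]
named_values = {"Male": "M", "Female": None}

# Two ways of expressing the same date value.
date_object = date(1987, 6, 9)
date_string = as_date_literal(date_object)
print(date_string)  # prints DATE 1987-06-09
```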
columns:
Required Argument.
Specifies the name of the column to encode. The value passed to this
argument also plays a crucial role in determining the output column name.
Types: str
style:
Optional Argument.
Specifies the one hot encoding style to use.
Permitted Values: 'dummy', 'contrast'
Default Value: 'dummy'
Types: str
reference_value:
Required Argument when "style" is 'contrast', ignored otherwise.
Specifies the reference value to use for 'contrast' style. If the original
value in the column is equal to the reference value, then -1 is returned
for that row.
Types: bool, int, float, str, datetime.date
out_columns:
Optional Argument.
Specifies the name of the output column. Value passed to this argument
also plays a crucial role in determining the output column name.
Types: str
datatype:
Optional Argument.
Specifies the name of the intended datatype of the output column.
Intended data types for the output column can be specified using either the
teradatasqlalchemy types or the permitted strings mentioned below:
------------------------------------------------------------------
| If intended SQL Data Type is | Permitted Value to be passed is |
|------------------------------|---------------------------------|
| bigint                       | bigint                          |
| byteint                      | byteint                         |
| char(n)                      | char,n                          |
| date                         | date                            |
| decimal(m,n)                 | decimal,m,n                     |
| float                        | float                           |
| integer                      | integer                         |
| number(*)                    | number                          |
| number(n)                    | number,n                        |
| number(*,n)                  | number,*,n                      |
| number(n,n)                  | number,n,n                      |
| smallint                     | smallint                        |
| time(p)                      | time,p                          |
| timestamp(p)                 | timestamp,p                     |
| varchar(n)                   | varchar,n                       |
------------------------------------------------------------------
Notes:
1. Argument is ignored if "columns" argument is not used.
2. char without a size is not supported.
3. number(*) does not include the * in its datatype format.
Examples:
1. If intended datatype for the output column is "bigint", then
pass string "bigint" to the argument as shown below:
datatype="bigint"
2. If intended datatype for the output column is "decimal(5,3)", then
pass string "decimal,5,3" to the argument as shown below:
datatype="decimal,5,3"
Types: str, BIGINT, BYTEINT, CHAR, DATE, DECIMAL, FLOAT, INTEGER, NUMBER, SMALLINT, TIME,
TIMESTAMP, VARCHAR.
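The mapping between the permitted strings and the SQL data types in the table above follows a simple pattern, sketched below in plain Python. The helper to_sql_type is hypothetical and is not part of teradataml; it only mirrors the table and the notes (char needs a size, and number without arguments stands for number(*)).

```python
def to_sql_type(spec):
    """Turn a permitted string such as 'decimal,5,3' into 'decimal(5,3)'."""
    name, *args = [part.strip() for part in spec.split(",")]
    if name == "char" and not args:
        # Mirrors the note that char without a size is not supported.
        raise ValueError("char without a size is not supported")
    if name == "number" and not args:
        # The permitted string for number(*) omits the '*'.
        return "number(*)"
    if not args:
        return name
    return "{}({})".format(name, ",".join(args))


for spec in ["bigint", "char,10", "decimal,5,3", "number", "number,*,2"]:
    print(spec, "->", to_sql_type(spec))
# prints:
# bigint -> bigint
# char,10 -> char(10)
# decimal,5,3 -> decimal(5,3)
# number -> number(*)
# number,*,2 -> number(*,2)
```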
fillna:
Optional Argument.
Specifies whether null replacement (missing value treatment) should
be performed along with one hot encoding. Output of FillNa() can be
passed to this argument.
Note:
If the FillNa object is created with its arguments "columns",
"out_columns" and "datatype", the values passed to those arguments
are ignored. Only the null replacement style information is taken
from the object.
Types: FillNa
NOTES:
Output column names for the transformation using the Transform() function
depend on the "values", "columns" and "out_columns" arguments. Here is how
output column names are determined:
1. If "values" is not a dictionary:
a. If "out_columns" is not passed, then output column is formed
using the value in "values" and column name passed to "columns".
For example,
If values=["val1", "val2"] and columns="col"
then, output column names are:
'val1_col' and 'val2_col'
b. If "out_columns" is passed, then output column is formed
using the value in "values" and column name passed to "out_columns".
For example,
If values=["val1", "val2"], columns="col", and
out_columns="ocol" then, output column names are:
'val1_ocol' and 'val2_ocol'
2. If "values" is a dictionary:
a. If value in a dictionary is not None, then that value is used
as output column name.
For example:
If values = {"val1": "v1"} then output column name is "v1".
b. If value in a dictionary is None, then rules specified in point 1
are applied to determine the output column name.
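The naming rules above can be restated as a small plain-Python function. This is a sketch of the rules as written in these notes, not teradataml code; output_column_names is a hypothetical helper name.

```python
def output_column_names(values, columns, out_columns=None):
    """Derive output column names from "values", "columns" and "out_columns"."""
    # Rule 1: "out_columns", when passed, replaces "columns" in the name.
    suffix = out_columns if out_columns is not None else columns
    if not isinstance(values, dict):
        if not isinstance(values, list):
            values = [values]
        values = {value: None for value in values}
    names = []
    for value, name in values.items():
        # Rule 2a: an explicit name wins; rule 2b falls back to rule 1.
        names.append(name if name is not None else "{}_{}".format(value, suffix))
    return names


print(output_column_names(["val1", "val2"], "col"))
# prints ['val1_col', 'val2_col']
print(output_column_names(["val1", "val2"], "col", out_columns="ocol"))
# prints ['val1_ocol', 'val2_ocol']
print(output_column_names({"val1": "v1", "val2": None}, "col"))
# prints ['v1', 'val2_col']
```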
RETURNS:
An instance of OneHotEncoder class.
RAISES:
TeradataMlException, TypeError, ValueError
EXAMPLE:
# Note:
# To run any transformation, use the Transform() function from the
# Vantage Analytic Library.
# To do so, import valib first and set "val_install_location".
>>> from teradataml import configure, DataFrame, OneHotEncoder, FillNa, load_example_data, valib
>>> configure.val_install_location = "SYSLIB"
>>>
# Load example data.
>>> load_example_data("dataframe", "admissions_train")
>>>
# Create the required DataFrame.
>>> df = DataFrame("admissions_train")
>>> df
    masters   gpa     stats  programming  admitted
id
13       no  4.00  Advanced       Novice         1
26      yes  3.57  Advanced     Advanced         1
5        no  3.44    Novice       Novice         0
19      yes  1.98  Advanced     Advanced         0
15      yes  4.00  Advanced     Advanced         1
40      yes  3.95    Novice     Beginner         0
7       yes  2.33    Novice       Novice         1
22      yes  3.46    Novice     Beginner         0
36       no  3.00  Advanced       Novice         0
38      yes  2.65  Advanced     Beginner         1
>>>
# Example 1: Encode all values 'Novice', 'Advanced', and 'Beginner'
# in "programming" column using "dummy" style.
>>> dc = OneHotEncoder(values=["Novice", "Advanced", "Beginner"], columns="programming")
# Execute Transform() function.
>>> obj = valib.Transform(data=df, one_hot_encode=dc, key_columns="id")
>>> obj.result
   id  Novice_programming  Advanced_programming  Beginner_programming
0   5                   1                     0                     0
1  34                   0                     0                     1
2  13                   1                     0                     0
3  40                   0                     0                     1
4  22                   0                     0                     1
5  19                   0                     1                     0
6  36                   1                     0                     0
7  15                   0                     1                     0
8   7                   1                     0                     0
9  17                   0                     1                     0
>>>
# Example 2: Encode all values 'Novice', 'Advanced', and 'Beginner'
# in "programming" column using "dummy" style. Also, pass
# "out_columns" argument, to control the name of the output column.
>>> dc = OneHotEncoder(style="dummy", values=["Novice", "Advanced", "Beginner"],
... columns="programming", out_columns="prog")
# Execute Transform() function.
>>> obj = valib.Transform(data=df, one_hot_encode=dc, key_columns="id")
>>> obj.result
   id  Novice_prog  Advanced_prog  Beginner_prog
0  15            0              1              0
1   7            1              0              0
2  22            0              0              1
3  17            0              1              0
4  13            1              0              0
5  38            0              0              1
6  26            0              1              0
7   5            1              0              0
8  34            0              0              1
9  40            0              0              1
>>>
# Example 3: Encode all values 'Novice', 'Advanced', and 'Beginner'
# in "programming" column using "dummy" style. This example shows
# how to pass values as a dictionary in order to control the names
# of the output columns.
# In this example, we would like to name the output column for
# value 'Advanced' as 'Adv', 'Beginner' as 'Beg' and for 'Novice'
# we would like to use the default naming mechanism.
>>> values = {"Novice": None, "Advanced": "Adv", "Beginner": "Beg"}
>>> dc = OneHotEncoder(style="dummy", values=values, columns="programming")
# Execute Transform() function.
>>> obj = valib.Transform(data=df, one_hot_encode=dc, key_columns="id")
>>> obj.result
   id  Novice_programming  Adv  Beg
0  13                   1    0    0
1  26                   0    1    0
2   5                   1    0    0
3  19                   0    1    0
4  15                   0    1    0
5  40                   0    0    1
6   7                   1    0    0
7  22                   0    0    1
8  36                   1    0    0
9  38                   0    0    1
>>>
# Example 4: Encode all values 'Novice', 'Advanced', and 'Beginner'
# in "programming" column using "dummy" style.
# Example shows controlling the output column names with a dictionary
# and the "out_columns" argument.
# In this example, we would like to name the output column for
# value 'Advanced' as 'Adv', 'Beginner' as 'Beg', 'Novice' as 'Novice_prog'.
>>> values = {"Novice": None, "Advanced": "Adv", "Beginner": "Beg"}
>>> dc = OneHotEncoder(style="dummy", values=values, columns="programming",
... out_columns="prog")
# Execute Transform() function.
>>> obj = valib.Transform(data=df, one_hot_encode=dc, key_columns="id")
>>> obj.result
   id  Novice_prog  Adv  Beg
0  15            0    1    0
1   7            1    0    0
2  22            0    0    1
3  17            0    1    0
4  13            1    0    0
5  38            0    0    1
6  26            0    1    0
7   5            1    0    0
8  34            0    0    1
9  40            0    0    1
>>>
# Example 5: Encode 'yes' value in "masters" column using "contrast" style
# with reference value as 0. No value in "masters" equals 0, so -1
# does not appear in the output.
>>> dc = OneHotEncoder(style="contrast", values="yes", reference_value=0,
... columns="masters")
# Execute Transform() function.
>>> obj = valib.Transform(data=df, one_hot_encode=dc, key_columns="id")
>>> obj.result
   id  yes_masters
0  15            1
1   7            1
2  22            1
3  17            0
4  13            0
5  38            1
6  26            1
7   5            0
8  34            1
9  40            1
>>>
# Example 6: Encode all values in "programming" column using "contrast" style
# with reference_value as 'Advanced'.
>>> values = {"Advanced": "Adv", "Beginner": "Beg", "Novice": "Nov"}
>>> dc = OneHotEncoder(style="contrast", values=values, reference_value="Advanced",
... columns="programming")
# Execute Transform() function.
>>> obj = valib.Transform(data=df, one_hot_encode=dc, key_columns="id")
>>> obj.result
   id  Adv  Beg  Nov
0  15    1   -1   -1
1   7    0    0    1
2  22    0    1    0
3  17    1   -1   -1
4  13    0    0    1
5  38    0    1    0
6  26    1   -1   -1
7   5    0    0    1
8  34    0    1    0
9  40    0    1    0
>>>
# Example 7: Example shows combining multiple one hot encoding styles on
# different columns.
# Encode all values in 'programming' column using 'dummy' encoding style.
>>> dc_prog_dummy = OneHotEncoder(values=["Novice", "Advanced", "Beginner"],
... columns="programming", out_columns="prog")
>>>
# Encode all values in 'stats' column using 'dummy' encoding style.
# Also, combine it with null replacement.
>>> values = {"Advanced": "Adv", "Beginner": "Beg"}
>>> fillna = FillNa("literal", "Advanced")
>>> dc_stats_dummy = OneHotEncoder(values=values, columns="stats", fillna=fillna)
>>>
# Encode 'yes' in 'masters' column using 'contrast' encoding style.
# Reference value used is 'no'.
>>> dc_mast_contrast = OneHotEncoder(style="contrast", values="yes", reference_value="no",
... columns="masters")
>>>
# Encode all values in 'programming' column using 'contrast' encoding style.
# Reference value used is 'Advanced'.
>>> dc_prog_contrast = OneHotEncoder(style="contrast",
... values=["Novice", "Advanced", "Beginner"],
... reference_value="Advanced",
... columns="programming")
>>>
# Execute Transform() function.
>>> obj = valib.Transform(data=df,
... one_hot_encode=[dc_prog_dummy, dc_stats_dummy,
... dc_mast_contrast, dc_prog_contrast],
... key_columns="id")
>>> obj.result
   id  Novice_prog  Advanced_prog  Beginner_prog  Adv  Beg  yes_masters  Novice_programming  Advanced_programming  Beginner_programming
0  13            1              0              0    1    0           -1                   1                     0                     0
1  26            0              1              0    1    0            1                  -1                     1                    -1
2   5            1              0              0    0    0           -1                   1                     0                     0
3  19            0              1              0    1    0            1                  -1                     1                    -1
4  15            0              1              0    1    0            1                  -1                     1                    -1
5  40            0              0              1    0    0            1                   0                     0                     1
6   7            1              0              0    0    0            1                   1                     0                     0
7  22            0              0              1    0    0            1                   0                     0                     1
8  36            1              0              0    1    0           -1                   1                     0                     0
9  38            0              0              1    1    0            1                   0                     0                     1
>>>