Teradata® Package for Python Function Reference | 17.10 - LabelEncoder
- Product
- Teradata Package for Python
- Release Number
- 17.10
- Published
- April 2022
- Language
- English (United States)
- Last Update
- 2022-08-19
- lifecycle
- previous
- Product Category
- Teradata Vantage
- teradataml.analytics.Transformations.LabelEncoder.__init__ = __init__(self, values, columns, default=None, out_columns=None, datatype=None, fillna=None)
- DESCRIPTION:
Label encoding re-expresses the existing values of a categorical column
(variable) in a new coding scheme. It can also be used to correct data quality
problems or to focus an analysis on a particular value. It allows mapping
individual values, NULL values, or any number of remaining values (ELSE
option) to a new value, a NULL value, or the same value.
Label encoding supports character, numeric, and date type columns.
Note:
Output of this function is passed to "label_encode" argument of "Transform"
function from Vantage Analytic Library.
PARAMETERS:
values:
Required Argument.
Specifies the values to be label encoded. Values can be specified in
two formats:
1. A list of two-tuples, where the first value in each tuple is the
old value and the second value is the new value.
For example,
values = [(old_val1, new_val1), (old_val2, new_val2)]
2. A dictionary with the old value as key and the new value as value.
For example,
values = {old_val1: new_val1, old_val2: new_val2}
Note:
1. Date values can be passed in one of two ways: as a string in which the
keyword 'DATE' precedes the date value (the date value itself is not
enclosed in single quotes), or as a datetime.date object.
For example,
value='DATE 1987-06-09'
value=date(1987, 6, 9)
2. To keep the old value as is, one can pass 'same' as its new value.
3. To use NULL values for old or new value, one can either use string
'null' or None.
Types: two-tuple, list of two-tuples, dict
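The accepted forms of "values" can be pictured with a small pure-Python sketch. This is not how teradataml processes the argument internally; normalize_values is a hypothetical helper, introduced only to show that a two-tuple, a list of two-tuples, and a dictionary describe the same old-to-new mapping, and that the string 'null' and None are interchangeable. Matching 'null' case-insensitively is an assumption of this sketch.

```python
from datetime import date


def normalize_values(values):
    """Hypothetical helper: turn any accepted "values" form into (old, new) pairs."""
    if isinstance(values, dict):
        pairs = list(values.items())
    elif isinstance(values, tuple):
        # A single two-tuple describes exactly one old -> new mapping.
        pairs = [values]
    else:
        pairs = list(values)

    def to_null(value):
        # Assumption: 'null' is matched case-insensitively and treated like None.
        if isinstance(value, str) and value.lower() == "null":
            return None
        return value

    return [(to_null(old), to_null(new)) for old, new in pairs]


# The three forms below all describe the same kind of mapping.
single = normalize_values(("Novice", 1))
as_list = normalize_values([("Novice", 1), ("Beginner", "null")])
as_dict = normalize_values({"Novice": 1, "Beginner": None})

# A date can be given as a datetime.date object; 'same' keeps the old value.
with_date = normalize_values({date(1987, 6, 9): "same"})
```

Here as_list and as_dict end up as the same list of pairs, which is why either form can be passed to "values".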
columns:
Required Argument.
Specifies the names of the columns containing values to be label encoded.
Types: str or list of str
default:
Optional Argument.
Specifies the value assigned to all values that are not listed in "values".
Permitted Values: None, 'SAME', 'NULL', a literal
Default Value: None
Types: bool, float, int, str
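The interaction between "values" and "default" can be sketched in plain Python. The function below is a hypothetical illustration that runs locally on a list; it is not the Vantage Analytic Library implementation, which performs the encoding in the database. It assumes, in line with Example 3 further down, that default=None sends unlisted values to NULL, and it assumes the keywords 'same' and 'null' are matched case-insensitively.

```python
def label_encode(column, values, default=None):
    """Hypothetical local illustration of the values/default semantics."""
    if isinstance(values, tuple):
        values = [values]          # a single two-tuple
    mapping = dict(values)         # works for a dict or a list of two-tuples

    def is_keyword(value, keyword):
        return isinstance(value, str) and value.lower() == keyword

    encoded = []
    for old in column:
        if old in mapping:
            new = mapping[old]
            if is_keyword(new, "same"):
                new = old          # keep the old value as is
            elif is_keyword(new, "null"):
                new = None         # map to NULL
        elif is_keyword(default, "same"):
            new = old              # unlisted values are kept
        elif default is None or is_keyword(default, "null"):
            new = None             # unlisted values become NULL
        else:
            new = default          # unlisted values get the literal
        encoded.append(new)
    return encoded


levels = ["Novice", "Advanced", "Beginner"]

# No default: the unlisted 'Beginner' becomes NULL (None).
prog = label_encode(levels, [("Novice", 0), ("Advanced", 1)])

# A literal default: everything except 'yes' becomes 0.
masters = label_encode(["yes", "no"], ("yes", 1), default=0)

# 'SAME' as default: unlisted values pass through unchanged.
kept = label_encode(levels, {"Novice": 0}, default="SAME")
```

With these inputs, prog is [0, 1, None], masters is [1, 0], and kept is [0, 'Advanced', 'Beginner'].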
out_columns:
Optional Argument.
Specifies the names of the output columns. Value passed to this argument
also plays a crucial role in determining the output column name.
Note:
The number of elements in "columns" and "out_columns" must be the same.
Types: str or list of str
datatype:
Optional Argument.
Specifies the name of the intended datatype of the output column.
Intended data types for the output column can be specified using either the
teradatasqlalchemy types or the permitted strings mentioned below:
------------------------------------------------------------------
| If intended SQL Data Type is | Permitted Value to be passed is |
|------------------------------|---------------------------------|
| bigint                       | bigint                          |
| byteint                      | byteint                         |
| char(n)                      | char,n                          |
| date                         | date                            |
| decimal(m,n)                 | decimal,m,n                     |
| float                        | float                           |
| integer                      | integer                         |
| number(*)                    | number                          |
| number(n)                    | number,n                        |
| number(*,n)                  | number,*,n                      |
| number(n,n)                  | number,n,n                      |
| smallint                     | smallint                        |
| time(p)                      | time,p                          |
| timestamp(p)                 | timestamp,p                     |
| varchar(n)                   | varchar,n                       |
------------------------------------------------------------------
Notes:
1. Argument is ignored if "columns" argument is not used.
2. char without a size is not supported.
3. number(*) does not include the * in its datatype format.
Examples:
1. If intended datatype for the output column is "bigint", then
pass string "bigint" to the argument as shown below:
datatype="bigint"
2. If intended datatype for the output column is "decimal(5,3)", then
pass string "decimal,5,3" to the argument as shown below:
datatype="decimal,5,3"
Types: str, BIGINT, BYTEINT, CHAR, DATE, DECIMAL, FLOAT, INTEGER, NUMBER, SMALLINT, TIME,
TIMESTAMP, VARCHAR.
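The permitted-string convention in the table can be read as "type name, then each size argument, separated by commas". The sketch below turns such a string back into the SQL type it stands for. permitted_to_sql is a hypothetical helper written for this illustration; it is not part of teradataml and does not validate type names or argument ranges.

```python
def permitted_to_sql(spec):
    """Hypothetical helper: map a permitted string such as 'decimal,5,3'
    to the SQL type it denotes, following the table above."""
    name, *args = [part.strip() for part in spec.split(",")]
    name = name.lower()
    if name == "char" and not args:
        # Note 2: char without a size is not supported.
        raise ValueError("char without a size is not supported")
    if name == "number" and not args:
        # Note 3: number(*) is written without the * in the permitted string.
        return "number(*)"
    if not args:
        return name
    return f"{name}({','.join(args)})"


examples = {
    spec: permitted_to_sql(spec)
    for spec in ["bigint", "decimal,5,3", "number", "number,*,3", "varchar,20"]
}
```

For instance, examples["decimal,5,3"] is "decimal(5,3)" and examples["number"] is "number(*)".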
fillna:
Optional Argument.
Specifies whether null replacement (missing value treatment) should
be performed along with label encoding. Output of FillNa() can be passed to
this argument.
Note:
If the FillNa object is created with its arguments "columns",
"out_columns" and "datatype", then the values passed to those FillNa()
arguments are ignored. Only the nullstyle information is taken from the
FillNa object.
Types: FillNa
RETURNS:
An instance of LabelEncoder class.
RAISES:
TeradataMlException, TypeError, ValueError
EXAMPLE:
# Note:
# To run any transformation, user needs to use Transform() function from
# Vantage Analytic Library.
# To do so import valib first and set the "val_install_location".
>>> from teradataml import configure, DataFrame, LabelEncoder, FillNa, load_example_data, valib
>>> configure.val_install_location = "SYSLIB"
>>>
# Load example data.
>>> load_example_data("dataframe", "admissions_train")
>>>
# Create the required DataFrame.
>>> admissions_train = DataFrame("admissions_train")
>>> admissions_train
masters gpa stats programming admitted
id
13 no 4.00 Advanced Novice 1
26 yes 3.57 Advanced Advanced 1
5 no 3.44 Novice Novice 0
19 yes 1.98 Advanced Advanced 0
15 yes 4.00 Advanced Advanced 1
40 yes 3.95 Novice Beginner 0
7 yes 2.33 Novice Novice 1
22 yes 3.46 Novice Beginner 0
36 no 3.00 Advanced Novice 0
38 yes 2.65 Advanced Beginner 1
>>>
# Example 1: Recode all values 'Novice', 'Advanced', and 'Beginner'
# in "programming" and "stats" columns.
# We will pass the mapping to the "values" argument as a dictionary.
>>> rc = LabelEncoder(values={"Novice": 1, "Advanced": 2, "Beginner": 3}, columns=["stats", "programming"])
# Execute Transform() function.
>>> obj = valib.Transform(data=admissions_train, label_encode=rc)
>>> obj.result
id stats programming
0 22 1 3
1 36 2 1
2 15 2 2
3 38 2 3
4 5 1 1
5 17 2 2
6 34 2 3
7 13 2 1
8 26 2 2
9 19 2 2
>>>
# Example 2: Recode value 'Novice' as 1, which is passed as a tuple to the "values"
# argument, and recode all other values as 0 by passing 0 to the "default"
# argument, in "programming" and "stats" columns.
>>> rc = LabelEncoder(values=("Novice", 1), columns=["stats", "programming"], default=0)
# Execute Transform() function.
>>> obj = valib.Transform(data=admissions_train, label_encode=rc)
>>> obj.result
id stats programming
0 15 0 0
1 7 1 1
2 22 1 0
3 17 0 0
4 13 0 1
5 38 0 0
6 26 0 0
7 5 1 1
8 34 0 0
9 40 1 0
>>>
# Example 3: In this example we encode values differently for multiple columns.
# For values in "programming" column, recoding will be done as follows:
# Novice --> 0
# Advanced --> 1 and
# Rest of the values as --> NULL
>>> rc_prog = LabelEncoder(values=[("Novice", 0), ("Advanced", 1)], columns="programming",
... default=None)
>>>
# For values in "stats" column, recoding will be done as follows:
# Novice --> 0
# Advanced --> keep it as is and
# Beginner --> NULL
>>> rc_stats = LabelEncoder(values={"Novice": 0, "Advanced": "same", "Beginner": None},
... columns="stats")
>>>
# For values in "masters" column, recoding will be done as follows:
# yes --> 1 and other as 0
>>> rc_yes = LabelEncoder(values=("yes", 1), columns="masters", default=0,
... out_columns="masters_yes")
>>>
# For values in "masters" column, label encoding will be done as follows:
# no --> 1 and other as 0
>>> rc_no = LabelEncoder(values=("no", 1), columns="masters", default=0,
... out_columns="masters_no")
>>>
# Execute Transform() function.
>>> obj = valib.Transform(data=admissions_train, label_encode=[rc_prog, rc_stats, rc_yes,
... rc_no])
>>> obj.result
id programming stats masters_yes masters_no
0 13 0 Advanced 0 1
1 26 1 Advanced 1 0
2 5 0 0 0 1
3 19 1 Advanced 1 0
4 15 1 Advanced 1 0
5 40 None 0 1 0
6 7 0 0 1 0
7 22 None 0 1 0
8 36 0 Advanced 0 1
9 38 None Advanced 1 0
>>>