TD_OneHotEncodingFit Usage Notes | OneHotEncodingFit - TD_OneHotEncodingFit Usage Notes - Teradata Vantage

Teradata® VantageCloud Lake

Deployment
VantageCloud
Edition
Lake
Product
Teradata Vantage
Published
January 2023
ft:locale
en-US
ft:lastEdition
2024-12-11
dita:mapPath
phg1621910019905.ditamap
dita:ditavalPath
pny1626732985837.ditaval
dita:id
phg1621910019905

One hot encoding is a technique used to represent categorical data as numerical data. It involves creating a binary vector for each category or level of a categorical variable, with each vector having a length equal to the number of possible categories.

In this technique, a value of 1 is assigned to the corresponding category for a particular observation and a value of 0 is assigned to all other categories. This results in a matrix of 1's and 0's, where each row represents a single observation and each column represents a category.

For example, if we have a categorical variable "Color" with three possible values - red, blue, and green - we would create three binary vectors, one for each color. If an observation is red, then the corresponding binary vector would be [1,0,0]. If it is blue, the vector would be [0,1,0]. And if it is green, the vector would be [0,0,1].

One hot encoding is commonly used in machine learning algorithms, such as logistic regression and neural networks, as these algorithms typically require numerical data as input. It can also help in reducing the potential bias that can result from using ordinal encoding (where values are assigned numerical codes based on their order or rank) or label encoding (where values are assigned numerical codes based on their frequency of occurrence).

For example, we have the following dataset:

Fruit Categorical Value pf Fruit Price
apple 1 5
mango 2 10
apple 1 15
orange 3 20

Assuming we want to one hot encode the column Fruit, how do we go about it? We have three different values in this column: apple, mango, and orange. This means that three new columns will be introduced as a result of our one hot encoding. Following is the result of running this encoding on the dataset:

apple mango orange price
1 0 0 5
0 1 0 10
1 0 0 15
0 0 1 20

The column Categorical Value of Fruit is omitted from the result shown. This does not necessarily need to be the case and it is for our ease of understanding only. All rows that had apple in the original dataset now have a 1 in the column apple and a 0 otherwise. Similarly, all rows that had mango now have a 1 in the mango column and 0 otherwise. And so on, for all other fruits in our fruit column.