TD_OneHotEncodingFit Usage Notes

Database Analytic Functions

Analytics Database
June 2022
Teradata Vantage™

One hot encoding is a technique used to represent categorical data as numerical data. It involves creating a binary vector for each category or level of a categorical variable, with each vector having a length equal to the number of possible categories.

In this technique, a value of 1 is assigned to the corresponding category for a particular observation and a value of 0 is assigned to all other categories. This results in a matrix of 1's and 0's, where each row represents a single observation and each column represents a category.

For example, if we have a categorical variable "Color" with three possible values - red, blue, and green - we would create three binary vectors, one for each color. If an observation is red, then the corresponding binary vector would be [1,0,0]. If it is blue, the vector would be [0,1,0]. And if it is green, the vector would be [0,0,1].
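The Color example can be sketched in a few lines of plain Python. This is an illustration of the technique only, not the TD_OneHotEncodingFit implementation; the helper name `one_hot` and the fixed category order red, blue, green are assumptions made for the example.

```python
def one_hot(value, categories):
    """Return a list of 0s and 1s with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


# The category order is fixed up front; it decides which position each color maps to.
COLORS = ["red", "blue", "green"]

encoded = {color: one_hot(color, COLORS) for color in COLORS}
for color, vector in encoded.items():
    print(color, vector)
# red [1, 0, 0]
# blue [0, 1, 0]
# green [0, 0, 1]
```

Each vector has length 3, the number of possible categories, and contains exactly one 1.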

One hot encoding is commonly used in machine learning algorithms, such as logistic regression and neural networks, as these algorithms typically require numerical data as input. It can also help in reducing the potential bias that can result from using ordinal encoding (where values are assigned numerical codes based on their order or rank) or label encoding (where each value is assigned an arbitrary integer code), because a model may interpret such integer codes as an ordering or magnitude that the categories may not actually have.
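A small sketch of why integer codes can mislead a model, assuming the three colors from the earlier example are given the integer codes 0, 1, and 2 (this assignment is arbitrary and chosen only for illustration). With integer codes, some pairs of categories look closer together than others; with one hot vectors, every pair is equally far apart.

```python
from itertools import combinations

COLORS = ["red", "blue", "green"]

# Integer codes (assumed assignment): red -> 0, blue -> 1, green -> 2.
integer_code = {color: index for index, color in enumerate(COLORS)}

# One hot vectors: a single 1 at the category's position.
one_hot_code = {
    color: [1 if i == index else 0 for i in range(len(COLORS))]
    for index, color in enumerate(COLORS)
}


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


pairs = list(combinations(COLORS, 2))

# Gap between integer codes: red and green appear twice as far apart as red and blue.
integer_gaps = {pair: abs(integer_code[pair[0]] - integer_code[pair[1]]) for pair in pairs}

# Squared distance between one hot vectors: identical for every pair.
one_hot_gaps = {
    pair: squared_distance(one_hot_code[pair[0]], one_hot_code[pair[1]]) for pair in pairs
}

print(integer_gaps)
print(one_hot_gaps)
```

The integer gaps differ from pair to pair even though the colors have no natural order, while the one hot distances are all equal.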

For example, consider the following dataset:

Fruit    Categorical Value of Fruit    Price
apple    1                             5
mango    2                             10
apple    1                             15
orange   3                             20

Suppose we want to one hot encode the Fruit column. The column contains three distinct values: apple, mango, and orange. The encoding therefore introduces three new columns, one per value. Following is the result of running this encoding on the dataset:

apple    mango    orange    Price
1        0        0         5
0        1        0         10
1        0        0         15
0        0        1         20

The column Categorical Value of Fruit is omitted from the result shown only to keep the example easy to follow; it does not have to be dropped. Every row that had apple in the original dataset now has a 1 in the apple column and a 0 in the mango and orange columns. Similarly, every row that had mango now has a 1 in the mango column and a 0 in the other two fruit columns, and the same holds for orange.
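The encoding of the Fruit column can be reproduced with a minimal Python sketch. It is not the TD_OneHotEncodingFit implementation: the function name `one_hot_encode` is hypothetical, the new columns are assumed to follow the order in which each fruit first appears, and the input keeps only the Fruit and Price columns, mirroring the result table.

```python
# Input rows from the example dataset (Categorical Value of Fruit left out).
rows = [
    {"Fruit": "apple", "Price": 5},
    {"Fruit": "mango", "Price": 10},
    {"Fruit": "apple", "Price": 15},
    {"Fruit": "orange", "Price": 20},
]


def one_hot_encode(rows, column):
    """Replace `column` with one 0/1 column per distinct value it contains."""
    # Distinct values in order of first appearance (assumed column order).
    categories = list(dict.fromkeys(row[column] for row in rows))
    encoded = []
    for row in rows:
        # One indicator column per category: 1 if the row has that value, else 0.
        new_row = {category: int(row[column] == category) for category in categories}
        # Carry over every other column unchanged.
        new_row.update({key: value for key, value in row.items() if key != column})
        encoded.append(new_row)
    return categories, encoded


categories, encoded = one_hot_encode(rows, "Fruit")
print(categories)  # ['apple', 'mango', 'orange']
for row in encoded:
    print(row)
# {'apple': 1, 'mango': 0, 'orange': 0, 'Price': 5}
# {'apple': 0, 'mango': 1, 'orange': 0, 'Price': 10}
# {'apple': 1, 'mango': 0, 'orange': 0, 'Price': 15}
# {'apple': 0, 'mango': 0, 'orange': 1, 'Price': 20}
```

The sketch assumes that no fruit name collides with the name of another column; a production encoder would need to handle that case, as well as values not seen when the categories were collected.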