TD_OneHotEncodingFit Usage Notes

Database Analytic Functions

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, VMware
Product: Analytics Database
Release Number: 17.20
Published: June 2022
Language: English (United States)
Last Update: 2024-04-06

One hot encoding is a technique used to represent categorical data as numerical data. It involves creating a binary vector for each category or level of a categorical variable, with each vector having a length equal to the number of possible categories.

In this technique, a value of 1 is assigned to the corresponding category for a particular observation and a value of 0 is assigned to all other categories. This results in a matrix of 1's and 0's, where each row represents a single observation and each column represents a category.

For example, if we have a categorical variable "Color" with three possible values - red, blue, and green - we would create three binary vectors, one for each color. If an observation is red, then the corresponding binary vector would be [1,0,0]. If it is blue, the vector would be [0,1,0]. And if it is green, the vector would be [0,0,1].
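The Color example can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular database function implements it; the helper name `one_hot` and the ordering of the `colors` list are assumptions made for the example.

```python
def one_hot(value, categories):
    """Return a binary vector with a 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]

# The order of this list fixes the meaning of each position in the vector.
colors = ["red", "blue", "green"]

for color in colors:
    print(color, one_hot(color, colors))
```

Each vector has one entry per category, and exactly one of those entries is 1.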

One hot encoding is commonly used to prepare input for machine learning algorithms such as logistic regression and neural networks, which typically require numerical data as input. It also avoids a problem with ordinal encoding (where values are assigned numerical codes based on their order or rank) and label encoding (where each value is assigned an arbitrary integer code): for a variable with no natural order, integer codes can lead a model to treat the categories as ranked or as differing in magnitude.
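The difference between integer codes and one hot vectors can be seen by comparing distances between categories. The following sketch assumes a three-color variable and an arbitrary integer mapping (both hypothetical, chosen only for illustration): with integer codes, red appears twice as far from green as from blue, while with one hot vectors every pair of distinct categories is equally far apart.

```python
import math

categories = ["red", "blue", "green"]

# Integer codes assigned by list position (an arbitrary, hypothetical mapping).
label_codes = {category: index for index, category in enumerate(categories)}

# One hot vectors built from the same category list.
one_hot_codes = {
    category: [1 if category == other else 0 for other in categories]
    for category in categories
}

def label_distance(a, b):
    """Absolute difference between the integer codes of two categories."""
    return abs(label_codes[a] - label_codes[b])

def one_hot_distance(a, b):
    """Euclidean distance between the one hot vectors of two categories."""
    return math.dist(one_hot_codes[a], one_hot_codes[b])

for a, b in [("red", "blue"), ("red", "green"), ("blue", "green")]:
    print(a, b, label_distance(a, b), one_hot_distance(a, b))
```

A model that uses these numbers directly would see an ordering in the integer codes that the colors themselves do not have.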

For example, consider the following dataset:

Fruit     Categorical Value of Fruit     Price
apple     1                              5
mango     2                              10
apple     1                              15
orange    3                              20

Suppose we want to one hot encode the column Fruit. The column contains three distinct values (apple, mango, and orange), so the encoding introduces three new columns, one per value. Running the encoding on the dataset gives the following result:

apple     mango     orange     Price
1         0         0          5
0         1         0          10
1         0         0          15
0         0         1          20

The column Categorical Value of Fruit is omitted from the result shown. It does not have to be dropped; it is left out here only to keep the example easy to read. Every row that had apple in the original dataset now has a 1 in the apple column and a 0 in the other fruit columns. Similarly, every row that had mango now has a 1 in the mango column and a 0 elsewhere, and likewise for orange.
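The fruit example can be reproduced with a short Python sketch. The two helpers, `fit_categories` and `transform`, are hypothetical names chosen for this illustration; they separate learning the distinct values from applying the encoding, but they are not the implementation or the syntax of TD_OneHotEncodingFit. The sketch also assumes the new columns are ordered alphabetically.

```python
rows = [
    {"Fruit": "apple", "Price": 5},
    {"Fruit": "mango", "Price": 10},
    {"Fruit": "apple", "Price": 15},
    {"Fruit": "orange", "Price": 20},
]

def fit_categories(rows, column):
    """Collect the distinct values of `column`, in sorted order."""
    return sorted({row[column] for row in rows})

def transform(rows, column, categories):
    """Replace `column` with one 0/1 column per category."""
    encoded = []
    for row in rows:
        # One indicator column per category: 1 for a match, 0 otherwise.
        new_row = {category: int(row[column] == category) for category in categories}
        # Carry over every other column unchanged.
        new_row.update({key: value for key, value in row.items() if key != column})
        encoded.append(new_row)
    return encoded

categories = fit_categories(rows, "Fruit")
encoded = transform(rows, "Fruit", categories)

for row in encoded:
    print(row)
```

Keeping the category list separate from the transformation means the same list can be applied to new rows later, so the encoded columns stay consistent across datasets.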