Description
One hot encoding is useful when a categorical data element must be
re-expressed as one or more numeric data elements, creating a binary numeric
field for each categorical data value. One hot encoding supports character,
numeric and date type columns.
One hot encoding is offered in two forms: dummy-coding and contrast-coding.
In dummy-coding, a new column is produced for each listed value, with a value of 0 or 1 depending on whether that value is assumed by the original column. If a column assumes n values, new columns can be created for all n values (or for only n-1 values, because the nth column is perfectly correlated with the first n-1 columns).
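As a rough local illustration of dummy-coding, the sketch below builds the 0/1 indicator columns in base R on an in-memory vector. It does not call tdOneHotEncoder() or touch the database; the sample data and the `value_column` naming are assumptions chosen to resemble the examples further down.

```r
# Local, in-memory sketch of dummy-coding (base R only).
# The sample vector is made up for illustration.
programming <- c("Novice", "Advanced", "Beginner", "Advanced")
values <- c("Novice", "Advanced", "Beginner")

# One 0/1 indicator column per listed value.
dummy <- sapply(values, function(v) as.integer(programming == v))
colnames(dummy) <- paste0(values, "_programming")
dummy

# With all n values encoded, the last column is redundant: it can be
# recovered from the first n-1 columns.
redundant <- all(dummy[, 3] == 1 - rowSums(dummy[, 1:2]))
redundant
```

Because each row matches exactly one of the listed values, the indicators in a row sum to 1, which is why encoding only n-1 values loses no information.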
Alternatively, in contrast-coding, given a list of values to contrast-code along with a reference value, a new column is produced for each listed value, with a value of 0 or 1 depending on whether that value is assumed by the original column, or a value of -1 if the original value is equal to the reference value.
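The contrast-coding rule can be sketched locally in base R as follows. This is only an in-memory illustration under stated assumptions: `contrast_code` is a hypothetical helper, not part of tdplyr, and the sample data is made up. The listed values deliberately exclude the reference value, so the sketch makes no claim about how the library handles a listed value that is also the reference.

```r
# Local, in-memory sketch of contrast-coding (base R only).
programming <- c("Novice", "Advanced", "Beginner", "Advanced")
values <- c("Novice", "Beginner")
reference <- "Advanced"

# Hypothetical helper: -1 where the original value equals the reference,
# otherwise 1 if it equals the listed value and 0 if it does not.
contrast_code <- function(x, value, reference) {
  ifelse(x == reference, -1L, as.integer(x == value))
}

contrast <- sapply(values, function(v) contrast_code(programming, v, reference))
contrast
```

Rows holding the reference value receive -1 in every produced column, while all other rows are coded exactly as in dummy-coding.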
Note:
Output of this function is passed to the "one.hot.encode" argument of td_transform_valib().
Usage
tdOneHotEncoder(values, column, style="dummy", reference.value=NULL,
out.column=NULL, datatype=NULL, fillna=NULL)
Arguments
values
Required Argument.
Types: logical, integer, numeric, character, or list of logical, integer, numeric, character
column
Required Argument.
style
Optional Argument.
reference.value
Required Argument when "style" is 'contrast',
ignored otherwise.
out.column
Optional Argument.
datatype
Optional Argument.
Notes:
Examples:
Types: character
fillna
Optional Argument.
Types: tdFillNa
Details
Notes:
Output column names for the transformation using the td_transform_valib()
function depend on the "values", "column", and "out.column" arguments.
Here is how output column names are determined:
1. If "values" is an unnamed list:
   a. If "out.column" is not passed, then the output column name is formed using the value in "values" and the column name passed in "column". For example, if values=list("val1", "val2") and column="col", then the output column names are 'val1_col' and 'val2_col'.
   b. If "out.column" is passed, then the output column name is formed using the value passed in "values" and the string passed in "out.column". For example, if values=list("val1", "val2"), column="col", and out.column="ocol", then the output column names are 'val1_ocol' and 'val2_ocol'.
2. If "values" is a named list:
   a. If a value in the named list is not NULL, then that value is used as the output column name. For example, if values=list("val1"="v1"), then the output column name is 'v1'.
   b. If a value in the named list is NULL, then the rules specified in point 1 are applied to determine the output column name, with the element name taken as the value. For example, if values=list("Novice"=NULL), column="programming", and out.column="prog", then the output column name is 'Novice_prog'.
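The naming rules above can be mirrored in a small base R function to preview the names a given call should produce. `output_names` is a hypothetical helper written for this illustration and is not part of tdplyr; it encodes only the rules as documented here, so the library's actual behaviour may differ in edge cases.

```r
# Hypothetical helper, not part of tdplyr: mirrors the documented naming rules.
output_names <- function(values, column, out.column = NULL) {
  # The suffix is "out.column" when passed, otherwise the source column name.
  suffix <- if (is.null(out.column)) column else out.column
  nms <- names(values)
  vapply(seq_along(values), function(i) {
    named <- !is.null(nms) && nzchar(nms[i])
    if (named && !is.null(values[[i]])) {
      as.character(values[[i]])          # named entry with explicit output name
    } else if (named) {
      paste0(nms[i], "_", suffix)        # named entry whose value is NULL
    } else {
      paste0(values[[i]], "_", suffix)   # unnamed entry
    }
  }, character(1))
}

unnamed_default <- output_names(list("val1", "val2"), column = "col")
unnamed_custom <- output_names(list("val1", "val2"), column = "col",
                               out.column = "ocol")
named_mixed <- output_names(list("Novice" = NULL, "Advanced" = "Adv",
                                 "Beginner" = "Beg"),
                            column = "programming", out.column = "prog")
```

The three calls cover the unnamed list without and with "out.column", and a named list mixing explicit names with a NULL entry, as in Example 4 below.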
Value
An object of tdOneHotEncoder class.
Examples
# Notes:
# 1. To run any transformation, the user needs to use the td_transform_valib()
#    function.
# 2. To do so, set the option 'val.install.location' to the database name
#    where the Vantage Analytic Library functions are installed.
# 3. Datasets used in these examples can be loaded using the Vantage Analytic
#    Library installer.
# Get the current context/connection
con <- td_get_context()$connection
# Set the option 'val.install.location'.
options(val.install.location = "SYSLIB")
# Create object(s) of class "tbl_teradata".
admissions_train <- tbl(con, "admissions_train")
# Example 1: Encode all values 'Novice', 'Advanced', and 'Beginner'
#            in 'programming' column using 'dummy' style.
dc <- tdOneHotEncoder(values=list("Novice", "Advanced", "Beginner"),
                      column="programming")
# Perform the one hot encoding transformation using td_transform_valib().
obj <- td_transform_valib(data=admissions_train, one.hot.encode=dc,
                          key.columns="id")
obj$result
# Example 2: Encode all values 'Novice', 'Advanced', and 'Beginner'
#            in 'programming' column using 'dummy' style. Also, pass
#            "out.column" argument to control the name of
#            the output column.
dc <- tdOneHotEncoder(style="dummy",
                      values=list("Novice", "Advanced", "Beginner"),
                      column="programming", out.column="prog")
# Perform the one hot encoding transformation using td_transform_valib().
obj <- td_transform_valib(data=admissions_train, one.hot.encode=dc,
                          key.columns="id")
obj$result
# Example 3: Encode all values 'Novice', 'Advanced', and 'Beginner'
#            in 'programming' column using 'dummy' style. Example shows
#            why and how to pass values using a named list. By passing
#            a named list, we should be able to control the name of the
#            output columns. In this example, we would like to name the
#            output column for value 'Advanced' as 'Adv',
#            'Beginner' as 'Beg' and for 'Novice' we would like to use
#            default mechanism.
values <- list("Novice"=NULL, "Advanced"="Adv", "Beginner"="Beg")
dc <- tdOneHotEncoder(style="dummy", values=values, column="programming")
# Perform the one hot encoding transformation using td_transform_valib().
obj <- td_transform_valib(data=admissions_train, one.hot.encode=dc,
                          key.columns="id")
obj$result
# Example 4: Encode all values 'Novice', 'Advanced', and 'Beginner'
#            in 'programming' column using 'dummy' style. Example shows
#            controlling of the output column name with "out.column"
#            argument. In this example, we would like to name the output
#            column for value 'Advanced' as 'Adv', 'Beginner' as 'Beg' and
#            'Novice' as 'Novice_prog'.
values <- list("Novice"=NULL, "Advanced"="Adv", "Beginner"="Beg")
dc <- tdOneHotEncoder(style="dummy", values=values,
                      column="programming", out.column="prog")
# Perform the one hot encoding transformation using td_transform_valib().
obj <- td_transform_valib(data=admissions_train, one.hot.encode=dc,
                          key.columns="id")
obj$result
# Example 5: Encode 'yes' value in 'masters' column using 'contrast' style
#            with reference value as 0.
dc <- tdOneHotEncoder(style="contrast", values="yes", reference.value=0,
                      column="masters")
# Perform the one hot encoding transformation using td_transform_valib().
obj <- td_transform_valib(data=admissions_train, one.hot.encode=dc,
                          key.columns="id")
obj$result
# Example 6: Encode all values in 'programming' column using 'contrast' style
#            with reference value as 'Advanced'.
values <- list("Advanced"="Adv", "Beginner"="Beg", "Novice"="Nov")
dc <- tdOneHotEncoder(style="contrast", values=values,
                      reference.value="Advanced", column="programming")
# Perform the one hot encoding transformation using td_transform_valib().
obj <- td_transform_valib(data=admissions_train, one.hot.encode=dc,
                          key.columns="id")
obj$result
# Example 7: Example shows combining multiple one hot encoding styles on
#            different columns and performing the transformation using
#            td_transform_valib() function from Vantage Analytic Library.
# Encode all values in 'programming' column using 'dummy' encoding style.
dc_prog_dummy <- tdOneHotEncoder(values=list("Novice", "Advanced", "Beginner"),
                                 column="programming", out.column="prog")
# Encode all values in 'stats' column using 'dummy' encoding style. Also, we
# will combine it with null replacement.
values <- list("Advanced"="Adv", "Beginner"="Beg")
fillna <- tdFillNa(style="literal", value="Advanced")
dc_stats_dummy <- tdOneHotEncoder(values=values, column="stats",
                                  fillna=fillna)
# Encode 'yes' in 'masters' column using 'contrast' encoding style.
# Reference value used is 'no'.
dc_mast_contrast <- tdOneHotEncoder(style="contrast", values="yes",
                                    reference.value="no", column="masters")
# Encode all values in 'programming' column using 'contrast' encoding style.
# Reference value used is 'Advanced'.
dc_prog_contrast <- tdOneHotEncoder(style="contrast",
                                    values=list("Novice", "Advanced",
                                                "Beginner"),
                                    reference.value="Advanced",
                                    column="programming")
# Perform the one hot encoding transformation using td_transform_valib().
obj <- td_transform_valib(data=admissions_train,
                          one.hot.encode=c(dc_prog_dummy,
                                           dc_stats_dummy,
                                           dc_mast_contrast,
                                           dc_prog_contrast),
                          key.columns="id")
obj$result