Teradata Package for R Function Reference | 17.00 - 17.00 - tdOneHotEncoder - Teradata Package for R

Teradata® Package for R Function Reference

Product
Teradata Package for R
Release Number
17.00
Release Date
July 2021
Content Type
Programming Reference
Publication ID
B700-4007-090K
Language
English (United States)

Description

One hot encoding is useful when a categorical data element must be re-expressed as one or more numeric data elements, creating a binary numeric field for each categorical data value. One hot encoding supports character, numeric and date type columns.
One hot encoding is offered in two forms: dummy-coding and contrast-coding.

  1. In dummy-coding, a new column is produced for each listed value, with a value of 0 or 1 depending on whether that value is assumed by the original column. If a column assumes n values, new columns can be created for all n values, (or for only n-1 values, because the nth column is perfectly correlated with the first n-1 columns).

  2. Alternately, given a list of values to contrast-code along with a reference value, a new column is produced for each listed value, with a value of 0 or 1 depending on whether that value is assumed by the original column, or a value of -1 if that original value is equal to the reference value.

Note:

  • Output of this function is passed to "one.hot.encode" argument of td_transform_valib().

Usage

tdOneHotEncoder(values, column, style="dummy", reference.value=NULL,
                out.column=NULL, datatype=NULL, fillna=NULL)

Arguments

values

Required Argument.
Specifies the values to code and optionally the name of the resulting output column.
Note:

  1. If date values are entered, the keyword 'DATE' must precede the date value, and do not enclose in single quotes.
    For example,
    values='DATE 1987-06-09'

  2. Use a named list to pass value when result output column is to be named. Key in the named list must be the value to code and value must be either NULL or not specified, in case result output column is not to be named or a string if it is to be named.
    For example,
    values = c("Male"="M", "Female"=NULL)
    In the example above,

    • we would like to name the output column as 'M' for one hot encoded values for 'Male' and

    • for the one hot encoding values of 'Female' we would like to have the output name as 'Female', hence we pass NULL.

    For example,
    values = c("Male"="M", "Female")
    In the example above,

    • we would like to name the output column as 'M' for one hot encoded values for 'Male' and

    • for the one hot encoding values of 'Female' we would like to have the output name as 'Female', hence we do not pass value.

Types: logical, integer, numeric, character, or list of logical, integer numeric, character

column

Required Argument.
Specifies the name of the column on which one hot encoding is to be performed.
Types: character

style

Optional Argument.
Specifies the one hot encoding style to use.
Permitted Values: 'dummy', 'contrast'
Default Value: 'dummy'
Types: character

reference.value

Required Argument when "style" is 'contrast', ignored otherwise.
Specifies the reference value to use for 'contrast' style. If the original value in the column is equal to the reference value then -1 is returned for the same.
Types: logical, integer, numeric, character

out.column

Optional Argument.
Specifies the name of the output column. Value passed to this argument also plays a crucial role in determining the output column name
Types: character

datatype

Optional Argument.
Specifies the name of the intended datatype of the output column.
Intended data types for the output column can be specified using the permitted strings below:

------------------------------------ ---------------------------------------
If intended SQL Data Type is Permitted Value to be passed is
------------------------------------ ---------------------------------------
bigint bigint
byteint byteint
char(n) char,n
date date
decimal(m,n) decimal,m,n
float float
integer integer
number(*) number
number(n) number,n
number(*,n) number,*,n
number(n,n) number,n,n
smallint smallint
time(p) time,p
timestamp(p) timestamp,p
varchar(n) varchar,n

Notes:

  1. Argument is ignored if "out.column" argument is not used.

  2. char without a size is not supported.

  3. number(*) does not include the * in its datatype format.

Examples:

  1. If intended datatype for the output column is 'bigint', then pass string 'bigint' to the argument as shown below:
    datatype="bigint"

  2. If intended datatype for the output column is 'decimal(3,5)', then pass string 'decimal,3,5' to the argument as shown below:
    datatype="decimal,3,5"

Types: character

fillna

Optional Argument.
Specifies whether the null replacement/missing value treatment should be performed with one hot encoding or not. Output of tdFillNa() can be passed to this argument.
Note:

  • If the tdFillNa object is created with its arguments "column" and "datatype", then values passed in tdFillNa() arguments are ignored. Only nullstyle information is captured from the same.

Types: tdFillNa

Details

Notes:
Output column names for the transformation using td_transform_valib() function depends on "values" and "column" arguments. Here is how output column names are determined:

  1. If "values" is an unnamed list:

    • If "out.column" is not passed, then output column is formed using the value in "values" and column name passed in "column". For example,
      If values=list("val1", "val2") and column="col" then, output column names are:
      'val1_col' and 'val2_col'

    • If "out.column" is passed, then output column name is formed using the value passed in "values" and string passed in "out.column". For example,
      If values=list("val1", "val2"), column="col", out.column="ocol" then, output column names are:
      'val1_ocol' and 'val2_ocol'

  2. If "values" is a named list:

    • If value in a named list is not NULL, then that is used as output column name. For example:
      If values = list("val1"="v1") then output column name is "v1".

    • If value in a list is NULL, then rules specified in point 1 are applied to determine the output column name.

Value

An object of tdOneHotEncoder class.

Examples

Notes:
# 1. To run any transformation, user needs to use td_transform_valib()
#    function.
# 2. To do so set option 'val.install.location' to the database name
#    where Vantage analytic library functions are installed.
# 3. Datasets used in these examples can be loaded using Vantage Analytic
#    Library installer.

# Get the current context/connection
con <- td_get_context()$connection

# Set the option 'val.install.location'.
options(val.install.location = "SYSLIB")

# Create object(s) of class "tbl_teradata".
admissions_train <- tbl(con, "admissions_train")

# Example 1: Encode all values 'Novice', 'Advanced', and 'Beginner'
#            in 'programming' column using 'dummy' style.
dc <- tdOneHotEncoder(values=list("Novice", "Advanced", "Beginner"),
                      column="programming")

# Perform the one hot encoding transformation using td_transform_valib().
obj <- td_transform_valib(data=admissions_train, one.hot.encode=dc,
                          key.columns="id")
obj$result

# Example 2: Encode all values 'Novice', 'Advanced', and 'Beginner'
#            in 'programming' column using 'dummy' style. Also, pass
#            "out.column" argument to control the name of
#            the output column.
dc <- tdOneHotEncoder(style="dummy", values=list("Novice", "Advanced",
                                                 "Beginner"),
                      column="programming", out.column="prog")

# Perform the one hot encoding transformation using td_transform_valib().
obj <- td_transform_valib(data=admissions_train, one.hot.encode=dc,
                          key.columns="id")
obj$result

# Example 3: Encode all values 'Novice', 'Advanced', and 'Beginner'
#            in 'programming' column using 'dummy' style. Example shows
#            why and how to pass values using a named list. By passing
#            a named list, we should be able to control the name of the
#            output columns. In this example, we would like to name the
#            output column for value 'Advanced' as 'Adv',
#            'Beginner' as 'Beg' and for 'Novice' we would like to use
#            default mechanism.
values <- list("Novice"=NULL, "Advanced"="Adv", "Beginner"="Beg")
dc <- tdOneHotEncoder(style="dummy", values=values, column="programming")

# Perform the one hot encoding transformation using td_transform_valib().
obj <- td_transform_valib(data=admissions_train, one.hot.encode=dc,
                          key.columns="id")
obj$result

# Example 4: Encode all values 'Novice', 'Advanced', and 'Beginner'
#            in 'programming' column using 'dummy' style. Example shows
#            controlling of the output column name with "out.column"
#            argument. In this example, we would like to name the output
#            column for value 'Advanced' as 'Adv', 'Beginner' as 'Beg' and
#            'Novice' as 'Novice_prog'.
values <- list("Novice"=NULL, "Advanced"="Adv", "Beginner"="Beg")
dc <- tdOneHotEncoder(style="dummy", values=values,
                      column="programming", out.column="prog")

# Perform the one hot encoding transformation using td_transform_valib().
obj <- td_transform_valib(data=admissions_train, one.hot.encode=dc,
                          key.columns="id")
obj$result

# Example 5: Encode 'yes' value in 'masters' column using 'contrast' style
#            with reference value as 0.
dc <- tdOneHotEncoder(style="contrast", values="yes", reference.value=0,
                      column="masters")

# Perform the one hot encoding transformation using td_transform_valib().
obj <- td_transform_valib(data=admissions_train, one.hot.encode=dc,
                          key.columns="id")
obj$result

# Example 6: Encode all values in 'programming' column using 'contrast' style
#            with reference value as 'Advanced'.
values <- list("Advanced"="Adv", "Beginner"="Beg", "Novice"="Nov")
dc <- tdOneHotEncoder(style="contrast", values=values,
                      reference.value="Advanced", column="programming")

# Perform the one hot encoding transformation using td_transform_valib().
obj <- td_transform_valib(data=admissions_train, one.hot.encode=dc,
                          key.columns="id")
obj$result

# Example 7: Example shows combining multiple one hot encoding styles on
#            different columns and performing the transformation using
#            td_transform_valib() function from Vantage Analytic Library.
# Encode all values in 'programming' column using 'dummy' encoding style.
dc_prog_dummy <- tdOneHotEncoder(values=list("Novice", "Advanced",
                                             "Beginner"),
                                 column="programming", out.column="prog")

# Encode all values in 'stats' column using 'dummy' encoding style. Also, we
# will combine it with null replacement.
values <- list("Advanced"="Adv", "Beginner"="Beg")
fillna <- tdFillNa(style="literal", value="Advanced")
dc_stats_dummy <- tdOneHotEncoder(values=values, column="stats",
                                  fillna=fillna)

# Encode 'yes' in 'masters' column using 'contrast' encoding style.
# Reference value used is 'no'.
dc_mast_contrast <- tdOneHotEncoder(style="contrast", values="yes",
                                    reference.value="no", column="masters")

# Encode all values in 'programming' column using 'contrast' encoding style.
# Reference value used is 'Advanced'.
dc_prog_contrast <- tdOneHotEncoder(style="contrast",
                                    values=list("Novice", "Advanced",
                                                "Beginner"),
                                    reference.value="Advanced",
                                    column="programming")

# Perform the one hot encoding transformation using td_transform_valib().
obj <- td_transform_valib(data=admissions_train,
                          one.hot.encode=c(dc_prog_dummy,
                                           dc_stats_dummy,
                                           dc_mast_contrast,
                                           dc_prog_contrast),
                          key.columns="id")
obj$result