
Teradata® Package for R Function Reference

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, VMware
Product: Teradata Package for R
Release Number: 17.20
Published: March 2024
Locale: en-US
Last Edition: 2024-05-03
DITA ID: TeradataR_FxRef_Enterprise_1720
Lifecycle: latest
Product Category: Teradata Vantage

TargetEncodingFit

Description

The td_target_encoding_fit_sqle() function uses the likelihood or expected value of the target variable for each category and encodes the category with that value. This technique works for both binary classification and regression. For multiclass classification, a similar technique is applied, which encodes the categorical variable with k new variables, where k is the number of classes.

The td_target_encoding_fit_sqle() function takes the input data and the category data as input and generates the hyperparameters that the td_target_encoding_transform_sqle() function uses to encode the categorical values.
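
The idea of encoding a category with the expected value of the target can be illustrated locally in base R, without a Vantage connection. The sketch below is not the in-database implementation: the helper beta_target_encode() and the toy data are hypothetical, and the assumption that a Beta(alpha, beta) prior is combined with the observed counts as a posterior mean is ours, not something this reference states.

```r
# Toy data: one categorical column and a binary response (hypothetical values).
toy <- data.frame(
  sex      = c("male", "male", "male", "female", "female"),
  survived = c(0, 0, 1, 1, 1)
)

# Hypothetical helper, not part of tdplyr: encode each category with a
# smoothed estimate of P(response = 1 | category).
# ASSUMPTION: the Beta(alpha, beta) prior and the observed counts are combined
# as a posterior mean; the in-database computation may differ.
beta_target_encode <- function(category, response, alpha = 1, beta = 1) {
  successes <- tapply(response, category, sum)     # number of 1s per category
  trials    <- tapply(response, category, length)  # number of rows per category
  (alpha + successes) / (alpha + beta + trials)
}

encoding <- beta_target_encode(toy$sex, toy$survived)
print(encoding)
```

With these toy values, "female" (2 of 2 positive) is encoded as 0.75 and "male" (1 of 3 positive) as 0.4, so the prior pulls small-sample categories toward 0.5.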

Notes:

  • This function requires the UTF8 client character set.

  • This function does not support Pass-Through Characters (PTCs).

  • This function does not support KanjiSJIS or Graphic data types.

  • The maximum number of unique categories in a target column is 4000.

  • The maximum category length is 128 characters.

  • Columns with a large number of distinct categories can have an impact on query execution time.

Usage considerations for the td_target_encoding_fit_sqle() function are:

  • The input data for the td_target_encoding_fit_sqle() function can either have no partition at all or use data.partition.column = "ANY".

  • The td_target_encoding_fit_sqle() function requires category data to be passed as a dimension. Generate the category data with the td_categorical_summary_sqle() function.

  • Null categories will not be encoded.

  • Provide the "default.values" argument to td_target_encoding_fit_sqle() if you want the td_target_encoding_transform_sqle() function to assign a target value to missing categories.

Usage

  td_target_encoding_fit_sqle (
      data = NULL,
      category.data = NULL,
      encoder.method = NULL,
      target.columns = NULL,
      response.column = NULL,
      alpha.prior = NULL,
      beta.prior = NULL,
      alpha.priors = NULL,
      num.distinct.responses = NULL,
      u0.prior = NULL,
      v0.prior = NULL,
      alpha0.prior = NULL,
      beta0.prior = NULL,
      default.values = NULL,
      ...
  )

Arguments

data

Required Argument.
Specifies the input data containing the categorical target columns.
Types: tbl_teradata

category.data

Required Argument.
Specifies the data containing the unique categories and their counts for each target column.
Types: tbl_teradata

encoder.method

Required Argument.
Specifies the encoder method:

  • If the response variable is binary (for example, its values are either 0 or 1), set "encoder.method" to 'CBM_BETA'.

  • If the response variable is multi-class (for example, its values are 1,...,k, where k is the number of classes), set "encoder.method" to 'CBM_DIRICHLET'.

  • If the response variable is a regression target (for example, its values are continuous numeric values), set "encoder.method" to 'CBM_GAUSSIAN_INVERSE_GAMMA'.

Notes:

  • The maximum length supported is 128.

  • "encoder.method" values are not case-sensitive.

Permitted Values: "CBM_BETA", "CBM_DIRICHLET", "CBM_GAUSSIAN_INVERSE_GAMMA"
Types: character
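
The mapping from modeling task to encoder method described above can be written down as a small lookup. This is only a sketch: encoder_method_for() and its task labels ("binary", "multiclass", "regression") are hypothetical names introduced here for illustration, and only the three returned strings come from this reference.

```r
# Hypothetical helper, not part of tdplyr: map a modeling task to the
# documented "encoder.method" value.
encoder_method_for <- function(task = c("binary", "multiclass", "regression")) {
  task <- match.arg(task)  # rejects anything other than the three labels
  switch(task,
    binary     = "CBM_BETA",
    multiclass = "CBM_DIRICHLET",
    regression = "CBM_GAUSSIAN_INVERSE_GAMMA"
  )
}

method <- encoder_method_for("multiclass")
print(method)
```

The returned string can then be passed as the "encoder.method" argument.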

target.columns

Required Argument.
Specifies the columns in "data" that contain the categorical values to be encoded.
Notes:

  • The maximum length supported is 128.

  • The maximum list length is 2018.

  • "target.columns" values are not case-sensitive.

Types: character OR vector of Strings (character)

response.column

Required Argument.
Specifies the column in "data" that contains the response values.
Notes:

  • The maximum length supported is 128.

  • "response.column" is not case-sensitive.

Types: character

alpha.prior

Optional Argument.
Specifies the prior parameter of the 'CBM_BETA' encoder method.
Types: integer

beta.prior

Optional Argument.
Specifies the prior parameter of the 'CBM_BETA' encoder method.
Types: integer

alpha.priors

Optional Argument.
Specifies the prior parameter of the 'CBM_DIRICHLET' encoder method.
Notes:

  • The number of values specified in this argument must be equal to "num.distinct.responses" value.

  • The maximum list length is 2018.

Types: integer OR vector of integers

num.distinct.responses

Required when "encoder.method" is 'CBM_DIRICHLET', optional otherwise.
Specifies the number of distinct values present in the "response.column".
Types: integer
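
The relationship between "alpha.priors" and "num.distinct.responses" can be checked locally before calling the function. The sketch below uses a toy in-memory vector as a stand-in for the response column. The variable names and the uniform prior of 1 are assumptions made for illustration, not recommended settings.

```r
# Toy stand-in for a multi-class response column (hypothetical values).
response <- c(1, 3, 2, 3, 1, 1)

# Number of distinct classes: the value to pass as "num.distinct.responses".
num_distinct_responses <- length(unique(response))

# One prior per class. ASSUMPTION: a uniform prior of 1 for every class is
# used here only as an illustration; choose values that suit your data.
alpha_priors <- rep(1L, num_distinct_responses)

# The documented constraint: one prior value per distinct response.
stopifnot(length(alpha_priors) == num_distinct_responses)
```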

u0.prior

Optional Argument.
Specifies the prior parameter of the 'CBM_GAUSSIAN_INVERSE_GAMMA' encoder method.
Types: integer

v0.prior

Optional Argument.
Specifies the prior parameter of the 'CBM_GAUSSIAN_INVERSE_GAMMA' encoder method.
Types: integer

alpha0.prior

Optional Argument.
Specifies the prior parameter of the 'CBM_GAUSSIAN_INVERSE_GAMMA' encoder method.
Types: integer

beta0.prior

Optional Argument.
Specifies the prior parameter of the 'CBM_GAUSSIAN_INVERSE_GAMMA' encoder method.
Types: integer

default.values

Optional Argument.
Specifies the values to use when the category is not found during transform.
When only one value is specified, it is applied to all the target columns; otherwise, the number of default values must equal the number of target columns.
Note:

  • The maximum list length is 2018.

Types: integer OR vector of integers
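
The length rule for "default.values" can be mirrored in a few lines of base R. This is a local sketch of the documented rule only: resolve_default_values() is a hypothetical helper, not a tdplyr function, and the error message is our own wording.

```r
# Hypothetical helper, not part of tdplyr: expand "default.values" so that
# there is exactly one default per target column.
resolve_default_values <- function(target.columns, default.values) {
  n <- length(target.columns)
  if (length(default.values) == 1) {
    # A single value is applied to every target column.
    default.values <- rep(default.values, n)
  } else if (length(default.values) != n) {
    stop("default.values must have length 1 or length(target.columns)")
  }
  setNames(default.values, target.columns)
}

one_default <- resolve_default_values(c("sex", "embarked"), -1)
per_column  <- resolve_default_values(c("sex", "embarked"), c(-1, -2))
print(one_default)
print(per_column)
```

Passing three defaults for two target columns stops with an error, which is the case the rule above excludes.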

...

Specifies the generic keyword arguments that SQLE functions accept. The generic keyword arguments are:

persist:
Optional Argument.
Specifies whether to persist the results of the function in a table or not. When set to TRUE, results are persisted in a table; otherwise, results are garbage collected at the end of the session.
Default Value: FALSE
Types: logical

volatile:
Optional Argument.
Specifies whether to put the results of the function in a volatile table or not. When set to TRUE, results are stored in a volatile table; otherwise, they are not.
Default Value: FALSE
Types: logical

The function allows the user to partition, hash, order, or local order the input data. These generic arguments are available for each argument that accepts tbl_teradata as input and can be accessed as:

  • "<input.data.arg.name>.partition.column" accepts character or vector of character (Strings)

  • "<input.data.arg.name>.hash.column" accepts character or vector of character (Strings)

  • "<input.data.arg.name>.order.column" accepts character or vector of character (Strings)

  • "local.order.<input.data.arg.name>" accepts logical

Note:
These generic arguments are supported by tdplyr only if the underlying SQL Engine function supports them; otherwise, an exception is raised.

Value

The function returns an object of class "td_target_encoding_fit_sqle", which is a named list containing objects of class "tbl_teradata".
Named list member(s) can be referenced directly with the "$" operator using the name(s):

  1. result

  2. output.data

Examples

  
    
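    # Load the packages used below (dplyr provides tbl(), group_by() and
    # summarize()).
    library(tdplyr)
    library(dplyr)
    
    # Get the current context/connection.
    con <- td_get_context()$connection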
    
    # Load the example data.
    loadExampleData("tdplyr_example", "titanic")
    
    # Create tbl_teradata object.
    data_input <- tbl(con, "titanic")
    
    # Check the list of available analytic functions.
    display_analytic_functions()
    
    # Find the distinct values and counts for columns 'sex' and 'embarked'.
    res <- td_categorical_summary_sqle(data = data_input,
                                       target.columns = c("sex", "embarked"))
    
    # Count the distinct categories per column. The category data must contain
    # only two columns, named 'ColumnName' and 'CategoryCount'.
    category_data <- res$result %>%
                    group_by(ColumnName) %>%
                    summarize(CategoryCount = n())
    
    # Example 1 : Generates the required hyperparameters when "encoder.method" is
    #             'CBM_BETA'.
    TargetEncodingFit_out1 <- td_target_encoding_fit_sqle(
                                data = data_input,
                                category.data = category_data,
                                encoder.method = 'CBM_BETA',
                                target.columns = c('sex', 'embarked'),
                                response.column = 'survived',
                                default.values = c(-1, -2))
    # Print the result.
    print(TargetEncodingFit_out1$result)
    print(TargetEncodingFit_out1$output.data)
    
    # Example 2 : Generates the required hyperparameters when "encoder.method"
    #             is 'CBM_DIRICHLET'.
    TargetEncodingFit_out2 <- td_target_encoding_fit_sqle(
                                data = data_input,
                                category.data = category_data,
                                encoder.method = 'CBM_DIRICHLET',
                                target.columns = c('sex', 'embarked'),
                                response.column = 'pclass',
                                num.distinct.responses = 3)
    # Print the result.
    print(TargetEncodingFit_out2$result)
    print(TargetEncodingFit_out2$output.data)
    
    # Example 3 : Generates the required hyperparameters when "encoder.method"
    #             is 'CBM_GAUSSIAN_INVERSE_GAMMA'.
    TargetEncodingFit_out3 <- td_target_encoding_fit_sqle(
                                data = data_input,
                                category.data = category_data,
                                encoder.method = 'CBM_GAUSSIAN_INVERSE_GAMMA',
                                target.columns = c('sex', 'embarked'),
                                response.column = 'age',
                                default.values = c(-1, -2))
    
    # Print the result.
    print(TargetEncodingFit_out3$result)
    print(TargetEncodingFit_out3$output.data)