TargetEncodingFit
Description
The td_target_encoding_fit_sqle() function generally uses the likelihood or expected
value of the target variable for each category and encodes that category with
that value. This technique works for both binary classification and regression.
For multiclass classification, a similar technique is applied, which encodes
the categorical variable with k new variables, where k is the number of classes.
The td_target_encoding_fit_sqle() function takes the input data and a category data as
input and generates the required hyperparameters, which are then used by the
td_target_encoding_transform_sqle() function to encode the categorical values.
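To make the idea concrete, here is a minimal local sketch in base R of likelihood-style encoding. It assumes a Beta prior for a binary target and a Dirichlet prior for a multiclass target, and encodes each category with the posterior mean. The exact formulas used inside the database function are not stated here, so treat this as an illustration of the general technique, not as the function's implementation. The toy data and the helper names beta_encode and dirichlet_encode are made up for the example.

```r
# Toy data: a categorical column and a binary target (values are made up).
toy <- data.frame(
  sex      = c("male", "female", "female", "male", "male", "female"),
  survived = c(0, 1, 1, 0, 1, 1)
)

# Hypothetical helper: encode each category with the posterior mean of a
# Beta(alpha_prior, beta_prior) prior updated by that category's outcomes.
beta_encode <- function(category, target, alpha_prior = 1, beta_prior = 1) {
  successes <- tapply(target, category, sum)     # count of 1s per category
  trials    <- tapply(target, category, length)  # rows per category
  (alpha_prior + successes) / (alpha_prior + beta_prior + trials)
}

encoding <- beta_encode(toy$sex, toy$survived)
print(encoding)

# Multiclass analogue (assumed Dirichlet prior): one value per class, so a
# single categorical column becomes k columns, where k is the number of classes.
dirichlet_encode <- function(category, target, alpha_priors = 1) {
  counts   <- table(category, target)            # categories x classes
  smoothed <- counts + alpha_priors              # add the prior pseudo-counts
  smoothed / rowSums(smoothed)                   # normalise each category row
}

toy_class <- c(1, 2, 3, 3, 1, 2)                 # made-up 3-class response
print(dirichlet_encode(toy$sex, toy_class))
```

In the binary case each category maps to a single number; in the multiclass case each category maps to a row of k probabilities that sum to 1.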
Notes:
This function requires the UTF8 client character set.
This function does not support Pass-Through Characters (PTCs).
This function does not support KanjiSJIS or Graphic data types.
The maximum number of unique categories in the particular column is 4000.
The maximum category length is 128 characters.
Columns with a large number of distinct categories can have an impact on query execution time.
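The two documented limits can be checked before calling the fit function. The sketch below does this on a local vector in base R. The helper name check_category_limits and the sample values are hypothetical, and for a table that lives in the database the same counts would have to be computed there instead.

```r
# Hypothetical helper: verify a categorical column against the documented
# limits (at most 4000 unique categories, each at most 128 characters).
check_category_limits <- function(values, max_categories = 4000, max_length = 128) {
  categories <- unique(values[!is.na(values)])   # ignore missing values
  lengths    <- nchar(as.character(categories))
  list(
    n_categories = length(categories),
    longest      = if (length(lengths)) max(lengths) else 0L,
    within_limit = length(categories) <= max_categories &&
      all(lengths <= max_length)
  )
}

embarked <- c("S", "C", "S", "Q", NA, "C")       # made-up sample values
str(check_category_limits(embarked))
```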
Usage considerations for the td_target_encoding_fit_sqle() function are:
The input data in the td_target_encoding_fit_sqle() function can have no partition at all or have data_partition_column="ANY".
The td_target_encoding_fit_sqle() function requires a category data to be passed as a dimension. The category data should be generated by the td_categorical_summary_sqle() function.
Null categories will not be encoded.
The "default.values" argument should be provided to td_target_encoding_fit_sqle() if the user wants to assign a target value for missing categories in the td_target_encoding_transform_sqle() function.
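The role of "default.values" can be pictured with a small local lookup: categories seen during fitting get their fitted value, and categories that were not seen fall back to the default. This is only an illustration in base R with made-up numbers and a hypothetical helper apply_encoding; it does not reproduce how td_target_encoding_transform_sqle() is implemented.

```r
# Made-up fitted values for the categories seen during fitting.
fitted <- c(female = 0.8, male = 0.4)

# Hypothetical helper: look up each value; unseen categories get the default,
# while missing (NA) inputs stay NA because null categories are not encoded.
apply_encoding <- function(values, fitted, default_value) {
  encoded <- unname(fitted[values])              # NA where no fitted value exists
  unseen  <- is.na(encoded) & !is.na(values)     # present in input, absent in fit
  encoded[unseen] <- default_value
  encoded
}

encoded <- apply_encoding(c("male", "unknown", "female", NA), fitted,
                          default_value = -1)
print(encoded)
```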
Usage
td_target_encoding_fit_sqle (
data = NULL,
category.data = NULL,
encoder.method = NULL,
target.columns = NULL,
response.column = NULL,
alpha.prior = NULL,
beta.prior = NULL,
alpha.priors = NULL,
num.distinct.responses = NULL,
u0.prior = NULL,
v0.prior = NULL,
alpha0.prior = NULL,
beta0.prior = NULL,
default.values = NULL,
...
)
Arguments
data |
Required Argument. |
category.data |
Required Argument. |
encoder.method |
Required Argument.
Notes:
Permitted Values: "CBM_BETA", "CBM_DIRICHLET", "CBM_GAUSSIAN_INVERSE_GAMMA" |
target.columns |
Required Argument.
Types: character OR vector of Strings (character) |
response.column |
Required Argument.
Types: character |
alpha.prior |
Optional Argument. |
beta.prior |
Optional Argument. |
alpha.priors |
Optional Argument.
Types: integer OR vector of integers |
num.distinct.responses |
Required when "encoder.method" is 'CBM_DIRICHLET',
optional otherwise. |
u0.prior |
Optional Argument. |
v0.prior |
Optional Argument. |
alpha0.prior |
Optional Argument. |
beta0.prior |
Optional Argument. |
default.values |
Optional Argument.
Types: integer OR vector of integers |
... |
Specifies the generic keyword arguments SQLE functions accept. Below
are the generic keyword arguments:
volatile:
Function allows the user to partition, hash, order or local order the input data.
These generic arguments are available for each argument that accepts tbl_teradata as input and can be accessed as:
Note: |
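The two constraints stated in the argument list (the permitted values of "encoder.method", and "num.distinct.responses" being required for 'CBM_DIRICHLET') can be checked up front. The following is a minimal sketch of such a check in base R. The helper check_encoder_args is hypothetical and not part of the package, which may perform its own validation.

```r
# Hypothetical pre-flight check mirroring the documented argument rules.
check_encoder_args <- function(encoder.method, num.distinct.responses = NULL) {
  permitted <- c("CBM_BETA", "CBM_DIRICHLET", "CBM_GAUSSIAN_INVERSE_GAMMA")
  if (!encoder.method %in% permitted) {
    stop("encoder.method must be one of: ", paste(permitted, collapse = ", "))
  }
  # num.distinct.responses is required only for the Dirichlet method.
  if (encoder.method == "CBM_DIRICHLET" && is.null(num.distinct.responses)) {
    stop("num.distinct.responses is required when encoder.method is 'CBM_DIRICHLET'")
  }
  invisible(TRUE)
}

check_encoder_args("CBM_BETA")
check_encoder_args("CBM_DIRICHLET", num.distinct.responses = 3)
```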
Value
Function returns an object of class "td_target_encoding_fit_sqle"
which is a named list containing objects of class "tbl_teradata".
Named list member(s) can be referenced directly with the "$" operator
using the name(s):
result
output.data
Examples
# Get the current context/connection.
con <- td_get_context()$connection
# Load the example data.
loadExampleData("tdplyr_example", "titanic")
# Create tbl_teradata object.
data_input <- tbl(con, "titanic")
# Check the list of available analytic functions.
display_analytic_functions()
# Find the distinct values and counts for columns 'sex' and 'embarked'.
res <- td_categorical_summary_sqle(data = data_input,
                                   target.columns = c("sex", "embarked"))
# Find the distinct count of 'sex' and 'embarked'. The result should contain
# only two columns, named 'ColumnName' and 'CategoryCount'.
category_data <- res$result %>%
  group_by(ColumnName) %>%
  summarize(CategoryCount = n())
# Example 1 : Generates the required hyperparameters when "encoder.method" is
# 'CBM_BETA'.
TargetEncodingFit_out1 <- td_target_encoding_fit_sqle(
  data = data_input,
  category.data = category_data,
  encoder.method = 'CBM_BETA',
  target.columns = c('sex', 'embarked'),
  response.column = 'survived',
  default.values = c(-1, -2))
# Print the result.
print(TargetEncodingFit_out1$result)
print(TargetEncodingFit_out1$output.data)
# Example 2 : Generates the required hyperparameters when "encoder.method"
# is 'CBM_DIRICHLET'.
TargetEncodingFit_out2 <- td_target_encoding_fit_sqle(
  data = data_input,
  category.data = category_data,
  encoder.method = 'CBM_DIRICHLET',
  target.columns = c('sex', 'embarked'),
  response.column = 'pclass',
  num.distinct.responses = 3)
# Print the result.
print(TargetEncodingFit_out2$result)
print(TargetEncodingFit_out2$output.data)
# Example 3 : Generates the required hyperparameters when "encoder.method"
# is 'CBM_GAUSSIAN_INVERSE_GAMMA'.
TargetEncodingFit_out3 <- td_target_encoding_fit_sqle(
  data = data_input,
  category.data = category_data,
  encoder.method = 'CBM_GAUSSIAN_INVERSE_GAMMA',
  target.columns = c('sex', 'embarked'),
  response.column = 'age',
  default.values = c(-1, -2))
# Print the result.
print(TargetEncodingFit_out3$result)
print(TargetEncodingFit_out3$output.data)