| |
- TargetEncodingFit(data=None, category_data=None, encoder_method=None, target_columns=None, response_column=None, alpha_prior=None, beta_prior=None, alpha_priors=None, num_distinct_responses=None, u0_prior=None, v0_prior=None, alpha0_prior=None, beta0_prior=None, default_values=None, **generic_arguments)
- DESCRIPTION:
The TargetEncodingFit() function uses the likelihood or expected value of the
target variable for each category and encodes the category with that value.
This technique applies to both binary classification and regression. For
multiclass classification, a similar technique encodes the categorical
variable with k new variables, where k is the number of classes.
The TargetEncodingFit() function takes the input data and the category data as
input and generates the hyperparameters that the TargetEncodingTransform()
function uses to encode the categorical values.
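As a rough illustration of the idea described above, the following plain-Python sketch maps each category to the mean response observed for it. It is not the function's implementation: the helper name mean_target_encoding and the sample values are hypothetical, and the CBM_* encoder methods also involve prior parameters that this sketch ignores.

```python
from collections import defaultdict

def mean_target_encoding(categories, responses):
    """Map each category to the mean response observed for it (illustrative only)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, response in zip(categories, responses):
        if category is None:
            # Null categories are skipped rather than encoded.
            continue
        sums[category] += response
        counts[category] += 1
    return {category: sums[category] / counts[category] for category in counts}

# Hypothetical sample: a categorical column and a binary response.
sex = ["male", "female", "female", "male", None]
survived = [0, 1, 1, 1, 0]
encoding = mean_target_encoding(sex, survived)
print(encoding)
```

For a binary response, the mean per category is the observed likelihood of the positive class in that category.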
Notes:
* This function requires the UTF8 client character set.
* This function does not support Pass-Through Characters (PTCs).
* This function does not support KanjiSJIS or Graphic data types.
* The maximum number of unique categories in a single
column is 4000.
* The maximum category length is 128 characters.
* Columns with a large number of distinct categories can have an
impact on query execution time.
Usage considerations for the TargetEncodingFit() function are:
* The input data in the TargetEncodingFit() function can either have no
partition at all or use data_partition_column="ANY".
* The TargetEncodingFit() function requires category data to be
passed as a dimension. The category data should be generated by the
CategoricalSummary() function.
* Null categories are not encoded.
* Provide the "default_values" argument to TargetEncodingFit() to
assign a target value to missing categories in the
TargetEncodingTransform() function.
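To make the shape of the category data concrete, here is a minimal plain-Python sketch that counts distinct categories per target column. The column names 'ColumnName' and 'CategoryCount' follow the examples in this document; the helper category_counts, the sample rows, and the choice to leave nulls out of the count are assumptions for illustration. In practice the category data comes from CategoricalSummary().

```python
def category_counts(rows, target_columns, max_categories=4000):
    """Count distinct non-null categories for each target column (illustrative only)."""
    result = []
    for column in target_columns:
        # Assumption: null categories are not counted, since they are not encoded.
        distinct = {row[column] for row in rows if row[column] is not None}
        if len(distinct) > max_categories:
            raise ValueError(
                f"Column '{column}' has {len(distinct)} categories; "
                f"the documented maximum is {max_categories}."
            )
        result.append({"ColumnName": column, "CategoryCount": len(distinct)})
    return result

# Hypothetical sample rows.
rows = [
    {"sex": "male", "embarked": "S"},
    {"sex": "female", "embarked": "C"},
    {"sex": "female", "embarked": None},
    {"sex": "male", "embarked": "Q"},
]
category_data = category_counts(rows, ["sex", "embarked"])
print(category_data)
```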
PARAMETERS:
data:
Required Argument.
Specifies the input data containing the categorical
target columns.
Types: teradataml DataFrame
category_data:
Required Argument.
Specifies the data containing the unique categories and their counts
for each target column.
Types: teradataml DataFrame
encoder_method:
Required Argument.
Specifies the encoder method:
* If the response variable is binary (for example, values are
either 0 or 1), set "encoder_method" to 'CBM_BETA'.
* If the response variable is multi-class (for example, values are
1,...,k, where k is the number of classes), set "encoder_method"
to 'CBM_DIRICHLET'.
* If the response variable is a regression target (for example,
values are continuous numeric values), set "encoder_method" to
'CBM_GAUSSIAN_INVERSE_GAMMA'.
Notes:
* The maximum length supported is 128.
* "encoder_method" is not case sensitive.
Permitted Values: CBM_BETA, CBM_DIRICHLET, CBM_GAUSSIAN_INVERSE_GAMMA
Types: str
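The selection rule above can be sketched as a small lookup. The helper choose_encoder_method and the labels 'binary', 'multiclass', and 'regression' are hypothetical names introduced here; only the three returned values are taken from the permitted values of "encoder_method".

```python
def choose_encoder_method(response_kind):
    """Return the "encoder_method" value for a kind of response (hypothetical helper)."""
    methods = {
        "binary": "CBM_BETA",
        "multiclass": "CBM_DIRICHLET",
        "regression": "CBM_GAUSSIAN_INVERSE_GAMMA",
    }
    try:
        return methods[response_kind.lower()]
    except KeyError:
        raise ValueError(
            f"Unknown response kind '{response_kind}'; expected one of {sorted(methods)}."
        ) from None

print(choose_encoder_method("binary"))
print(choose_encoder_method("Regression"))
```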
target_columns:
Required Argument.
Specifies the column(s) from "data" that contain the categorical values
to be encoded.
Notes:
* The maximum length supported is 128.
* The maximum list length is 2018.
* "target_columns" values are not case sensitive.
Types: str OR list of Strings (str)
response_column:
Required Argument.
Specifies the column from "data" that contains the response values.
Notes:
* The maximum length supported is 128.
* "response_column" is not case sensitive.
Types: str
alpha_prior:
Optional Argument.
Specifies the prior parameter of the 'CBM_BETA' encoder method.
Types: int
beta_prior:
Optional Argument.
Specifies the prior parameter of the 'CBM_BETA' encoder method.
Types: int
alpha_priors:
Optional Argument.
Specifies the prior parameter of the 'CBM_DIRICHLET' encoder method.
Notes:
* The number of values specified in this argument must be equal to
"num_distinct_responses" value.
* The maximum list length is 2018.
Types: int OR list of ints
num_distinct_responses:
Required when "encoder_method" is 'CBM_DIRICHLET',
optional otherwise.
Specifies the number of distinct values present in the
"response_column".
Types: int
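The constraint between "alpha_priors" and "num_distinct_responses" can be checked on the client before calling the function. This is a sketch under stated assumptions: the helper check_alpha_priors is hypothetical, and the function itself may report the mismatch differently.

```python
def check_alpha_priors(alpha_priors, num_distinct_responses, max_length=2018):
    """Validate "alpha_priors" against "num_distinct_responses" (hypothetical helper)."""
    # Accept a single int or a list of ints, as the argument types allow.
    values = [alpha_priors] if isinstance(alpha_priors, int) else list(alpha_priors)
    if len(values) > max_length:
        raise ValueError(f"At most {max_length} prior values are allowed.")
    if len(values) != num_distinct_responses:
        raise ValueError(
            f"Expected {num_distinct_responses} prior values, got {len(values)}."
        )
    return values

# Three response classes need three prior values.
priors = check_alpha_priors([1, 1, 1], num_distinct_responses=3)
print(priors)
```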
u0_prior:
Optional Argument.
Specifies the prior parameter of the 'CBM_GAUSSIAN_INVERSE_GAMMA'
encoder method.
Types: int
v0_prior:
Optional Argument.
Specifies the prior parameter of the 'CBM_GAUSSIAN_INVERSE_GAMMA'
encoder method.
Types: int
alpha0_prior:
Optional Argument.
Specifies the prior parameter of the 'CBM_GAUSSIAN_INVERSE_GAMMA'
encoder method.
Types: int
beta0_prior:
Optional Argument.
Specifies the prior parameter of the 'CBM_GAUSSIAN_INVERSE_GAMMA'
encoder method.
Types: int
default_values:
Optional Argument.
Specifies the values to use when a category is not found during transform.
When only one value is specified, it is applied to all the target columns;
otherwise, the number of default values must be equal to the number of target
columns.
Note:
* The maximum list length is 2018.
Types: int OR list of ints
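The broadcasting rule for "default_values" can be sketched as follows. The helper expand_default_values is hypothetical, and pairing values with target columns by position is an assumption for illustration, not something this document states.

```python
def expand_default_values(default_values, target_columns):
    """Pair each target column with a default value (hypothetical helper)."""
    columns = [target_columns] if isinstance(target_columns, str) else list(target_columns)
    values = [default_values] if isinstance(default_values, int) else list(default_values)
    if len(values) == 1:
        # A single value is applied to every target column.
        values = values * len(columns)
    if len(values) != len(columns):
        raise ValueError(
            f"Expected 1 or {len(columns)} default values, got {len(values)}."
        )
    # Assumption: values are matched to columns by position.
    return dict(zip(columns, values))

single = expand_default_values(-1, ["sex", "embarked"])
per_column = expand_default_values([-1, -2], ["sex", "embarked"])
print(single)
print(per_column)
```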
**generic_arguments:
Specifies the generic keyword arguments that SQLE functions accept. The
generic keyword arguments are:
persist:
Optional Argument.
Specifies whether to persist the results of the
function in a table or not. When set to True,
results are persisted in a table; otherwise,
results are garbage collected at the end of the
session.
Default Value: False
Types: bool
volatile:
Optional Argument.
Specifies whether to put the results of the
function in a volatile table or not. When set to
True, results are stored in a volatile table,
otherwise not.
Default Value: False
Types: bool
The function allows the user to partition, hash, order, or local
order the input data. These generic arguments are available
for each argument that accepts a teradataml DataFrame as
input and can be accessed as:
* "<input_data_arg_name>_partition_column" accepts str or
list of str (Strings)
* "<input_data_arg_name>_hash_column" accepts str or list
of str (Strings)
* "<input_data_arg_name>_order_column" accepts str or list
of str (Strings)
* "local_order_<input_data_arg_name>" accepts boolean
Note:
These generic arguments are supported by teradataml only if
the underlying SQL Engine function supports them; otherwise, an
exception is raised.
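The naming pattern above can be spelled out for a given input argument. The helper generic_argument_names is hypothetical and only builds the keyword names; whether a particular keyword is accepted still depends on the underlying SQL Engine function.

```python
def generic_argument_names(input_data_arg_name):
    """Build the generic keyword names for one input argument (hypothetical helper)."""
    return {
        "partition": f"{input_data_arg_name}_partition_column",
        "hash": f"{input_data_arg_name}_hash_column",
        "order": f"{input_data_arg_name}_order_column",
        "local_order": f"local_order_{input_data_arg_name}",
    }

names = generic_argument_names("data")
print(names["partition"])

# Keyword arguments that could be passed along with the other arguments,
# for example to request data_partition_column="ANY".
extra_kwargs = {names["partition"]: "ANY"}
print(extra_kwargs)
```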
RETURNS:
Instance of TargetEncodingFit.
Output teradataml DataFrames can be accessed using attribute
references, such as TargetEncodingFitObj.<attribute_name>.
Output teradataml DataFrame attribute names are:
1. result
2. output_data
RAISES:
TeradataMlException, TypeError, ValueError
EXAMPLES:
# Notes:
# 1. Get the connection to Vantage to execute the function.
# 2. Import the required functions mentioned in
# the example from teradataml.
# 3. The function raises an error if it is not supported on the
# Vantage system the user is connected to.
# Load the example data.
load_example_data("teradataml", ["titanic"])
# Create teradataml DataFrame objects.
data_input = DataFrame.from_table("titanic")
# Check the list of available analytic functions.
display_analytic_functions()
# Find the distinct values and counts for columns 'sex' and 'embarked'.
categorical_summ = CategoricalSummary(data = data_input,
target_columns = ["sex", "embarked"]
)
# Find the distinct category count for 'sex' and 'embarked'. The resulting
# DataFrame should contain only two columns, named 'ColumnName' and 'CategoryCount'.
category_data = categorical_summ.result.groupby('ColumnName').count()
category_data = category_data.assign(drop_columns = True,
ColumnName = category_data.ColumnName,
CategoryCount = category_data.count_DistinctValue)
# Example 1 : Generates the required hyperparameters when "encoder_method" is
# 'CBM_BETA'.
TargetEncodingFit_out1 = TargetEncodingFit(data = data_input,
category_data = category_data,
encoder_method = 'CBM_BETA',
target_columns = ['sex', 'embarked'],
response_column = 'survived',
default_values = [-1, -2]
)
# Print the result and output_data DataFrames.
print(TargetEncodingFit_out1.result)
print(TargetEncodingFit_out1.output_data)
# Example 2 : Generates the required hyperparameters when "encoder_method" is
# 'CBM_DIRICHLET'.
TargetEncodingFit_out2 = TargetEncodingFit(data = data_input,
category_data = category_data,
encoder_method = 'CBM_DIRICHLET',
target_columns = ['sex', 'embarked'],
response_column = 'pclass',
num_distinct_responses = 3
)
# Print the result and output_data DataFrames.
print(TargetEncodingFit_out2.result)
print(TargetEncodingFit_out2.output_data)
# Example 3 : Generates the required hyperparameters when "encoder_method" is
# 'CBM_GAUSSIAN_INVERSE_GAMMA'.
TargetEncodingFit_out3 = TargetEncodingFit(data = data_input,
category_data = category_data,
encoder_method = 'CBM_GAUSSIAN_INVERSE_GAMMA',
target_columns = ['sex', 'embarked'],
response_column = 'age',
default_values = [-1, -2]
)
# Print the result and output_data DataFrames.
print(TargetEncodingFit_out3.result)
print(TargetEncodingFit_out3.output_data)
|