GLMPerSegment
Description
The td_glm_sqle() function trains the whole data set as one model. The
td_glm_per_segment_sqle() function is a partition-by-key function that
creates a single model for each partition.
The following operations are supported for td_glm_per_segment_sqle():
Gaussian linear regression.
Binomial logistic regression for binary classification.
Batch and epoch iteration modes.
Regularization using L1, L2, and Elasticnet.
Mini-batch gradient descent as the numeric optimization algorithm.
Training with and without an intercept.
Class-weighted modeling.
Learning-rate schedules for mini-batch gradient descent: constant, dynamic, and hybrid.
Learning-rate optimization algorithms for mini-batch gradient descent: plain, momentum, and Nesterov.
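As a sketch of how these options map onto the function's arguments (column names and argument values are illustrative, and the character value for "learning.rate" is an assumption based on the schedule names above, not taken from this page):

```r
# Elasticnet regularization: "lambda1" sets the regularization strength and
# "alpha" mixes L1 and L2 (alpha = 0 is pure L2, alpha = 1 is pure L1).
fit <- td_glm_per_segment_sqle(
    data = train_data,                      # hypothetical tbl_teradata
    data.partition.column = "partition_id", # hypothetical partition key
    input.columns = c("x1", "x2"),          # hypothetical feature columns
    response.column = "y",
    family = "Binomial",
    lambda1 = 0.02,            # regularization strength
    alpha = 0.5,               # Elasticnet mix of L1 and L2
    learning.rate = "DYNAMIC", # assumed value for the dynamic schedule
    initial.eta = 0.05,
    momentum = 0.9,            # momentum-based learning-rate optimization
    nesterov = TRUE)
```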
Notes:
An order column can optionally be applied to guarantee that the result of each run is deterministic. Nondeterministic results within a partition can occur when the "batch.size" argument is less than the number of rows in the partition. However, adding order columns can affect performance.
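For example, the generic order-column argument can be supplied on the input to make runs reproducible (a sketch, assuming "sn" is a unique row identifier and the other names are illustrative):

```r
# Ordering the rows within each partition makes mini-batch results
# reproducible when "batch.size" is smaller than the partition size.
fit <- td_glm_per_segment_sqle(
    data = train_data,                      # hypothetical tbl_teradata
    data.partition.column = "partition_id", # hypothetical partition key
    data.order.column = "sn",               # assumed unique row id
    input.columns = c("x1", "x2"),
    response.column = "y",
    batch.size = 10)
```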
A model generated by td_glm_per_segment_sqle() and a model generated by td_glm_sqle() should be the same when the "batch.size" argument is not less than the number of rows in the corresponding partition of the model.
td_glm_per_segment_sqle() takes all features as numeric input. Categorical columns must be converted to numeric columns as a preprocessing step, for example using td_one_hot_encoding_fit_sqle()/td_one_hot_encoding_transform_sqle(), td_ordinal_encoding_fit_sqle()/td_ordinal_encoding_transform_sqle(), or td_target_encoding_fit_sqle()/td_target_encoding_transform_sqle().
Any observation with a missing value in an input column is ignored and not used for training. You can use an imputation function, such as td_simple_impute_fit_sqle()/td_simple_impute_transform_sqle(), to impute missing values.
Best practice is to standardize the dataset before using td_glm_per_segment_sqle(). Standardization, also known as feature scaling, produces a better model and converges more quickly.
The model evaluation metrics MSE, Loglikelihood, AIC, and BIC are generated by td_glm_per_segment_sqle(). For additional regression and classification metrics, use the td_regression_evaluator_sqle(), td_classification_evaluator_sqle(), and td_roc_sqle() functions as a post-processing step.
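As a sketch of such a post-processing step (the evaluator's argument and metric names below are assumptions, not taken from this page; check the td_regression_evaluator_sqle() documentation):

```r
# Hypothetical sketch: given a scored tbl_teradata with observed and
# predicted values, compute additional regression metrics.
metrics <- td_regression_evaluator_sqle(
    data = predictions,               # hypothetical scored tbl_teradata
    observation.column = "price",     # assumed argument name
    prediction.column = "prediction", # assumed argument name
    metrics = c("RMSE", "R2"))        # assumed metric names
print(metrics$result)
```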
td_glm_per_segment_sqle() supports binary classification only. For classification, "response.column" accepts the values 0 and 1 for the two classes in the response column.
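For imbalanced binary data, the "class.weights" argument (described in the Arguments section) can up-weight the minority class; a sketch with illustrative names and weights:

```r
# Give class 1 five times the weight of class 0, e.g. when class 1
# is rare in the training data.
fit <- td_glm_per_segment_sqle(
    data = train_data,                      # hypothetical tbl_teradata
    data.partition.column = "partition_id", # hypothetical partition key
    input.columns = c("x1", "x2"),
    response.column = "y",
    family = "Binomial",
    class.weights = "0:1.0, 1:5.0")
```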
A maximum of 2046 features is supported due to the limit on the maximum number of columns (2048) in an input table.
"batch.size" and "learning.rate" are directly related. With a larger "batch.size", "learning.rate" can be increased for faster training with fewer iterations.
"iter.max" and "iter.num.no.change" are the criteria used to stop learning. To force the function to run through all iterations, disable "iter.num.no.change" by setting it to 0.
You may need to try different combinations to find the best values for a particular use case.
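A sketch of disabling the early-stopping check (names other than the documented arguments are illustrative):

```r
# iter.num.no.change = 0 disables the no-improvement stopping criterion,
# so training runs for the full "iter.max" iterations.
fit <- td_glm_per_segment_sqle(
    data = train_data,                      # hypothetical tbl_teradata
    data.partition.column = "partition_id", # hypothetical partition key
    input.columns = c("x1", "x2"),
    response.column = "y",
    iter.max = 500,
    iter.num.no.change = 0)
```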
When an unsupported data type is passed in "input.columns" or "response.column", the error message has the following format: "Unsupported data type for column index n in argument InputColumns." In the message, n refers to the index of the column in the input to the function comprising only "input.columns" and "response.column". The remaining input columns are not needed by the function, and the internal optimizer does not expose them to it, so n can differ from the column's actual index in the input data.
Usage
td_glm_per_segment_sqle (
formula = NULL,
data = NULL,
input.columns = NULL,
response.column = NULL,
attribute.data = NULL,
parameter.data = NULL,
family = "GAUSSIAN",
iter.max = 300,
batch.size = 10,
lambda1 = 0.02,
alpha = 0.15,
iter.num.no.change = 50,
tolerance = 0.001,
intercept = TRUE,
class.weights = "0:1.0, 1:1.0",
learning.rate = NULL,
initial.eta = 0.05,
decay.rate = 0.25,
decay.steps = 5,
momentum = 0.0,
nesterov = TRUE,
iteration.mode = "BATCH",
partition.column = NULL,
...
)
Arguments
formula |
Required Argument when "input.columns" and "response.column" are not
provided, optional otherwise.
Types: character |
data |
Required Argument. |
input.columns |
Required argument when "response.column" is provided and "formula" is not
provided, optional otherwise. |
response.column |
Required argument when "input.columns" is provided and "formula" is not
provided, optional otherwise. |
attribute.data |
Optional Argument. |
parameter.data |
Optional Argument. |
family |
Optional Argument.
Default Value: GAUSSIAN |
iter.max |
Optional Argument.
Default Value: 300 |
batch.size |
Optional Argument.
Default Value: 10 |
lambda1 |
Optional Argument.
Default Value: 0.02 |
alpha |
Optional Argument.
Default Value: 0.15 |
iter.num.no.change |
Optional Argument.
Default Value: 50 |
tolerance |
Optional Argument.
Default Value: 0.001 |
intercept |
Optional Argument. |
class.weights |
Optional Argument.
Default Value: 0:1.0, 1:1.0 |
learning.rate |
Optional Argument.
Default Value: NULL
Types: character |
initial.eta |
Optional Argument.
Default Value: 0.05 |
decay.rate |
Optional Argument.
Default Value: 0.25 |
decay.steps |
Optional Argument.
Default Value: 5 |
momentum |
Optional Argument.
Default Value: 0.0 |
nesterov |
Optional Argument. |
iteration.mode |
Optional Argument.
Default Value: BATCH |
partition.column |
Optional Argument. |
... |
Specifies the generic keyword arguments SQLE functions accept, such as
"volatile". The function also allows you to partition, hash, order, or local order the input data; these generic arguments are available for each argument that accepts a tbl_teradata as input, for example, "data.partition.column" and "data.order.column". |
Value
Function returns an object of class "td_glm_per_segment_sqle"
which is a named list containing object of class "tbl_teradata".
Named list member(s) can be referenced directly with the "$" operator
using the name(s): result
Examples
# Get the current context/connection.
con <- td_get_context()$connection
# Load the example data.
loadExampleData("decisionforestpredict_example", "housing_train")
loadExampleData("tdplyr_example", "housing_train_attribute",
"housing_train_parameter")
# Create tbl_teradata object.
housing_train <- tbl(con, "housing_train")
housing_train_attribute <- tbl(con, "housing_train_attribute")
housing_train_parameter <- tbl(con, "housing_train_parameter")
# Check the list of available analytic functions.
display_analytic_functions()
# Filter the rows from the train dataset with homestyle as Classic or Eclectic.
binomial_housing_train <- housing_train %>%
                          dplyr::filter(homestyle == "Classic" | homestyle == "Eclectic")
# td_glm_per_segment_sqle() function requires features in numeric
# format for processing, so drop the non-numeric columns.
drop_cols <- c("driveway", "recroom", "gashw", "airco", "prefarea",
               "fullbase")
binomial_housing_train <- binomial_housing_train %>%
                          dplyr::select(-dplyr::all_of(drop_cols))
gaussian_housing_train <- binomial_housing_train
# Transform the train dataset categorical values to encoded values.
housing_train_ordinal_encodingfit <- td_ordinal_encoding_fit_sqle(
target.column='homestyle',
data=binomial_housing_train)
res <- td_ordinal_encoding_transform_sqle(data=binomial_housing_train,
object=housing_train_ordinal_encodingfit$result,
accumulate=c("sn", "price",
"lotsize", "bedrooms",
"bathrms", "stories"))
# Example 1: Train the model using the 'Gaussian' family.
GLMPerSegment_out_1 <- td_glm_per_segment_sqle(data=gaussian_housing_train,
data.partition.column="stories",
input.columns=c('garagepl',
'lotsize',
'bedrooms',
'bathrms'),
response.column="price",
family="Gaussian",
iter.max=1000,
batch.size=9)
# Print the result.
print(GLMPerSegment_out_1$result)
# Example 2: Train the model using the 'Binomial' family, formula argument
# and subset of features and parameters to be used with respect
# to "partition_id".
formula <- homestyle ~ price + lotsize + bedrooms + bathrms
GLMPerSegment_out_2 <- td_glm_per_segment_sqle(
data=res$result,
data.partition.column="stories",
formula = formula,
attribute.data=housing_train_attribute,
attribute.data.partition.column="partition_id",
parameter.data=housing_train_parameter,
parameter.data.partition.column="partition_id",
family="Binomial",
iter.max=100)
# Print the result.
print(GLMPerSegment_out_2$result)