GLM
Description
The generalized linear model (GLM) function td_glm_sqle() performs regression and
classification analysis on data sets where the response follows an exponential family
distribution. It supports the following models:
Regression (GAUSSIAN family): The loss function is squared error.
Binary Classification (BINOMIAL family): The loss function is the logistic loss, which implements logistic regression.
The response values must be 0 or 1.
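The two loss functions named above can be illustrated in base R. This is a minimal sketch for intuition only: the helper names are hypothetical, and the exact scaling the function applies internally (for example mean versus sum) is an assumption not stated in this page.

```r
# Squared-error loss for the GAUSSIAN family (averaged over rows; an assumption).
squared_error_loss <- function(y, y_hat) {
  mean((y - y_hat)^2)
}

# Logistic (log) loss for the BINOMIAL family; y must be 0 or 1.
logistic_loss <- function(y, eta) {
  # eta is the linear predictor; p is the predicted probability of class 1.
  p <- 1 / (1 + exp(-eta))
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# Regression: only the third prediction is off, by 2.
squared_error_loss(c(1.0, 2.0, 3.0), c(1.0, 2.0, 5.0))

# Classification: a linear predictor of 0 gives p = 0.5 for every row.
logistic_loss(c(0, 1), c(0, 0))
```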
The function uses the Minibatch Stochastic Gradient Descent (SGD) algorithm, which is highly
scalable for large datasets. The algorithm estimates the gradient of the loss in minibatches,
whose size is defined by the "batch.size" argument, and updates the model using the learning
rate specified through the "learning.rate" argument.
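The minibatch update described above can be sketched in base R for the squared-error case. This is an illustration on simulated data, not the in-database implementation: the data, the constant learning rate, and the number of passes are all assumptions chosen for the example.

```r
set.seed(42)
n <- 200
x <- cbind(1, rnorm(n))                    # intercept column plus one standardized feature
y <- 2 + 3 * x[, 2] + rnorm(n, sd = 0.1)   # simulated response

batch_size <- 10                           # plays the role of "batch.size"
eta <- 0.05                                # constant learning rate (assumption)
w <- c(0, 0)                               # initial weights

for (epoch in 1:50) {
  idx <- sample(n)                         # shuffle the rows on every pass
  for (start in seq(1, n, by = batch_size)) {
    rows <- idx[start:min(start + batch_size - 1, n)]
    xb <- x[rows, , drop = FALSE]
    resid <- as.vector(xb %*% w) - y[rows]
    # Gradient of the mean squared error, estimated on this minibatch only.
    grad <- 2 * as.vector(t(xb) %*% resid) / length(rows)
    w <- w - eta * grad
  }
}
w                                          # close to the true coefficients (2, 3)
```

Smaller batches give noisier gradient estimates but more updates per pass over the data; larger batches do the opposite.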
The function also supports the following approaches:
L1, L2, and Elastic Net Regularization for shrinking model parameters.
Accelerated learning using Momentum and Nesterov approaches.
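A sketch of both ideas in base R follows. It assumes the common elastic net convention in which "lambda1" is the overall regularization strength and "alpha" is the share given to the L1 term; this page does not confirm that parameterization, and the helper names are hypothetical.

```r
# Elastic net penalty: alpha = 1 is pure L1, alpha = 0 is pure L2 (assumed convention).
elastic_net_penalty <- function(w, lambda, alpha) {
  lambda * (alpha * sum(abs(w)) + (1 - alpha) / 2 * sum(w^2))
}

# (Sub)gradient of the penalty, added to the loss gradient in each update.
elastic_net_grad <- function(w, lambda, alpha) {
  lambda * (alpha * sign(w) + (1 - alpha) * w)
}

# One momentum step. With nesterov = TRUE the gradient is evaluated at the
# look-ahead point w + momentum * v instead of at w.
momentum_step <- function(w, v, grad_fn, eta, momentum, nesterov = TRUE) {
  at <- if (nesterov) w + momentum * v else w
  v_new <- momentum * v - eta * grad_fn(at)
  list(w = w + v_new, v = v_new)
}

w <- c(1, -2, 0)
elastic_net_penalty(w, lambda = 0.02, alpha = 0.15)
elastic_net_grad(w, lambda = 0.02, alpha = 0.15)

# Two momentum steps on f(w) = w^2, whose gradient is 2 * w.
step1 <- momentum_step(w = 1, v = 0, grad_fn = function(w) 2 * w,
                       eta = 0.1, momentum = 0.9)
step2 <- momentum_step(step1$w, step1$v, grad_fn = function(w) 2 * w,
                       eta = 0.1, momentum = 0.9)
```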
The function uses a combination of "iter.num.no.change" and "tolerance" arguments
to define the convergence criterion and runs multiple iterations (up to the specified
value in the "iter.max" argument) until the algorithm meets the criterion.
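One plausible reading of this stopping rule is sketched below in base R: training stops once the loss has failed to improve by more than "tolerance" for "iter.num.no.change" consecutive iterations, or when "iter.max" is reached. The exact rule used by the function is not spelled out here, so treat this as an assumption; run_until_converged and the loss trace are hypothetical.

```r
run_until_converged <- function(losses, iter_max = 300,
                                iter_num_no_change = 50, tolerance = 0.001) {
  best <- Inf
  stalled <- 0
  n_iter <- min(length(losses), iter_max)
  for (i in seq_len(n_iter)) {
    if (losses[i] < best - tolerance) {
      stalled <- 0                 # meaningful improvement: reset the counter
    } else {
      stalled <- stalled + 1       # no improvement beyond the tolerance
    }
    best <- min(best, losses[i])
    if (stalled >= iter_num_no_change) {
      return(list(iterations = i, converged = TRUE))
    }
  }
  list(iterations = n_iter, converged = FALSE)
}

# Hypothetical loss trace: improves for 20 iterations, then goes flat.
losses <- c(seq(1, 0.05, length.out = 20), rep(0.05, 280))
flat_run <- run_until_converged(losses, iter_num_no_change = 5)

# A trace that keeps improving never triggers the criterion.
improving_run <- run_until_converged(seq(1, 0.1, length.out = 10))
```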
The function also supports LocalSGD, a variant of SGD that runs "local.sgd.iterations"
batch iterations locally on each AMP, followed by a global iteration.
The weights from all mappers are aggregated in a reduce phase and are used to compute
the gradient and loss in the next iteration. LocalSGD lowers communication costs and
can result in faster learning and convergence in fewer iterations, especially on
large clusters with many features.
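The LocalSGD pattern can be simulated in base R by treating data partitions as stand-ins for AMPs. This is a sketch under stated assumptions: the aggregation is taken to be a plain average of the worker weights, which the text above does not specify, and the data and settings are invented for the example.

```r
set.seed(1)
n <- 400
x <- cbind(1, rnorm(n))
y <- 1 + 2 * x[, 2] + rnorm(n, sd = 0.1)

n_workers <- 4                  # stand-ins for AMPs
local_iters <- 10               # plays the role of "local.sgd.iterations"
batch_size <- 10
eta <- 0.05
# Assign rows to workers round-robin.
parts <- split(seq_len(n), rep(seq_len(n_workers), length.out = n))

# Run several minibatch steps on one worker's rows, starting from shared weights.
local_sgd <- function(w, rows) {
  for (k in seq_len(local_iters)) {
    b <- sample(rows, batch_size)
    xb <- x[b, , drop = FALSE]
    grad <- 2 * as.vector(t(xb) %*% (as.vector(xb %*% w) - y[b])) / batch_size
    w <- w - eta * grad
  }
  w
}

w <- c(0, 0)
for (global_iter in 1:30) {
  local_w <- sapply(parts, local_sgd, w = w)   # one column of weights per worker
  w <- rowMeans(local_w)                       # aggregate (averaging is an assumption)
}
w                                              # close to the true coefficients (1, 2)
```

Workers exchange weights only once per global iteration rather than after every batch, which is where the communication saving comes from.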
Because learning is gradient-based, the function is highly sensitive to feature scaling.
Before using the features in the function, you must standardize the input features
using the td_scale_fit_sqle() and td_scale_transform_sqle() functions.
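What standardization does can be shown locally with base R. The sketch below only mirrors the fit-then-transform idea on a small in-memory data frame; it does not call the in-database scaling functions, the "income" column is hypothetical, and the use of the sample standard deviation is an assumption.

```r
features <- data.frame(gpa = c(3.2, 3.9, 2.8, 3.5),
                       income = c(42000, 88000, 31000, 56000))

# "Fit": learn the per-column statistics once.
centers <- colMeans(features)
scales <- apply(features, 2, sd)

# "Transform": apply the same statistics to any data with these columns.
standardized <- scale(features, center = centers, scale = scales)

colMeans(standardized)        # approximately 0 for every column
apply(standardized, 2, sd)    # 1 for every column
```

Without this step, a column measured in tens of thousands would dominate the gradient and force a much smaller learning rate.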
The function only accepts numeric features. Therefore, before training, you must convert
the categorical features to numeric values.
The function skips the rows with missing (null) values during training.
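Both requirements can be pictured with a small base R example. This is a local illustration with made-up rows, not what runs in the database: complete.cases() drops rows with missing values, and model.matrix() turns a categorical column into 0/1 indicator columns.

```r
applicants <- data.frame(
  gpa = c(3.5, NA, 3.9, 2.8),
  masters = factor(c("yes", "no", "no", "yes")),
  admitted = c(1, 0, 1, 0)
)

# Keep only rows without missing values, mirroring what training skips.
complete_rows <- applicants[complete.cases(applicants), ]

# One-hot encode the categorical column; "- 1" drops the intercept
# so that every level gets its own indicator column.
masters_dummies <- model.matrix(~ masters - 1, data = complete_rows)

numeric_input <- cbind(complete_rows[, c("admitted", "gpa")], masters_dummies)
numeric_input
```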
The function output is a trained td_glm_sqle model that can be used as input to the
td_tdglm_predict_sqle() function. The model also contains the model statistics MSE,
Loglikelihood, AIC, and BIC.
You can use the td_regression_evaluator_sqle(), td_classification_evaluator_sqle(),
and td_roc_sqle() functions to perform model evaluation as a post-processing step.
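For reference, the statistics named above have standard textbook definitions, sketched here in base R. How the function counts parameters or scales the log-likelihood is not stated in this page, so the helper below (model_stats, a hypothetical name) is only an illustration on invented values.

```r
model_stats <- function(y, y_hat, log_lik, n_params) {
  n <- length(y)
  list(
    MSE = mean((y - y_hat)^2),
    AIC = 2 * n_params - 2 * log_lik,          # Akaike information criterion
    BIC = n_params * log(n) - 2 * log_lik      # Bayesian information criterion
  )
}

# Binomial log-likelihood for hypothetical 0/1 labels and predicted probabilities.
y <- c(1, 0, 1, 1)
p <- c(0.8, 0.3, 0.6, 0.9)
log_lik <- sum(y * log(p) + (1 - y) * log(1 - p))

stats <- model_stats(y, p, log_lik, n_params = 2)
stats
```

Lower AIC and BIC indicate a better trade-off between fit and model size; BIC penalizes extra parameters more heavily as the number of rows grows.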
Usage
td_glm_sqle(
  formula = NULL,
  data = NULL,
  input.columns = NULL,
  response.column = NULL,
  family = "GAUSSIAN",
  iter.max = 300,
  batch.size = 10,
  lambda1 = 0.02,
  alpha = 0.15,
  iter.num.no.change = 50,
  tolerance = 0.001,
  intercept = TRUE,
  class.weights = "0:1.0, 1:1.0",
  learning.rate = NULL,
  initial.eta = 0.05,
  decay.rate = 0.25,
  decay.steps = 5,
  momentum = 0.0,
  nesterov = TRUE,
  local.sgd.iterations = 0,
  ...
)
Arguments
formula |
Required Argument when "input.columns" and "response.column" are not provided, optional otherwise.
Types: character |
data |
Required Argument. |
input.columns |
Required Argument when "formula" is not provided, optional otherwise.
Types: character OR vector of Strings (character) |
response.column |
Required Argument when "formula" is not provided, optional otherwise.
Types: character |
family |
Optional Argument. |
iter.max |
Optional Argument.
Default Value: 300 |
batch.size |
Optional Argument.
Default Value: 10 |
lambda1 |
Optional Argument.
Default Value: 0.02 |
alpha |
Optional Argument.
Default Value: 0.15
Types: float OR integer |
iter.num.no.change |
Optional Argument.
Default Value: 50 |
tolerance |
Optional Argument.
Default Value: 0.001 |
intercept |
Optional Argument. |
class.weights |
Optional Argument.
Default Value: "0:1.0, 1:1.0" |
learning.rate |
Optional Argument.
Types: character |
initial.eta |
Optional Argument. |
decay.rate |
Optional Argument.
Default Value: 0.25 |
decay.steps |
Optional Argument. |
momentum |
Optional Argument.
Default Value: 0.0 |
nesterov |
Optional Argument.
Default Value: TRUE |
local.sgd.iterations |
Optional Argument.
Default Value: 0 |
... |
Specifies the generic keyword arguments that SQLE functions accept, including "volatile". The function also lets the user partition, hash, order, or local order the input data. These generic arguments are available for each argument that accepts tbl_teradata as input. |
Value
The function returns an object of class "td_glm_sqle",
which is a named list containing objects of class "tbl_teradata".
Named list members can be referenced directly with the "$" operator
using the names:
result
output.data
Examples
# Get the current context/connection.
con <- td_get_context()$connection
# Load the example data.
loadExampleData("glm_example", "admissions_train")
# Create tbl_teradata object.
admissions_train <- tbl(con, "admissions_train")
# Check the list of available analytic functions.
display_analytic_functions()
# td_glm_sqle() function requires features in numeric format for processing,
# so first let's transform categorical columns to numerical columns
# using VAL td_transform_valib() function.
# Set VAL install location.
options(val.install.location = "VAL")
# Define encoders for categorical columns.
masters_code <- tdOneHotEncoder(values = c("yes", "no"),
                                column = "masters",
                                out.column = "masters")
stats_code <- tdOneHotEncoder(values = c("Advanced", "Novice"),
                              column = "stats",
                              out.column = "stats")
programming_code <- tdOneHotEncoder(values = c("Advanced",
                                               "Novice",
                                               "Beginner"),
                                    column = "programming",
                                    out.column = "programming")
# Retain numerical columns.
retain <- tdRetain(columns = c("admitted", "gpa"))
# Transform categorical columns to numeric columns.
glm_numeric_input <- td_transform_valib(data = admissions_train,
                                        one.hot.encode = c(masters_code,
                                                           stats_code,
                                                           programming_code),
                                        retain = retain)
# Example 1: Generate a generalized linear model (GLM) using
#            the input tbl_teradata and the provided formula.
GLM_out_1 <- td_glm_sqle(
  formula = admitted ~ gpa + yes_masters +
            no_masters + Advanced_stats + Novice_stats +
            Advanced_programming + Novice_programming +
            Beginner_programming,
  data = glm_numeric_input$result,
  learning.rate = 'INVTIME',
  momentum = 0.0
)
# Print the result.
print(GLM_out_1$result)
print(GLM_out_1$output.data)
# Example 2: Generate a generalized linear model (GLM) using
#            the input tbl_teradata with input.columns and response.column
#            instead of a formula.
GLM_out_2 <- td_glm_sqle(
  input.columns = c("gpa", "yes_masters", "no_masters",
                    "Advanced_stats", "Novice_stats",
                    "Advanced_programming",
                    "Novice_programming",
                    "Beginner_programming"),
  response.column = "admitted",
  data = glm_numeric_input$result,
  learning.rate = 'INVTIME',
  momentum = 0.0
)
# Print the result.
print(GLM_out_2$result)
print(GLM_out_2$output.data)