The TD_GLM function fits a generalized linear model (GLM), performing regression and classification analysis on data sets where the response follows an exponential family distribution. It supports the following models:
- Regression (Gaussian family): The loss function is squared error.
- Binary Classification (Binomial family): The loss function is the logistic loss, which implements logistic regression. The response values are 0 or 1. (Both loss functions are illustrated in the sketch after this list.)
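The following Python snippet is only an illustration of the two loss functions, not TD_GLM's internal code; the responses and predictions are made up for the example.

```python
import numpy as np

# Made-up responses and predictions, only to illustrate the two loss functions.
y_continuous = np.array([2.5, 0.3, 1.8])      # Gaussian (regression) responses
pred_continuous = np.array([2.2, 0.5, 1.4])   # model predictions

y_binary = np.array([1, 0, 1])                # Binomial (classification) responses, 0 or 1
pred_prob = np.array([0.8, 0.3, 0.6])         # predicted probabilities of class 1

# Regression (Gaussian family): squared-error loss.
squared_error = np.mean((y_continuous - pred_continuous) ** 2)

# Binary classification (Binomial family): logistic (log) loss.
logistic_loss = -np.mean(y_binary * np.log(pred_prob)
                         + (1 - y_binary) * np.log(1 - pred_prob))

print(squared_error, logistic_loss)
```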
GLMs are a flexible class of statistical models that extend the linear regression framework to accommodate a wide range of response variables, including binary, count, and continuous data. GLMs assume the response variable has a probability distribution from an exponential family of distributions, which includes commonly used distributions such as the normal, binomial, and Poisson distributions.
GLMs consist of the following key components:
- Linear predictor: A linear combination of the predictor variables and their coefficients, similar to linear regression. With predictor variables X and coefficients β, the linear predictor is η = Xβ.
- Link function: Relates the linear predictor to the mean of the response variable, allowing for non-linear relationships between the predictors and the response. The link function g satisfies g(μ) = η.
- Probability distribution: Describes the variability of the response variable and is chosen based on the nature of the data. The variance is calculated as Var(Y) = φV(μ), where φ is a scale parameter and V(μ) is the variance function. (See the sketch after this list.)
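As a minimal sketch of how these components fit together, the numpy snippet below uses a made-up design matrix and coefficients with the identity link of the Gaussian family; it is conceptual only.

```python
import numpy as np

# Tiny made-up design matrix (with an intercept column) and coefficients.
X = np.array([[1.0, 0.5, 1.2],
              [1.0, 1.5, 0.3],
              [1.0, 2.0, 2.1]])
beta = np.array([0.2, 0.8, -0.4])

# Linear predictor: eta = X @ beta.
eta = X @ beta

# Link function: g(mu) = eta. For the identity link (Gaussian family),
# the mean of the response is simply mu = eta.
mu = eta

# Probability distribution: Var(Y) = phi * V(mu). For the Gaussian family,
# V(mu) = 1 and phi is the error variance, assumed here to be 1.0.
phi = 1.0
var_y = phi * np.ones_like(mu)

print(eta, mu, var_y)
```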
GLMs are fitted using maximum likelihood estimation, which involves finding the parameter values that maximize the likelihood of observing the data given the model. Model fit can be assessed using various goodness-of-fit measures, such as deviance or Pearson chi-squared statistics.
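Assuming the conventional definitions of these goodness-of-fit measures, the sketch below computes the deviance and Pearson chi-squared statistic for a hypothetical binomial (0/1) fit with made-up fitted probabilities.

```python
import numpy as np

# Made-up observed 0/1 responses and fitted probabilities from a binomial fit.
y = np.array([1, 0, 1, 1, 0], dtype=float)
mu = np.array([0.8, 0.3, 0.6, 0.9, 0.2])

# Log-likelihood of the fitted model; for 0/1 responses the saturated model
# (mu_i = y_i) has log-likelihood 0.
loglik_fitted = np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

# Deviance: twice the gap between the saturated and fitted log-likelihoods.
deviance = 2 * (0.0 - loglik_fitted)

# Pearson chi-squared: squared residuals scaled by the binomial variance
# function V(mu) = mu * (1 - mu).
pearson_chi2 = np.sum((y - mu) ** 2 / (mu * (1 - mu)))

print(deviance, pearson_chi2)
```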
By specifying the appropriate link and variance functions, GLMs can be used to model a wide range of response variables. For example, the logistic regression model for binary data has the following components:
- Probability distribution: Bernoulli distribution
- Linear predictor: η = Xβ
- Link function: logit (g(μ) = logit(μ) = log(μ/(1-μ)))
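A small sketch of the logit link and its inverse (the sigmoid), using made-up linear-predictor values:

```python
import numpy as np

# Made-up linear predictor values eta = X @ beta for three observations.
eta = np.array([-1.2, 0.4, 2.0])

# Logit link: g(mu) = log(mu / (1 - mu)) = eta, so the mean (probability of
# class 1) is recovered with the inverse link, the sigmoid function.
mu = 1.0 / (1.0 + np.exp(-eta))

# Applying the link to mu returns the linear predictor again.
assert np.allclose(np.log(mu / (1 - mu)), eta)

print(mu)
```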
Similarly, the Poisson regression model for count data has the following components:
- Probability distribution: Poisson distribution
- Linear predictor: η = Xβ
- Link function: log (g(μ) = log(μ))
- Variance function: Var(Y) = μ
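The corresponding sketch for the log link and Poisson variance function, again with made-up linear-predictor values:

```python
import numpy as np

# Made-up linear predictor values eta = X @ beta for three observations.
eta = np.array([0.1, 1.0, 2.3])

# Log link: g(mu) = log(mu) = eta, so the mean count is mu = exp(eta).
mu = np.exp(eta)

# Variance function for the Poisson family: Var(Y) = mu.
var_y = mu

print(mu, var_y)
```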
TD_GLM uses the Minibatch Stochastic Gradient Descent (SGD) algorithm. The algorithm estimates the gradient of the loss on minibatches whose size is set by the BatchSize argument, and updates the model with a learning rate controlled by the LearningRate argument.
The function also supports the following approaches:
- L1, L2, and Elastic Net Regularization for shrinking model parameters
- Accelerated learning using Momentum and Nesterov approaches (see the sketch after this list)
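The sketch below is a conceptual minibatch SGD loop with an elastic-net penalty and a classical momentum update for a squared-error (Gaussian) model. It is not TD_GLM's implementation; the data, hyperparameter values, and variable names are invented, chosen only to mirror the roles of the BatchSize and LearningRate arguments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up standardized training data for a Gaussian (squared-error) model.
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=1000)

batch_size = 64        # mirrors the BatchSize argument
learning_rate = 0.05   # mirrors the LearningRate argument (kept constant here)
lambda_ = 0.01         # overall regularization strength
alpha = 0.5            # elastic-net mix: 1.0 = pure L1, 0.0 = pure L2
momentum = 0.9         # momentum coefficient

beta = np.zeros(X.shape[1])
velocity = np.zeros_like(beta)

for epoch in range(20):
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]

        # Gradient of the squared-error loss on the minibatch, plus the
        # elastic-net penalty gradient (subgradient for the L1 part).
        residual = Xb @ beta - yb
        grad = Xb.T @ residual / len(idx)
        grad += lambda_ * (alpha * np.sign(beta) + (1 - alpha) * beta)

        # Classical momentum update (Nesterov would instead evaluate the
        # gradient at the look-ahead point beta + momentum * velocity).
        velocity = momentum * velocity - learning_rate * grad
        beta = beta + velocity

print(beta)
```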
TD_GLM uses a combination of the IterNumNoChange and Tolerance arguments to define the convergence criterion, and runs multiple iterations (up to the value specified in the MaxIterNum argument) until the algorithm meets the criterion. The function also supports LocalSGD, a variant of SGD that runs multiple batch iterations locally on each AMP (controlled by the LocalSGDIterations argument), followed by a global iteration. The weights from all mappers are aggregated in a reduce phase and used to compute the gradient and loss for the next iteration. LocalSGD lowers communication costs and can result in faster learning and convergence in fewer iterations, especially when there is a large cluster size and many features.
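As a hedged sketch of how such a stopping rule can behave (not TD_GLM's exact criterion), the loop below stops when the loss fails to improve by more than the tolerance for a given number of consecutive iterations, mirroring the roles of MaxIterNum, Tolerance, and IterNumNoChange; the loss values are simulated.

```python
max_iter_num = 300         # mirrors MaxIterNum
tolerance = 0.001          # mirrors Tolerance
iter_num_no_change = 50    # mirrors IterNumNoChange

# Simulated loss values, each standing in for one SGD iteration; in practice
# these would come from the training loop.
losses = [1.0 / (1 + i) for i in range(max_iter_num)]

best_loss = float("inf")
no_change = 0

for iteration, loss in enumerate(losses):
    if loss < best_loss - tolerance:
        best_loss = loss     # meaningful improvement: reset the counter
        no_change = 0
    else:
        no_change += 1       # improvement smaller than the tolerance
    if no_change >= iter_num_no_change:
        break                # convergence criterion met before MaxIterNum

print(iteration, best_loss)
```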
Because it uses gradient-based learning, TD_GLM is highly sensitive to feature scaling. Before using features in the function, you must standardize them using TD_ScaleFit and TD_ScaleTransform.
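Conceptually, this standardization is a z-score transform; the plain numpy sketch below shows the idea on a made-up matrix (in Vantage the actual transform is performed by TD_ScaleFit and TD_ScaleTransform, not by this code).

```python
import numpy as np

# Made-up raw feature matrix with columns on very different scales.
X = np.array([[1200.0, 0.5],
              [ 800.0, 1.5],
              [1500.0, 0.2]])

# Z-score standardization: subtract the column mean and divide by the
# column standard deviation, so every feature is on a comparable scale.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled)
```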
TD_GLM only accepts numeric features. Therefore, before training, you must convert categorical features to numeric values.
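One common way to convert a categorical column to numeric values is one-hot encoding; the sketch below shows the idea on a made-up column and is an illustration only, not a Vantage function.

```python
import numpy as np

# Made-up categorical column with three levels.
color = np.array(["red", "blue", "red", "green"])

# One-hot encoding: one numeric 0/1 column per category level.
levels = np.unique(color)                        # ['blue', 'green', 'red']
one_hot = (color[:, None] == levels).astype(float)

print(levels)
print(one_hot)
```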
The function output is a trained GLM model that is used as input to the TD_GLMPredict function. The output contains model statistics, including MSE, Loglikelihood, AIC, and BIC. You can use the TD_RegressionEvaluator, TD_ClassificationEvaluator, and TD_ROC functions to perform model evaluation as a post-processing step.
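Assuming the conventional definitions, AIC and BIC are computed from the maximized log-likelihood, the number of estimated parameters, and the number of rows; the sketch below shows that standard computation with made-up numbers.

```python
import numpy as np

# Standard AIC/BIC formulas; the values below are made up for illustration.
loglikelihood = -152.7   # maximized log-likelihood of the fitted model
k = 6                    # number of estimated parameters (coefficients + intercept)
n = 500                  # number of training rows

aic = 2 * k - 2 * loglikelihood
bic = k * np.log(n) - 2 * loglikelihood

print(aic, bic)
```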
The function skips rows with missing (null) values during training. You can use imputation functions, such as TD_SimpleImputeFit and TD_SimpleImputeTransform, to impute missing values before training.
The TD_GLM function can train the whole data set as one model or each data partition as a separate model. To train the whole data set, specify partition by any in the ON clause. To train each data partition separately, specify partition by key in the ON clause.
A model generated with partition-by-any can match a model generated with partition-by-key if all input data for partition-by-any resides on a single AMP and the BatchSize value is greater than or equal to the number of rows in the input.
GLMs are used for regression analysis, classification, and survival analysis. They have applications in fields such as medicine, biology, economics, and engineering.