- Regression (Gaussian family): The loss function is squared error.
- Binary Classification (Binomial family): The loss function is the logistic loss, which implements logistic regression. The response values must be 0 or 1.
GLMs are a flexible class of statistical models that extend the linear regression framework to accommodate a wide range of response variables, including binary, count, and continuous data. GLMs assume the response variable has a probability distribution from the exponential family, which includes commonly used distributions such as the normal, binomial, and Poisson distributions.
- Linear predictor: The predictor variables and their coefficients, combined as in linear regression: η = Xβ, where X contains the predictor variables and β their coefficients.
- Link function: Relates the linear predictor to the mean of the response variable, allowing for non-linear relationships between the predictors and the response: g(μ) = η for a link function g.
- Probability distribution: Describes the variability of the response variable and is chosen based on the nature of the data. The variance is Var(Y) = φV(μ), where φ is a scale parameter and V(μ) is the variance function.
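The three components can be made concrete with a minimal Python sketch for a single observation in the Bernoulli/logit case (illustrative made-up values, not Teradata code):

```python
import math

# Hypothetical observation with two predictors and known coefficients.
x = [1.0, 2.0]          # predictor values
beta = [0.4, -0.2]      # coefficients

# Linear predictor: eta = x . beta
eta = sum(xi * bi for xi, bi in zip(x, beta))

# Link function (logit): g(mu) = log(mu / (1 - mu)) = eta, so the mean mu is
# recovered with the inverse link, the logistic function.
mu = 1.0 / (1.0 + math.exp(-eta))

# Variance: Var(Y) = phi * V(mu); for Bernoulli, phi = 1 and V(mu) = mu*(1-mu).
var_y = mu * (1.0 - mu)
```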
For example, logistic regression for binary responses uses:
- Probability distribution: Bernoulli distribution
- Linear predictor: η = Xβ
- Link function: logit (g(μ) = logit(μ) = log(μ/(1-μ)))
- Variance function: Var(Y) = μ(1-μ)
Poisson regression for count data uses:
- Probability distribution: Poisson distribution
- Linear predictor: η = Xβ
- Link function: log (g(μ) = log(μ))
- Variance function: Var(Y) = μ
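The Poisson case can be sketched the same way (illustrative value, not Teradata code):

```python
import math

# Poisson family: log link and identity variance function.
eta = 1.2              # linear predictor value (made up)
mu = math.exp(eta)     # inverse of the log link, since g(mu) = log(mu) = eta
var_y = mu             # Var(Y) = mu for the Poisson family
```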
GLMs are fitted using maximum likelihood estimation, which involves finding the parameter values that maximize the likelihood of observing the data given the model. Model fit can be assessed using various goodness-of-fit measures, such as deviance or Pearson chi-squared statistics.
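As an illustration of a goodness-of-fit measure, the following sketch computes the deviance of a binomial GLM with 0/1 responses, where the deviance equals minus twice the log-likelihood (the saturated model's log-likelihood is 0 in this case). The data and fitted probabilities are made up:

```python
import math

y  = [1, 0, 1, 1]               # observed 0/1 responses (made up)
mu = [0.8, 0.3, 0.6, 0.9]       # fitted probabilities from some model

# Binomial log-likelihood: sum of y*log(mu) + (1-y)*log(1-mu).
log_lik = sum(yi * math.log(mi) + (1 - yi) * math.log(1 - mi)
              for yi, mi in zip(y, mu))

deviance = -2.0 * log_lik        # smaller deviance indicates better fit
```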
TD_GLM uses the minibatch stochastic gradient descent (SGD) algorithm. The algorithm estimates the gradient of the loss on minibatches, whose size is set by the BatchSize argument, and updates the model with the learning rate specified in the LearningRate argument.
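The minibatch SGD idea can be sketched in a few lines of Python for the logistic loss (an illustrative toy, not the TD_GLM implementation; batch_size and learning_rate mirror the roles of the BatchSize and LearningRate arguments):

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: the label is 1 exactly when the single feature is positive.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)] * 25

w, b = 0.0, 0.0
batch_size, learning_rate = 10, 0.5

for epoch in range(50):
    random.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        grad_w = grad_b = 0.0
        for x, y in batch:
            err = sigmoid(w * x + b) - y    # gradient of the logistic loss
            grad_w += err * x
            grad_b += err
        # Average the gradient over the minibatch, then take one step.
        w -= learning_rate * grad_w / len(batch)
        b -= learning_rate * grad_b / len(batch)
```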
The function also supports:
- L1, L2, and elastic net regularization for shrinking model parameters
- Accelerated learning using the Momentum and Nesterov approaches
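The momentum idea can be sketched as follows (illustrative, not the TD_GLM implementation; the momentum coefficient plays the role of a momentum setting, and the Nesterov variant would instead evaluate the gradient at the look-ahead point w + momentum * velocity):

```python
# Classical momentum update, minimizing f(w) = (w - 3)^2 with gradient
# 2 * (w - 3). A velocity term accumulates past gradients to accelerate
# progress along consistent descent directions.
w, velocity = 0.0, 0.0
momentum, learning_rate = 0.9, 0.05

for _ in range(200):
    grad = 2.0 * (w - 3.0)
    velocity = momentum * velocity - learning_rate * grad
    w += velocity
```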
TD_GLM combines the IterNumNoChange and Tolerance arguments to define the convergence criterion, and runs up to the number of iterations specified in the MaxIterNum argument until the criterion is met. The function also supports LocalSGD, a variant of SGD in which each AMP uses the LocalSGDIterations argument to run multiple batch iterations locally, followed by a global iteration. The weights from all mappers are aggregated in a reduce phase to compute the gradient and loss for the next iteration. LocalSGD lowers communication costs and can result in faster learning and convergence in fewer iterations, especially with a large cluster size and many features.
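The LocalSGD pattern of local iterations followed by a global reduce can be sketched like this (an illustrative toy, not the TD_GLM implementation; local_iterations mirrors the role of the LocalSGDIterations argument):

```python
# Each "AMP" runs several local gradient steps on its own data shard for a
# least-squares fit of y = w * x, then a reduce phase averages the weights.
shards = [
    [(-2.0, -4.0), (-1.0, -2.0)],   # shard 1: (x, y) pairs with y = 2x
    [(1.0, 2.0), (2.0, 4.0)],       # shard 2
]
w = 0.0                              # single global weight
learning_rate, local_iterations = 0.05, 5

for global_round in range(20):
    local_weights = []
    for shard in shards:
        lw = w                               # start from the global weight
        for _ in range(local_iterations):    # local batch iterations
            grad = sum(2 * (lw * x - y) * x for x, y in shard) / len(shard)
            lw -= learning_rate * grad
        local_weights.append(lw)
    w = sum(local_weights) / len(local_weights)   # reduce: average weights
```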
Because of its gradient-based learning, TD_GLM is highly sensitive to feature scaling. Before using the features in the function, you must standardize the input features using TD_ScaleFit and TD_ScaleTransform.
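The standardization itself is plain z-scoring, sketched below in Python for one feature (illustrative only; in Vantage this is done with TD_ScaleFit and TD_ScaleTransform rather than in application code). The "fit" step computes per-feature statistics, and the "transform" step applies them:

```python
import math

feature = [10.0, 20.0, 30.0, 40.0]   # made-up feature values

# "Fit": compute the mean and (population) standard deviation.
mean = sum(feature) / len(feature)
std = math.sqrt(sum((v - mean) ** 2 for v in feature) / len(feature))

# "Transform": the scaled feature has mean 0 and standard deviation 1.
scaled = [(v - mean) / std for v in feature]
```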
TD_GLM accepts only numeric features. Therefore, before training, you must convert categorical features to numeric values.
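One common conversion is one-hot encoding, sketched below (illustrative only; in Vantage, functions such as TD_OneHotEncodingFit and TD_OneHotEncodingTransform serve this purpose). Each categorical value becomes a set of 0/1 indicator columns:

```python
colors = ["red", "green", "blue", "green"]    # made-up categorical column
categories = sorted(set(colors))              # ['blue', 'green', 'red']

# One row of indicators per input value, one column per category.
encoded = [[1 if value == cat else 0 for cat in categories]
           for value in colors]
```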
If an input column has an unsupported data type, the function reports the following error:
Unsupported data type for column index n in argument InputColumns.
In the message, n refers to the column index based on an input to the function that comprises only InputColumns and ResponseColumn. The function does not need the remaining columns, so the Teradata Vantage optimizer does not project them to the function. As a result, n can differ from the actual column index in the input table.
The function skips rows with missing (null) values during training. You can use imputation functions, such as TD_SimpleImputeFit and TD_SimpleImputeTransform, to impute the missing values.
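Simple imputation typically replaces nulls with a statistic such as the column mean, sketched below (illustrative only; TD_SimpleImputeFit and TD_SimpleImputeTransform perform this on Vantage tables, and None stands in here for a SQL NULL):

```python
values = [2.0, None, 4.0, None, 6.0]   # made-up column with missing values

# "Fit": compute the mean over the observed (non-null) values.
observed = [v for v in values if v is not None]
mean = sum(observed) / len(observed)

# "Transform": fill each missing value with the mean.
imputed = [mean if v is None else v for v in values]
```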
You can use the TD_GLM function to train on the whole data set as one model or to train each data partition as a separate model. To train on the whole data set, specify PARTITION BY ANY in the ON clause. To train each data partition separately, specify PARTITION BY key in the ON clause.
A model generated with PARTITION BY ANY matches a model generated with PARTITION BY key only if all input data for PARTITION BY ANY resides on a single AMP and the BatchSize value is greater than or equal to the number of rows in the input.
GLMs are used for regression analysis, classification, and survival analysis. They have applications in fields such as medicine, biology, economics, and engineering.