Logistic Regression | Vantage Analytics Library

Vantage Analytics Library User Guide

Logistic regression is better suited than linear regression to a dependent variable that has only two possible outcomes. For example:
  • Did the customer buy the product in response to the promotion?
  • Did the customer close their account?

The possible values can be coded as 0 and 1. The expected value of the dependent variable is the probability that it is 1.

With only two possible values, the error term of a linear regression model has neither a normal distribution nor constant variance over the values of the independent variables. A linear regression model can also produce a predicted value that falls outside the required range of 0 to 1.

A logistic regression model computes a continuous probability function between 0 and 1 by applying a logit transformation function to the linear regression expression b0 + b1x1 + ... + bnxn.

The Analytics Library function logistic builds a model with a two-valued dependent variable (that is, a binary logit model). However, you need not code your dependent variable as two distinct values. You specify which value of the dependent variable represents the response; the function treats all other values as nonresponse values.

The response value can be a value other than 1, and the nonresponse values can be values other than 0. However, for ease of reading, this section represents the response value as 1 and each nonresponse value as 0.

The primary sources of information and formulas in this section are [Hosmer] and [Neter].

Logit Model

The logit transformation function is mathematically simple yet powerful, and it gives the coefficients in the model an intuitive interpretation.

The following formulas describe the logistic regression model, where π(x) is the probability that the dependent variable is 1 and g(x) is the logit transformation:


Logit model (probability equation):

  π(x) = e^g(x) / (1 + e^g(x))

Logit transformation equation:

  g(x) = ln( π(x) / (1 − π(x)) ) = b0 + b1x1 + ... + bnxn

The logit transformation g(x) is linear in its parameters (the b-coefficients), may be continuous, and has an unrestricted range. These formulas assume a binomial error distribution, with y = π(x) + ε. Solving a logistic regression model means finding the b-coefficients that best predict the dichotomous y variable from the values of the numeric x variables.
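As an informal illustration of how g(x) and π(x) relate, the following sketch uses plain Python with NumPy and made-up b-coefficients; it is not Analytics Library code:

```python
import numpy as np

# Hypothetical b-coefficients for an intercept and two x variables (illustration only)
b = np.array([-1.5, 0.8, 0.2])

def g(x):
    """Logit transformation: linear in the b-coefficients, with unrestricted range."""
    return b[0] + b[1] * x[0] + b[2] * x[1]

def pi(x):
    """Probability that the dependent variable is 1: e^g(x) / (1 + e^g(x))."""
    return np.exp(g(x)) / (1.0 + np.exp(g(x)))

print(pi([2.0, 3.0]))   # prints a probability strictly between 0 and 1
```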

Maximum Likelihood

To find the best b-coefficients for the logistic regression model, use maximum likelihood. This approach selects the b-coefficient values that maximize the likelihood of observing the sample data under the logistic model.

For linear regression, where the errors are assumed to have a normal distribution, the maximum likelihood and least-squares approaches produce mathematically equivalent results. This is not true for logistic regression, so you must use maximum likelihood directly.

For convenience, take the natural logarithm of the likelihood function, which converts the product of individual likelihoods into a sum that is easier to work with.

This is the formula for a vector B of b-coefficients with v x variables, where B'X = b0 + b1x1 + … + bvxv:


Log likelihood equation:

  L(B) = Σi [ yi B'Xi − ln(1 + e^(B'Xi)) ]
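A minimal sketch of this calculation, assuming a NumPy design matrix X whose first column is all 1s and a 0/1 response vector y (illustrative names, not Analytics Library parameters):

```python
import numpy as np

def log_likelihood(B, X, y):
    """L(B) = sum_i [ y_i * B'X_i - ln(1 + e^(B'X_i)) ] for a 0/1 response vector y."""
    eta = X @ B                                 # B'X for every observation
    return np.sum(y * eta - np.log1p(np.exp(eta)))

# Illustrative data: an intercept column of 1s plus one x variable
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(log_likelihood(np.array([-2.0, 1.0]), X, y))
```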

Derive the likelihood formulas by differentiating the preceding formula with respect to the constant term b0 and the variables bi:


""

Derived likelihood equation

Derived likelihood equation
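In the same illustrative NumPy setup used above, the derived equations correspond to the gradient of the log likelihood; every component of this gradient is zero at the maximum-likelihood solution:

```python
import numpy as np

def score(B, X, y):
    """Gradient of the log likelihood.

    Component 0 is sum_i (y_i - pi(x_i)) because the first column of X is all 1s;
    component j is sum_i x_ij * (y_i - pi(x_i)). All components are 0 at the solution.
    """
    eta = X @ B
    pi = np.exp(eta) / (1.0 + np.exp(eta))
    return X.T @ (y - pi)
```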

Computational Technique

The log likelihood formula is not linear in the unknown b-coefficient parameter values, so solving it requires nonlinear optimization techniques. Calculations cannot be based on an SSCP (sums-of-squares-and-cross-products) matrix. The logistic function uses the iteratively reweighted least squares (IRLS) technique, which is equivalent to the Gauss-Newton technique. IRLS grows in complexity approximately as the square of the number of columns.
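The following is a rough sketch of the IRLS idea for a binary logit model, written as textbook IRLS in NumPy; it is not the Analytics Library's internal implementation. Each iteration solves a weighted least-squares problem with weights π(1 − π):

```python
import numpy as np

def irls_logit(X, y, iterations=25, tol=1e-8):
    """Fit b-coefficients of a binary logit model by iteratively reweighted least squares."""
    B = np.zeros(X.shape[1])
    for _ in range(iterations):
        eta = X @ B
        pi = np.exp(eta) / (1.0 + np.exp(eta))
        W = pi * (1.0 - pi)                      # per-row weights for this iteration
        z = eta + (y - pi) / W                   # working (adjusted) response
        XtW = X.T * W                            # X' diag(W)
        B_new = np.linalg.solve(XtW @ X, XtW @ z)
        if np.max(np.abs(B_new - B)) < tol:
            return B_new
        B = B_new
    return B

# Illustrative data: intercept column plus one x variable
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5], [1.0, 4.5]])
y = np.array([0.0, 1.0, 0.0, 1.0, 1.0])
print(irls_logit(X, y))
```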

The logistic function dynamically generates SQL to perform the calculations to solve the model, produce model diagnostics and success tables, and score new data with the model that it builds.

To improve performance with small data sets, Analytics Library has an optional in-memory calculation feature (which is also helpful in Stepwise Logistic Regression and Logistic Regression Step N (Stepwise-only)). This feature selects the data into system memory on the server if it fits within the specified maximum memory amount (see memorysize in Syntax).

Logistic Regression and Missing Data

Null values in the columns of a logistic regression analysis can adversely affect results, so the logistic function ignores rows that have null values in independent or dependent variables. To replace null values in a table before passing it to the logistic function, use the Null Replacement or Recode function.
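If you prepare data outside the database, the same idea looks like this hypothetical pandas sketch (the actual Null Replacement and Recode functions run in-database):

```python
import pandas as pd

df = pd.DataFrame({
    "income": [52000.0, None, 61000.0, 48000.0],
    "age":    [34.0, 41.0, None, 29.0],
    "bought": [1, 0, 1, 0],          # two-valued dependent variable
})

# Rows with nulls in any model column are the rows the logistic function would ignore
complete = df.dropna(subset=["income", "age", "bought"])

# Replacing nulls before modeling, analogous in spirit to Null Replacement
filled = df.fillna({"income": df["income"].mean(), "age": df["age"].median()})
print(complete)
print(filled)
```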