- Did the customer buy the product in response to the promotion?
- Did the customer close their account?
The possible values can be coded as 0 and 1. The expected value of the dependent variable is the probability that it is 1.
With only two possible values, the error term for a linear regression model does not have a normal distribution or constant variance over the values of the independent variables. The linear regression model can produce a value that does not fall within the necessary constraint of 0 to 1.
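A logistic regression model produces a continuous probability between 0 and 1 by relating that probability to the linear regression expression b0 + b1x1 + ... + bnxn through the logit transformation. The logit of the probability equals the linear expression, so the probability is the inverse logit (the logistic function) of that expression.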
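As a minimal sketch of this mapping, the following Python snippet evaluates the linear expression and the corresponding probability for a few inputs. The function name `logistic_probability` and the coefficient values are hypothetical and chosen only for illustration; they are not part of Analytics Library.

```python
import math

def logistic_probability(coefficients, x):
    """Return pi(x) = 1 / (1 + exp(-g(x))), where g(x) = b0 + b1*x1 + ... + bn*xn."""
    b0, *slopes = coefficients
    g = b0 + sum(b * xi for b, xi in zip(slopes, x))
    return 1.0 / (1.0 + math.exp(-g))

# Hypothetical coefficients b0 = -1.0 and b1 = 0.5 (illustration only).
coefficients = [-1.0, 0.5]
for x1 in (-10.0, 0.0, 2.0, 10.0):
    linear_value = coefficients[0] + coefficients[1] * x1  # can fall outside 0 to 1
    probability = logistic_probability(coefficients, [x1])  # always strictly between 0 and 1
    print(x1, linear_value, probability)
```

The linear value ranges from -6.0 to 4.0 over these inputs, while the probability stays between 0 and 1.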
The Analytics Library function logistic builds a model with a two-valued dependent variable (that is, a binary logit model). However, you need not code your dependent variable as two distinct values. You specify the response value of the dependent variable, and the function treats all other values as nonresponse values.
Even though values other than 1 and 0 are supported in the dependent variable, throughout this section the response value is represented as 1 and the nonresponse value as 0 for ease of reading.
The primary sources of information and formulas in this section are [Hosmer] and [Neter].
Logit Model
The logit transformation function is mathematically powerful and simple and lends an intuitive understanding to the coefficients in the model.
The following formulas describe the logistic regression model, where π(x) is the probability that the dependent variable is 1 and g(x) is the logit transformation:
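In standard notation for a binary logit model, and consistent with the definitions of π(x) and g(x) given here, the model can be written as follows. This is the usual textbook form, stated under the assumption that n is the number of x variables:

```latex
\pi(x) = \frac{e^{g(x)}}{1 + e^{g(x)}}
\qquad
g(x) = \ln\!\left(\frac{\pi(x)}{1 - \pi(x)}\right) = b_0 + b_1 x_1 + \dots + b_n x_n
```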
The logit transformation g(x) is linear in its parameters (the b-coefficients), may be continuous, and has an unrestricted range. Under this model the error has a binomial distribution, with y = π(x) + ε. Solving a logistic regression model means finding the b-coefficients that best predict the dichotomous y variable from the values of the numeric x variables.
Maximum Likelihood
To find the best b-coefficients for the logistic regression model, use maximum likelihood. This approach selects the b-coefficient values that maximize the likelihood of the observed data under the defined logistic model.
For linear regression, assuming errors have a normal probability distribution, the maximum likelihood and least-squares approaches produce mathematically equivalent results. This is not true for logistic regression, so you must use maximum likelihood directly.
For convenience, compute the natural logarithm of the likelihood function so you can convert the product of likelihoods to a sum, which is easier to use.
This is the formula for a vector B of b-coefficients with v x variables, where B'X = b0 + b1x1 + … + bvxv:
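In the usual textbook form, the log likelihood for a binary logit model can be written as follows. The symbols N (number of observations), y_j (observed 0 or 1 value for observation j), and X_j (the x values for observation j, with a leading 1 for the constant term) are notation assumed here for illustration:

```latex
\ln L(B) = \sum_{j=1}^{N} \left[ \, y_j \, B'X_j - \ln\!\left(1 + e^{B'X_j}\right) \right]
```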
Derive the likelihood equations by differentiating the preceding formula with respect to the constant term b0 and each coefficient bi, and setting each derivative to zero:
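In standard form, differentiating the log likelihood of a binary logit model and setting the result to zero gives the equations below. The observation index j, the count N, and x_{ij} (the value of variable i in observation j) are notation assumed here for illustration:

```latex
\frac{\partial \ln L(B)}{\partial b_0} = \sum_{j=1}^{N} \left[ \, y_j - \pi(X_j) \right] = 0
\qquad
\frac{\partial \ln L(B)}{\partial b_i} = \sum_{j=1}^{N} x_{ij} \left[ \, y_j - \pi(X_j) \right] = 0,
\quad i = 1, \dots, v
```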
Computational Technique
The log likelihood formula is not linear in the unknown b-coefficient parameter values, so solving it requires nonlinear optimization techniques. Calculations cannot be based on an SSCP matrix. The logistic function uses the iteratively reweighted least squares (RLS) technique, which is equivalent to the Gauss-Newton technique. The complexity of RLS grows approximately as the square of the number of columns.
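The reweighting idea can be sketched in plain Python as follows. This is not the SQL that the logistic function generates. It is a minimal illustration written in the Newton-step form of iteratively reweighted least squares, and the names `solve` and `fit_logistic_irls` and the sample data are hypothetical.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_logistic_irls(rows, y, iterations=25, tol=1e-10):
    """Estimate b-coefficients (constant term first) by iteratively reweighted least squares."""
    X = [[1.0] + list(r) for r in rows]  # prepend 1 for the constant term b0
    k = len(X[0])
    b = [0.0] * k
    for _ in range(iterations):
        # Current probabilities pi(x) and weights pi(x) * (1 - pi(x)) for each row.
        p = [1.0 / (1.0 + math.exp(-sum(bj * xj for bj, xj in zip(b, xi)))) for xi in X]
        w = [pi * (1.0 - pi) for pi in p]
        # Weighted cross-product matrix X'WX (k by k) and score vector X'(y - p).
        xtwx = [[sum(wi * xi[r] * xi[c] for wi, xi in zip(w, X)) for c in range(k)]
                for r in range(k)]
        score = [sum(xi[r] * (yi - pi) for xi, yi, pi in zip(X, y, p)) for r in range(k)]
        step = solve(xtwx, score)
        b = [bj + sj for bj, sj in zip(b, step)]
        if max(abs(s) for s in step) < tol:
            break
    return b

# Made-up sample data: one x variable and a 0/1 response that is not perfectly separable.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
coefficients = fit_logistic_irls([[x] for x in xs], ys)
print(coefficients)
```

Each iteration rebuilds a k by k weighted cross-product matrix, where k is the number of columns plus the constant term, which illustrates why the cost grows roughly with the square of the number of columns.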
The logistic function dynamically generates SQL to perform the calculations to solve the model, produce model diagnostics and success tables, and score new data with the model that it builds.
To improve performance with small data sets, Analytics Library has an optional in-memory calculation feature (which is also helpful in Stepwise Logistic Regression and Logistic Regression Step N (Stepwise-only)). This feature selects the data into server system memory if it fits into the specified maximum memory amount (see memorysize in Syntax).
Logistic Regression and Missing Data
Null values for columns in a logistic regression analysis can adversely affect results, so the logistic function ignores rows that have null values for independent or dependent variables. To replace null values in a table before inputting it to the logistic function, use the function Null Replacement or Recode.
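The effect of ignoring rows with null values can be illustrated with a short Python sketch. This is not the library's implementation and does not use the Null Replacement or Recode functions. The helper `drop_rows_with_nulls`, the column names, and the sample rows are hypothetical, with `None` standing in for a SQL null.

```python
def drop_rows_with_nulls(rows, columns):
    """Keep only rows in which every listed column is non-null (None stands in for NULL)."""
    return [row for row in rows if all(row.get(col) is not None for col in columns)]

# Made-up sample rows: two independent variables and one dependent variable.
rows = [
    {"age": 34, "income": 52000, "closed": 0},
    {"age": None, "income": 61000, "closed": 1},   # null independent variable
    {"age": 45, "income": 48000, "closed": None},  # null dependent variable
    {"age": 29, "income": 39000, "closed": 1},
]
usable = drop_rows_with_nulls(rows, ["age", "income", "closed"])
print(len(usable), "of", len(rows), "rows would be used")
```

Only rows that are complete in every analyzed column remain, so a table with many nulls can lose a large share of its rows unless the nulls are replaced first.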