Model Diagnostics | Logistic Regression | Vantage Analytics Library - Model Diagnostics

Logistic regression has counterparts to many model diagnostics available with linear regression. These diagnostics provide a mathematically sound way to evaluate a model built with logistic regression.

Standard Errors and Statistics

For each b-coefficient, the logistic function computes the standard error, T-statistic (or Wald statistic), and t-distribution probability value.

The T-statistic is the ratio of a b-coefficient value to its standard error. You can use the T-statistic and t-distribution probability value to assess the statistical significance of this b-coefficient in the model.

To compute the standard errors of the b-coefficients, the function uses an information matrix (or Hessian matrix), a matrix of second-order partial derivatives of the log likelihood function with respect to all possible pairs of the coefficient values.

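This is the formula for information matrix element A_{j,k}, where x_{ij} is the value of variable j in observation i, π_i is the fitted probability that the response is 1 for observation i, and n is the number of observations:

$$A_{j,k} = \sum_{i=1}^{n} x_{ij}\, x_{ik}\, \pi_i\, (1 - \pi_i)$$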
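As a rough numerical illustration of the standard errors and statistics described in this section, the sketch below builds the information matrix for a model with a constant term and one x variable, inverts it, and divides each coefficient by its standard error. It assumes the usual element form (the sum over observations of x_ij · x_ik · π_i · (1 − π_i)); the data, the coefficient values, and all function names are hypothetical and are not part of the library.

```python
import math

def sigmoid(z):
    """Logistic function: maps the linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def information_matrix(rows, b):
    """Sum x_ij * x_ik * p_i * (1 - p_i) over observations (assumed element form)."""
    k = len(b)
    a = [[0.0] * k for _ in range(k)]
    for x in rows:
        p = sigmoid(sum(bj * xj for bj, xj in zip(b, x)))
        w = p * (1.0 - p)
        for j in range(k):
            for m in range(k):
                a[j][m] += x[j] * x[m] * w
    return a

def invert_2x2(a):
    """Closed-form inverse; enough for a constant term plus one x variable."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]

# Hypothetical design rows: leading 1.0 for the constant term, then one x value.
rows = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
b = [-1.2, 0.6]  # hypothetical coefficients, not fitted by any library

a = information_matrix(rows, b)
cov = invert_2x2(a)  # inverse information matrix
std_errors = [math.sqrt(cov[j][j]) for j in range(2)]
wald_stats = [b[j] / std_errors[j] for j in range(2)]
print(std_errors, wald_stats)
```

The standard errors are the square roots of the diagonal of the inverted matrix, which is why a nearly singular information matrix (for example, from collinear x variables) produces very large standard errors.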

Odds Ratios and Confidence Intervals

The logistic function computes confidence intervals using odds ratios.

In a linear regression model, each b-coefficient represents the change in the dependent y variable value when the corresponding independent x value changes by 1. In a logistic regression model, increasing an x variable value by 1 implies a change in the odds that the outcome y variable value is 1 rather than 0.

Here is the formula for the logit response function again:

$$g(x) = \ln\left(\frac{\pi(x)}{1 - \pi(x)}\right) = b_0 + b_1 x_1 + \dots + b_n x_n$$

The response function is the log of the odds that the response is 1, where π(x) is the probability that the response is 1 and 1 – π(x) is the probability that the response is 0. If xj varies by 1, the response function varies by bj. That is:

g(x0 … xj + 1 … xn) - g(x0 … xj … xn) = bj

Equivalently:

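$$\ln\left[\frac{\pi(x_j + 1)\,/\,\bigl(1 - \pi(x_j + 1)\bigr)}{\pi(x_j)\,/\,\bigl(1 - \pi(x_j)\bigr)}\right] = b_j$$

Here π(xj + 1) is the probability that the response is 1 when xj is increased by 1 and every other x value is unchanged.

Therefore, this is the formula for the odds ratio of the coefficient bj:

$$e^{b_j} = \frac{\pi(x_j + 1)\,/\,\bigl(1 - \pi(x_j + 1)\bigr)}{\pi(x_j)\,/\,\bigl(1 - \pi(x_j)\bigr)}$$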

Exponentiating a b-coefficient gives the odds ratio: the factor by which the odds change for a unit increase in xj.
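The relationship between a coefficient and its odds ratio can be checked numerically. The sketch below uses a hypothetical one-variable model (the coefficient values and function names are illustrative assumptions) and compares the ratio of the odds at x + 1 and x with the exponential of the coefficient.

```python
import math

def probability(b0, b1, x):
    """Probability that the response is 1 under a one-variable logistic model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def odds(p):
    """Odds that the response is 1 rather than 0."""
    return p / (1.0 - p)

b0, b1 = -1.5, 0.7  # hypothetical coefficients

# Ratio of the odds after and before a unit increase in x.
ratio = odds(probability(b0, b1, 3.0)) / odds(probability(b0, b1, 2.0))
print(ratio, math.exp(b1))
```

The two printed values agree up to floating-point error, and the agreement does not depend on the starting x value, only on the size of the increase.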

Confidence intervals calculated on odds ratios for each b-coefficient are more meaningful than those calculated on the b-coefficients themselves. The confidence interval is computed based on a 95% confidence level and a two-tailed normal distribution.
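One common way to obtain such an interval is to form a normal-theory interval on the coefficient and exponentiate its endpoints. The sketch below assumes that construction; the source does not spell out the exact computation, and the coefficient, standard error, and function name are hypothetical.

```python
import math
from statistics import NormalDist

def odds_ratio_interval(b, std_error, level=0.95):
    """Return (odds ratio, lower bound, upper bound) for one coefficient.

    Assumes a two-tailed normal interval on b, exponentiated afterwards.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for 95%
    return (math.exp(b),
            math.exp(b - z * std_error),
            math.exp(b + z * std_error))

# Hypothetical coefficient and standard error.
odds_ratio, lower, upper = odds_ratio_interval(0.7, 0.25)
print(odds_ratio, lower, upper)
```

Because the endpoints are exponentiated, the interval is not symmetric around the odds ratio. An interval that excludes 1 corresponds to a coefficient interval that excludes 0.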

Logistic Regression Goodness of Fit

In linear regression, a key measure associated with goodness of fit is the residual sum of squares (RSS). The analogous measure for logistic regression is the deviance. The deviance (D) is -2 times the natural log of the ratio of the likelihood of a given model to the likelihood of a perfectly fitted, or saturated, model:

D = -2ln(ModelLH / SatModelLH)

Equivalently, in terms of the model log likelihood and the saturated model log likelihood:

D = -2LM + 2LS

Treating the data as a set of n independent Bernoulli observations, the saturated model reproduces each observed response exactly, so its likelihood is 1 and LS = 0. Therefore, D = -2LM.
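The deviance calculation can be sketched as follows. The responses and fitted probabilities are hypothetical values chosen for illustration, and the function name is an assumption rather than a library call. The saturated model is represented by using each observed response as its own fitted probability.

```python
import math

def bernoulli_log_likelihood(y, p):
    """Log likelihood of 0/1 responses y under fitted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))

y = [0, 0, 0, 1, 0, 1, 1, 1]                   # hypothetical responses
p = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]   # hypothetical fitted probabilities

lm = bernoulli_log_likelihood(y, p)                        # model log likelihood
ls = bernoulli_log_likelihood(y, [float(v) for v in y])    # saturated model
deviance = -2.0 * lm + 2.0 * ls
print(lm, ls, deviance)
```

Every term of the saturated log likelihood is the log of 1, so the second term of the deviance vanishes and only -2LM remains.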

You can compare two models by taking the difference between their deviance values:

G = D1 - D2 = -2(L1 - L2)

To evaluate the independent model terms as a whole, calculate the difference in deviance for the model with a constant term only and the model with all variables fitted:

G = -2(L0 - LM)

Calculate LM with the log likelihood formula.

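Calculate L0 with this formula, where n is the number of observations and y_i is the 0 or 1 response value of observation i:

$$L_0 = \Bigl(\sum_{i=1}^{n} y_i\Bigr) \ln\left(\frac{\sum_{i=1}^{n} y_i}{n}\right) + \Bigl(n - \sum_{i=1}^{n} y_i\Bigr) \ln\left(1 - \frac{\sum_{i=1}^{n} y_i}{n}\right)$$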

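If zero is the correct value for every x-term coefficient, G has an approximate chi-squared distribution with v-1 degrees of freedom, where v is the number of variables. Therefore, the chi-squared probability value of G is the probability of observing a G at least this large when zero is the correct value for every x-term coefficient. A small probability value is evidence that at least one x-term coefficient is nonzero.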
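The following sketch computes G and its chi-squared probability value for a model with a single x term, so one degree of freedom is assumed. The responses and fitted probabilities are hypothetical, and the one-degree-of-freedom tail probability uses the identity P(chi-squared > g) = erfc(sqrt(g / 2)) so that only the standard library is needed; other degrees of freedom would require a general chi-squared routine.

```python
import math

def null_log_likelihood(y):
    """Log likelihood of the constant-only model (L0)."""
    n = len(y)
    n1 = sum(y)  # number of observations with response 1
    return n1 * math.log(n1 / n) + (n - n1) * math.log((n - n1) / n)

def model_log_likelihood(y, p):
    """Log likelihood of the fitted model (LM) from fitted probabilities."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))

def chi2_tail_1df(g):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(g / 2.0))

y = [0, 0, 0, 1, 0, 1, 1, 1]                   # hypothetical responses
p = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]   # hypothetical fitted probabilities

l0 = null_log_likelihood(y)
lm = model_log_likelihood(y, p)
g = -2.0 * (l0 - lm)
p_value = chi2_tail_1df(g)
print(g, p_value)
```

G itself is a test statistic on the chi-squared scale; the probability statement applies to the tail probability computed from it.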

Several pseudo R-squared values have been suggested. They are not true goodness of fit measures, but they can be useful in evaluating the model. [Agresti]

The logistic function provides one such measure, suggested by McFadden:

(L0 - LM) / L0
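A minimal sketch of this measure, using hypothetical log likelihood values and an illustrative function name:

```python
def mcfadden_pseudo_r2(l0, lm):
    """McFadden's pseudo R-squared from the constant-only (l0) and
    fitted-model (lm) log likelihoods."""
    return (l0 - lm) / l0

l0 = -5.545  # hypothetical constant-only log likelihood
lm = -3.267  # hypothetical fitted-model log likelihood
pseudo_r2 = mcfadden_pseudo_r2(l0, lm)
print(pseudo_r2)
```

The expression is algebraically the same as 1 - LM / L0. It is 0 when the fitted model is no better than the constant-only model and approaches 1 as LM approaches 0.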


Vantage Analytics Library User Guide
