Stepwise Logistic Regression - Teradata Warehouse Miner

Teradata Warehouse Miner User Guide - Volume 3 Analytic Functions

Product
Teradata Warehouse Miner
Release Number
5.4.5
Published
February 2018
Language
English (United States)
Last Update
2018-05-04
dita:id
B035-2302
Product Category
Software

Automated stepwise regression procedures are available for logistic regression to aid in model selection, just as they are for linear regression. The procedures are in fact very similar to those described for linear regression, so the descriptions below highlight both the similarities and the differences.

As is the case with stepwise linear regression, the automated stepwise procedures described below can provide insight into the variables that should be included in a logistic regression model. An element of human decision-making, however, is recommended in order to produce a model with useful business application.

Forward-Only Stepwise Logistic Regression

The forward-only procedure consists solely of forward steps, as described below, starting without any independent x variables in the model. Forward steps continue until no more variables can be added to the model.
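The control flow of the forward-only procedure can be sketched in a few lines of Python. This is a minimal illustration, not the product's implementation. The callable `pick_to_add` is a hypothetical stand-in for a single forward step: it receives the variables currently in the model and returns the name of the variable to add, or `None` when no variable meets the criterion to enter.

```python
from typing import Callable, List, Optional, Sequence


def forward_only(pick_to_add: Callable[[Sequence[str]], Optional[str]]) -> List[str]:
    """Repeat forward steps until no variable can be added."""
    model: List[str] = []  # start with no independent variables in the model
    while True:
        candidate = pick_to_add(tuple(model))
        # Stop when the forward step offers nothing new.
        # (The "already in the model" guard is a defensive assumption.)
        if candidate is None or candidate in model:
            return model
        model.append(candidate)
```

The backward-only procedure has the same shape, except that it starts from the full variable list and removes one variable per step.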

Forward Stepwise Logistic Regression

The forward stepwise procedure combines forward and backward steps, always done in pairs, as described below, starting without any independent x variables in the model. One forward step is always followed by one backward step, and these single forward and backward steps are alternated until no variables can be added or removed. Additional checks are made after each step to see whether the model contains the same variables as it did after a previous step in the same direction. When this condition is detected in both the forward and backward directions, the algorithm also terminates.
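The alternation and the two termination rules can be sketched as follows. This is one possible reading of the description above, not the product's code. The callables `pick_to_add` and `pick_to_remove` are hypothetical stand-ins for a forward step and a backward step, and `max_pairs` is an added safety limit that the text does not mention.

```python
from typing import Callable, FrozenSet, Optional, Set

Step = Callable[[FrozenSet[str]], Optional[str]]


def forward_stepwise(pick_to_add: Step, pick_to_remove: Step,
                     max_pairs: int = 1000) -> FrozenSet[str]:
    """Alternate single forward and backward steps, starting from an empty model."""
    model: FrozenSet[str] = frozenset()
    seen_after_forward: Set[FrozenSet[str]] = set()
    seen_after_backward: Set[FrozenSet[str]] = set()

    for _ in range(max_pairs):  # safety limit (an assumption)
        # Forward step: add at most one variable.
        added = pick_to_add(model)
        if added is not None:
            model = model | {added}
        repeat_forward = model in seen_after_forward
        seen_after_forward.add(model)

        # Backward step: remove at most one variable.
        removed = pick_to_remove(model)
        if removed is not None:
            model = model - {removed}
        repeat_backward = model in seen_after_backward
        seen_after_backward.add(model)

        # Rule 1: nothing could be added or removed.
        if added is None and removed is None:
            break
        # Rule 2: the same variable set recurred in both directions.
        if repeat_forward and repeat_backward:
            break
    return model
```

Tracking the variable sets seen after each direction separately is what stops the loop when a variable keeps being added and then removed again.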

Backward-Only Stepwise Logistic Regression

The backward-only procedure consists solely of backward steps, as described below, starting with all of the independent x variables in the model. Backward steps continue until no more variables can be removed from the model.

Backward Stepwise Logistic Regression

The backward stepwise procedure combines backward and forward steps, always done in pairs, as described below, starting with all of the independent x variables in the model. One backward step is always followed by one forward step, and these single backward and forward steps are alternated until no variables can be added or removed. Additional checks are made after each step to see whether the model contains the same variables as it did after a previous step in the same direction. When this condition is detected in both the backward and forward directions, the algorithm also terminates.

Stepwise Logistic Regression - Forward Step

In stepwise linear regression, the partial F statistic, or the analogous T-statistic probability value, is computed separately for each variable outside the model by adding each of them to the model one at a time. The analogous procedure for logistic regression would consist of computing the likelihood ratio statistic G, defined in Logistic Regression Model, for each variable outside the model, and selecting the variable that results in the largest G value when added to the model. In logistic regression, however, this is expensive, because solving the model for each candidate variable requires another iterative maximum likelihood solution, in contrast to the more rapidly computed closed-form solution available in linear regression.
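To make the cost concrete, the sketch below shows what a selection based on G would look like. It assumes the usual definition of the likelihood ratio statistic, G = 2 × (log-likelihood with the variable − log-likelihood without it); the definition itself lives in Logistic Regression Model and is not restated here. The callable `fit_loglik` is hypothetical: it stands for a full iterative maximum likelihood fit that returns the maximized log-likelihood for a given set of variables.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Optional, Tuple


def best_by_likelihood_ratio(
    in_model: Iterable[str],
    outside: Iterable[str],
    fit_loglik: Callable[[FrozenSet[str]], float],
) -> Tuple[Optional[str], Dict[str, float]]:
    """Pick the outside variable with the largest G; needs one refit per candidate."""
    current = frozenset(in_model)
    base = fit_loglik(current)  # fit of the current model
    g_values: Dict[str, float] = {}
    for var in outside:
        # Each candidate needs its own iterative fit: this is the expensive part.
        g_values[var] = 2.0 * (fit_loglik(current | {var}) - base)
    best = max(g_values, key=g_values.get) if g_values else None
    return best, g_values
```

With k variables outside the model, one forward step of this kind requires k + 1 maximum likelihood fits.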

What is needed is a statistic that can be calculated without requiring an additional maximum likelihood solution. Teradata Warehouse Miner uses such a statistic, proposed by Peduzzi, Hardy and Holford, which they call a W statistic. This statistic is comparatively inexpensive to compute for each variable outside the model and is therefore expedient to use as a criterion for selecting a variable to add to the model. The W statistic is assumed to follow a chi-square distribution with one degree of freedom because of its similarity to other statistics, and it gives evidence of behaving similarly to the likelihood ratio statistic. Therefore, in a forward step, the variable with the smallest chi-square probability, or P-value, associated with its W statistic is added to the model if that P-value is less than the criterion to enter. If more than one variable has a P-value of 0, then the variable with the largest W statistic is entered. For more information, refer to [Peduzzi, Hardy and Holford].
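The selection rule, though not the computation of W itself, can be sketched as below. The W values are taken as given inputs, since the formula from Peduzzi, Hardy and Holford is not reproduced in this section. The function names and the default criterion of 0.05 are illustrative assumptions. The P-value uses the identity that the upper tail of a chi-square distribution with one degree of freedom at w equals erfc(sqrt(w / 2)).

```python
import math
from typing import Dict, Optional


def chi2_1df_pvalue(w: float) -> float:
    """Upper-tail probability of a chi-square(1) variable at w (w >= 0)."""
    return math.erfc(math.sqrt(w / 2.0))


def select_to_enter(w_stats: Dict[str, float],
                    criterion: float = 0.05) -> Optional[str]:
    """Return the variable to add in a forward step, or None if none qualifies."""
    if not w_stats:
        return None
    # Smallest P-value first; ties (such as several P-values that underflow
    # to 0) are broken by the largest W statistic.
    best = min(w_stats, key=lambda v: (chi2_1df_pvalue(w_stats[v]), -w_stats[v]))
    return best if chi2_1df_pvalue(w_stats[best]) < criterion else None
```

For example, `select_to_enter({"age": 3.0, "income": 7.5, "tenure": 0.4})` selects `"income"`, the only variable whose P-value falls below 0.05.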

Stepwise Logistic Regression - Backward Step

Each backward step seeks to remove those variables that have statistical significance below a certain level. This is done by first fitting the model with the currently selected variables, including the calculation of the probability, or P-value, associated with the T-statistic for each variable, where the T-statistic is the ratio of the b-coefficient to its standard error. The variable with the largest P-value is removed if that P-value is greater than the criterion to remove.
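The removal decision can be sketched as follows, taking the fitted coefficients and their standard errors as inputs. The text does not name the reference distribution used to turn the T-statistic into a P-value; this sketch assumes a two-sided standard normal test, and both that choice and the default criterion of 0.05 are assumptions rather than documented behavior. The function names are illustrative.

```python
import math
from typing import Dict, Optional


def two_sided_normal_pvalue(t: float) -> float:
    """P(|Z| > |t|) for a standard normal Z (assumed reference distribution)."""
    return math.erfc(abs(t) / math.sqrt(2.0))


def select_to_remove(coefs: Dict[str, float],
                     std_errs: Dict[str, float],
                     criterion: float = 0.05) -> Optional[str]:
    """Return the variable to drop in a backward step, or None if all stay."""
    # T-statistic for each variable: b-coefficient divided by its standard error.
    p_values = {v: two_sided_normal_pvalue(coefs[v] / std_errs[v]) for v in coefs}
    if not p_values:
        return None
    worst = max(p_values, key=p_values.get)  # least significant variable
    return worst if p_values[worst] > criterion else None
```

For example, with coefficients `{"age": 0.8, "income": 0.05}` and standard errors `{"age": 0.2, "income": 0.1}`, the T-statistics are 4.0 and 0.5, so `"income"` is the variable removed.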