Automated stepwise regression analysis is a technique to aid in regression model selection. That is, it helps in deciding which independent variables to include in a regression model. If there are only two or three independent variables under consideration, one could try all possible models. But since there are 2^k - 1 models that can be built from k variables, this quickly becomes impractical as the number of variables increases (32 variables yield more than 4 billion models!).
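The growth in the number of candidate models can be checked with a few lines of Python. This is only an illustrative sketch: the function names and the variable labels x1, x2, x3 are hypothetical, and each model is assumed to be a non-empty subset of the candidate variables.

```python
from itertools import combinations


def count_models(k: int) -> int:
    """Number of non-empty subsets of k candidate variables."""
    return 2**k - 1


def all_models(variables):
    """Yield every non-empty subset of the candidate variables."""
    for size in range(1, len(variables) + 1):
        yield from combinations(variables, size)


print(count_models(3))                       # 7
print(list(all_models(["x1", "x2", "x3"])))  # the 7 subsets themselves
print(count_models(32))                      # 4294967295
```

Enumerating the subsets is practical for three variables, but the count alone shows why it is not for 32.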
The automated stepwise procedures described below provide insight into the variables that should be included in a regression model. It is not recommended that stepwise procedures be the sole deciding factor in the makeup of a model. For one thing, these techniques are not guaranteed to produce the best results. And sometimes, variables should be included because of certain descriptive or intuitive qualities, or excluded for subjective reasons. Therefore an element of human decision-making is recommended to produce a model with useful business application.
Forward-Only Stepwise Linear Regression
The forward-only procedure consists solely of forward steps as described below, starting without any independent x variables in the model. Forward steps are continued until no variables can be added to the model.
Forward Stepwise Linear Regression
The forward stepwise procedure is a combination of the forward and backward steps described below, starting without any independent x variables in the model. One forward step is followed by one backward step, and these single forward and backward steps are alternated until no variables can be added or removed.
Backward-Only Stepwise Linear Regression
The backward-only procedure consists solely of backward steps as described below, starting with all of the independent x variables in the model. Backward steps are continued until no variables can be removed from the model.
Backward Stepwise Linear Regression
The backward stepwise procedure is a combination of the backward and forward steps as described below, starting with all of the independent x variables in the model. One backward step is followed by one forward step, and these single backward and forward steps are alternated until no variables can be added or removed.
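The four procedures differ only in the starting model and in which kinds of step are taken, so they can be sketched with one driver loop. The following is a minimal sketch under stated assumptions, not the product's implementation: run_stepwise, its parameters, and the toy step functions are all hypothetical. A step function is assumed to take the current model and return the variable to add or remove, or None when its criterion is not met. The toy scores are fixed numbers standing in for real partial F statistics, which would be recomputed at every step. The max_passes guard is also an assumption, added so that a badly chosen pair of criteria cannot alternate forever.

```python
def run_stepwise(forward_step, backward_step, start, first="forward", max_passes=100):
    """Alternate single steps until a full pass neither adds nor removes a variable."""
    model = list(start)
    order = ["forward", "backward"] if first == "forward" else ["backward", "forward"]
    for _ in range(max_passes):
        changed = False
        for kind in order:
            if kind == "forward":
                var = forward_step(model)
                if var is not None:
                    model.append(var)
                    changed = True
            else:
                var = backward_step(model)
                if var is not None:
                    model.remove(var)
                    changed = True
        if not changed:
            break
    return model


# Toy stand-ins for the real criteria (hypothetical fixed scores).
SCORES = {"x1": 9.0, "x2": 0.5, "x3": 6.0}
THRESHOLD = 4.0


def toy_forward(model):
    outside = [v for v in SCORES if v not in model]
    best = max(outside, key=SCORES.get, default=None)
    return best if best is not None and SCORES[best] > THRESHOLD else None


def toy_backward(model):
    worst = min(model, key=SCORES.get, default=None)
    return worst if worst is not None and SCORES[worst] < THRESHOLD else None


def never(model):
    """A step that never fires, used for the forward-only and backward-only cases."""
    return None


everything = ["x1", "x2", "x3"]
print(run_stepwise(toy_forward, never, []))                               # forward-only: ['x1', 'x3']
print(run_stepwise(toy_forward, toy_backward, []))                        # forward stepwise: ['x1', 'x3']
print(run_stepwise(never, toy_backward, everything, first="backward"))    # backward-only: ['x1', 'x3']
print(run_stepwise(toy_forward, toy_backward, everything, first="backward"))  # backward stepwise: ['x1', 'x3']
```

With real data the four procedures need not agree, because each partial F statistic depends on which other variables are already in the model.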
Stepwise Linear Regression - Forward Step
Each forward step seeks to add the independent variable xi that best contributes to explaining the variance in the dependent variable y. To do this, a quantity called the partial F statistic must be computed for each xi variable that can be added to the model. A quantity called the extra sums of squares is first calculated as follows:
ESS(xi) = “DRS with xi” - “DRS without xi”
where DRS is the regression sums of squares, or “due-to-regression sums of squares”, of the model with or without xi.
Then, the partial F statistic is given by:
f(xi) = ESS(xi) / meanRSS(xi)
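where meanRSS is the residual mean square, that is, the residual sums of squares divided by its degrees of freedom, for the model that includes xi.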
Each forward step then consists of adding the variable with the largest partial F statistic, provided that it is greater than the criterion-to-enter value.
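The forward step just described can be sketched in Python as follows. This is an illustration under several assumptions rather than the product's code: it uses NumPy, fits every model with an intercept, works on synthetic data, and uses a hypothetical criterion-to-enter value of 4.0. The function names are invented for the example.

```python
import numpy as np


def regression_ss(X, y):
    """Least-squares fit with an intercept; returns (DRS, RSS, residual df)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    rss = float(resid @ resid)
    total_ss = float(((y - y.mean()) ** 2).sum())
    return total_ss - rss, rss, n - A.shape[1]


def partial_f(X, y, in_model, candidate):
    """Partial F for adding `candidate` to the columns listed in `in_model`."""
    drs_without, _, _ = regression_ss(X[:, in_model], y)
    drs_with, rss_with, df_with = regression_ss(X[:, in_model + [candidate]], y)
    ess = drs_with - drs_without          # extra sums of squares
    return ess / (rss_with / df_with)     # ESS / residual mean square


def forward_step(X, y, in_model, f_to_enter=4.0):
    """Return the column to add, or None if no candidate meets the criterion."""
    candidates = [j for j in range(X.shape[1]) if j not in in_model]
    if not candidates:
        return None
    scores = {j: partial_f(X, y, in_model, j) for j in candidates}
    best = max(scores, key=scores.get)
    return best if scores[best] > f_to_enter else None


# Synthetic data (an assumption for illustration): only column 1 drives y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 1] + 0.1 * rng.normal(size=200)

print(forward_step(X, y, in_model=[]))  # column 1 enters first
```

Refitting every candidate model from scratch keeps the sketch short; it is not meant to be efficient.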
An equivalent alternative to using the partial F statistic is to use the probability or P-value associated with the t statistic mentioned earlier under model diagnostics. The t statistic is the ratio of the b-coefficient to its standard error. Teradata Warehouse Miner offers both alternatives as an option. When the P-value is used, a forward step consists of adding the variable with the smallest P-value, provided that it is less than the criterion to enter. In this case, if more than one variable has a P-value of 0, the variable with the largest F statistic is entered.
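The equivalence of the two criteria follows from a standard identity for a single added variable, written out here in LaTeX. The symbols b_i, se(b_i), and \nu (the residual degrees of freedom of the model that includes x_i) are notation introduced for this sketch, and the P-value is assumed to be two-sided.

```latex
F(x_i) \;=\; \frac{\mathrm{ESS}(x_i)}{\mathrm{meanRSS}(x_i)}
       \;=\; \left(\frac{b_i}{\mathrm{se}(b_i)}\right)^{2}
       \;=\; t_i^{2},
\qquad
P\bigl(F_{1,\nu} > t_i^{2}\bigr) \;=\; P\bigl(\lvert T_{\nu}\rvert > \lvert t_i\rvert\bigr)
```

Because all candidates at a given step share the same \nu, the variable with the largest partial F statistic is also the one with the smallest P-value.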
Stepwise Linear Regression - Backward Step
Each backward step seeks to remove the independent variable xi that least contributes to explaining the variance in the dependent variable y. The partial F statistic is calculated for each independent x variable in the model. If the smallest of these values is less than the criterion to remove, the corresponding variable is removed.
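A backward step can be sketched in the same way. As before, this is an illustration under assumptions, not the product's code: it uses NumPy, fits every model with an intercept, works on synthetic data, and uses a hypothetical criterion-to-remove value of 3.9. Each variable's partial F statistic is taken to be the drop in DRS when that variable is left out, divided by the residual mean square of the current model.

```python
import numpy as np


def regression_ss(X, y):
    """Least-squares fit with an intercept; returns (DRS, RSS, residual df)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    rss = float(resid @ resid)
    total_ss = float(((y - y.mean()) ** 2).sum())
    return total_ss - rss, rss, n - A.shape[1]


def backward_step(X, y, in_model, f_to_remove=3.9):
    """Return the column to remove, or None if every variable meets the criterion."""
    if not in_model:
        return None
    drs_full, rss_full, df_full = regression_ss(X[:, in_model], y)
    mean_rss = rss_full / df_full
    scores = {}
    for j in in_model:
        rest = [v for v in in_model if v != j]
        drs_rest, _, _ = regression_ss(X[:, rest], y)
        scores[j] = (drs_full - drs_rest) / mean_rss  # partial F of variable j
    weakest = min(scores, key=scores.get)
    return weakest if scores[weakest] < f_to_remove else None


# Synthetic data (an assumption for illustration): column 1 plays no role in y.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 3.0 * X[:, 2] + 0.1 * rng.normal(size=200)

print(backward_step(X, y, in_model=[0, 1, 2]))  # column 1 if its partial F is below 3.9, else None
print(backward_step(X, y, in_model=[0, 2]))     # None: both remaining variables are strong
```

Only the weakest variable is ever a candidate for removal, so at most one variable leaves the model per step.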
As with forward steps, an option is provided to use the probability or P-value associated with the t statistic, that is, the ratio of the b-coefficient to its standard error. In this case, the P-values for all variables currently in the model are calculated at one time, and the variable with the largest P-value is removed if that value is greater than the criterion to remove.
Linear Regression and Missing Data
Null values for columns in a linear regression analysis can adversely affect results. It is recommended that the listwise deletion option be used when building the input matrix with the Build Matrix function. This ensures that any row for which one of the columns is null will be left out of the matrix computations completely. Another strategy is to use the Recoding transformation function to build a new column, substituting a fixed known value for null values. Yet another option is to use one of the analytic algorithms in Teradata Warehouse Miner to estimate replacement values for null values. This technique is often called missing value imputation.
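The three strategies can be illustrated on a small in-memory table. This is a plain-Python sketch, not the Build Matrix or Recoding functions themselves: rows are assumed to be tuples with None standing for a null, the helper names are hypothetical, and column-mean imputation is used only as a simple stand-in for whichever analytic algorithm would estimate the replacement values.

```python
def listwise_delete(rows):
    """Drop every row that has a null in any column."""
    return [row for row in rows if all(v is not None for v in row)]


def recode_nulls(rows, column, fill_value):
    """Replace nulls in one column with a fixed known value."""
    return [
        tuple(fill_value if (i == column and v is None) else v for i, v in enumerate(row))
        for row in rows
    ]


def impute_mean(rows, column):
    """Replace nulls in one column with the mean of its non-null values."""
    observed = [row[column] for row in rows if row[column] is not None]
    return recode_nulls(rows, column, sum(observed) / len(observed))


rows = [
    (1.0, 2.0, 3.0),
    (None, 5.0, 6.0),
    (7.0, None, 9.0),
    (10.0, 11.0, 12.0),
]

print(listwise_delete(rows))       # [(1.0, 2.0, 3.0), (10.0, 11.0, 12.0)]
print(recode_nulls(rows, 0, 0.0))  # second row becomes (0.0, 5.0, 6.0)
print(impute_mean(rows, 1))        # third row becomes (7.0, 6.0, 9.0)
```

Listwise deletion loses two of the four rows here, which is the trade-off the other two strategies try to avoid.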