Stepwise Linear Regression | Vantage Analytics Library

Vantage Analytics Library User Guide

Deployment
VantageCloud
VantageCore
Edition
Enterprise
IntelliFlex
Lake
VMware
Product
Vantage Analytics Library
Release Number
2.2.0
Published
March 2023
Language
English (United States)
Last Update
2024-01-02
Product Category
Teradata Vantage

Automated stepwise linear regression helps you decide which independent variables to include in a regression model.

If you have k independent variables, there are 2^k - 1 possible models. If k is 2 or 3, you can try all possible models, but as k increases, this approach becomes increasingly impractical. If k is 32, there are more than 4 billion possible models.

The following automated stepwise procedures provide insight into the variables to include in a regression model. However, Teradata does not recommend using the automated results as the only deciding factor. These procedures do not always produce the best results, and sometimes you want to include variables in the model because of their descriptive or intuitive qualities, or exclude them for subjective reasons. To produce a model with useful business application, Teradata recommends some human decision-making.

Stepwise linear regression uses forward steps, backward steps, or both.

Stepwise Linear Regression - Forward Step

Each forward step seeks to add the independent variable x_i that best contributes to explaining the variance in the dependent variable y. To do this, a quantity called the partial F statistic is computed for each variable x_i that can be added to the model. First, a quantity called the extra sums of squares is calculated as follows:

ESS(x_i) = "DRS with x_i" - "DRS without x_i"

where DRS is the regression sums of squares, or "due-to-regression" sums of squares.

Then, the partial F statistic is given by:

f(x_i) = ESS(x_i) / meanRSS(x_i)

where meanRSS is the residual mean square.
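Both quantities can be computed directly from two ordinary least-squares fits, one with and one without the candidate variable. The following NumPy sketch illustrates the calculation; the function names and simulated data are illustrative, not part of the Analytics Library.

```python
import numpy as np

def fit_rss(X, y):
    """Residual sum of squares from an ordinary least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def partial_f(X_base, x_new, y):
    """Partial F statistic for adding column x_new to a model on X_base.

    Because the total sum of squares is fixed, the extra sums of squares
    ESS = "DRS with x_new" - "DRS without x_new" equals the drop in the
    residual sum of squares when x_new is added.
    """
    n = len(y)
    X_full = np.column_stack([X_base, x_new])
    rss_base = fit_rss(X_base, y)
    rss_full = fit_rss(X_full, y)
    ess = rss_base - rss_full                     # extra sums of squares
    mean_rss = rss_full / (n - X_full.shape[1])   # residual mean square
    return ess / mean_rss

# A variable that truly drives y gets a much larger partial F than noise.
rng = np.random.default_rng(0)
n = 100
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 2.0 * x1 + rng.normal(scale=0.5, size=n)      # y depends on x1 only
intercept = np.ones((n, 1))
f1 = partial_f(intercept, x1, y)                  # large
f2 = partial_f(intercept, x2, y)                  # near 1
```

The informative variable x1 produces a partial F statistic far above that of the pure-noise variable x2, which is exactly the contrast the forward step exploits.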

Each forward step consists of adding the variable with the largest partial F statistic, provided that statistic is greater than the criterion-to-enter value.

An equivalent alternative to the partial F statistic is the probability, or P-value, associated with the T-statistic mentioned earlier under Model Diagnostics. The T-statistic is the ratio of a b-coefficient to its standard error. The Analytics Library offers both alternatives as an option. When the P-value is used, the forward step consists of adding the variable with the smallest P-value, provided it is less than the criterion to enter. In this case, if more than one variable has a P-value of 0, the variable with the largest F statistic is entered.

Stepwise Linear Regression - Backward Step

Each backward step seeks to remove the independent variable x_i that least contributes to explaining the variance in the dependent variable y. The partial F statistic is calculated for each independent x variable in the model. If the smallest of these values is less than the criterion to remove, that variable is removed.

As with forward steps, an option is provided to use the probability, or P-value, associated with the T-statistic (the ratio of the b-coefficient to its standard error). In this case, the P-values are calculated for all variables currently in the model at once, and the variable with the largest P-value is removed if that value is greater than the criterion to remove.

Forward-Only Stepwise Linear Regression

The forward-only procedure consists solely of forward steps, starting without any independent x variables in the model. Forward steps are continued until no variables can be added to the model.
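The forward-only procedure can be sketched as a loop around the partial-F calculation. This NumPy sketch assumes a partial F criterion; the criterion-to-enter value of 4.0 and the simulated data are illustrative choices, not library defaults.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def forward_only(X, y, f_enter=4.0):
    """Forward-only selection over the columns of X (intercept is implicit).

    Start with no x variables; at each step add the candidate with the
    largest partial F statistic, provided it exceeds f_enter; stop when
    no remaining candidate meets the criterion to enter.
    """
    n, k = X.shape
    selected, remaining = [], list(range(k))
    while remaining:
        base = np.column_stack([np.ones(n)] + [X[:, j] for j in selected])
        best_f, best_j = -np.inf, None
        for j in remaining:
            full = np.column_stack([base, X[:, j]])
            rss0, rss1 = rss(base, y), rss(full, y)
            f = (rss0 - rss1) / (rss1 / (n - full.shape[1]))
            if f > best_f:
                best_f, best_j = f, j
        if best_f <= f_enter:
            break                     # no candidate meets the criterion to enter
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Columns 0 and 2 drive y; columns 1 and 3 are pure noise.
rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=n)
chosen = forward_only(X, y)           # includes columns 0 and 2
```

With a strong signal, both true predictors are entered; a noise column is only entered if its partial F statistic happens to clear the criterion by chance.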

Forward Stepwise Linear Regression

The forward stepwise procedure is a combination of the forward and backward steps, starting without any independent x variables in the model. One forward step is followed by one backward step, and these single forward and backward steps are alternated until no variables can be added or removed.

Backward-Only Stepwise Linear Regression

The backward-only procedure consists solely of backward steps, starting with all independent x variables in the model. Backward steps are continued until no variables can be removed from the model.
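The backward-only procedure runs the same partial-F calculation in the opposite direction. The sketch below assumes a partial F criterion; the criterion-to-remove value of 4.0 and the simulated data are illustrative, not library defaults.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def backward_only(X, y, f_remove=4.0):
    """Backward-only elimination over the columns of X (intercept implicit).

    Start with all x variables; at each step drop the variable with the
    smallest partial F statistic, provided it is below f_remove; stop when
    every remaining variable meets the criterion to remove.
    """
    n = X.shape[0]
    selected = list(range(X.shape[1]))
    while selected:
        full = np.column_stack([np.ones(n)] + [X[:, j] for j in selected])
        rss_full = rss(full, y)
        mean_rss = rss_full / (n - full.shape[1])   # residual mean square
        worst_f, worst_j = np.inf, None
        for j in selected:
            reduced = np.column_stack(
                [np.ones(n)] + [X[:, i] for i in selected if i != j])
            f = (rss(reduced, y) - rss_full) / mean_rss
            if f < worst_f:
                worst_f, worst_j = f, j
        if worst_f >= f_remove:
            break                     # every variable meets the criterion
        selected.remove(worst_j)
    return selected

# Columns 0 and 2 drive y; columns 1 and 3 are pure noise.
rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=n)
kept = backward_only(X, y)            # retains columns 0 and 2
```

The true predictors survive because their partial F statistics stay far above the criterion to remove, while noise columns are typically eliminated.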

Backward Stepwise Linear Regression

The backward stepwise procedure is a combination of the backward and forward steps, starting with all of the independent x variables in the model. One backward step is followed by one forward step, and these single backward and forward steps are alternated until no variables can be added or removed.

Linear Regression and Missing Data

Null values for columns in a linear regression analysis can adversely affect results. Teradata recommends using the listwise deletion option when building the input matrix with the Matrix Building function. This ensures that any row with a null in any column is left out of the matrix computations completely. Listwise deletion is requested by setting the matrix parameter nullhandling=IGNORE. Additional strategies when building the input matrix are as follows:
  • Use the Recoding transformation function to build a new column, substituting a fixed known value for null values.
  • Use one of the analytic algorithms in Analytics Library to estimate replacement values for null values.
This technique is often called missing value imputation. Several schemes for replacing null values are available in the Variable Transformation function when using the Null Replacement transformation.
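Both strategies can be illustrated with a small NumPy sketch; the array and values are hypothetical, and the actual Matrix Building and Variable Transformation functions operate inside Vantage rather than on in-memory arrays.

```python
import numpy as np

# Listwise deletion (the effect of nullhandling=IGNORE): a row with a
# null in any column is excluded from the matrix computations entirely.
data = np.array([
    [1.0, 2.0,    3.0],
    [4.0, np.nan, 6.0],   # this row is dropped
    [7.0, 8.0,    9.0],
])
complete = data[~np.isnan(data).any(axis=1)]   # keeps rows 0 and 2

# Null replacement (missing value imputation): substitute, for example,
# each column's mean for its null values instead of dropping rows.
col_means = np.nanmean(data, axis=0)           # [4.0, 5.0, 6.0]
imputed = np.where(np.isnan(data), col_means, data)
```

Listwise deletion discards information from partially observed rows, while imputation keeps every row at the cost of substituting estimated values, which is the trade-off behind the strategies listed above.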