Model Diagnostics | Linear Regression | Vantage Analytics Library

Vantage Analytics Library User Guide

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, Lake, VMware
Product: Vantage Analytics Library
Release Number: 2.2.0
Published: March 2023
Language: English (United States)
Last Update: 2024-01-02
Product Category: Teradata Vantage

Linear regression can compute rigorous, well-understood measurements of the effectiveness of a model. Most of these measurements are based on probability theory.

Goodness of Fit

One diagnostic for assessing model effectiveness is the residual sum of squares, or sum of squared errors, RSS, which is the sum of the squared differences between the value ŷ of the dependent variable estimated by the model and the actual value of y over all rows:

RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

A similar measure is the total sum of squares about the mean, TSS, which is based on a naive estimate of y, the mean value ȳ:

TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2

The squared multiple correlation coefficient R² measures the improvement of the fit given by the linear regression model:

R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}

R² has a value between 0 and 1, where 1 indicates the maximum improvement in fit over the naive estimate of y. The multiple correlation coefficient Ry·x1x2...xn, the positive square root of R², is the correlation between the actual y values and the y values predicted using the independent x variables.
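
The following sketch shows how these quantities relate, using NumPy. The arrays y and y_hat are hypothetical stand-ins for the actual dependent values and a fitted model's estimates; they are not produced by the Vantage Analytics Library itself.

import numpy as np

y = np.array([3.1, 4.0, 5.2, 6.1, 7.3])      # actual dependent values (hypothetical)
y_hat = np.array([3.0, 4.2, 5.0, 6.3, 7.1])  # model estimates (hypothetical)

rss = np.sum((y - y_hat) ** 2)     # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)  # total sum of squares about the mean
r_squared = 1.0 - rss / tss        # squared multiple correlation coefficient

print(rss, tss, r_squared)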

The following formula adjusts R² for the number of observations and independent variables in the model, where n is the number of observations and p is the number of independent variables. If there is no constant term, change the denominator to n - p.

R_{adj}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
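
A minimal sketch of this adjustment, assuming hypothetical values for n, p, and R²:

n, p = 100, 4      # observations, independent variables (hypothetical)
r_squared = 0.75   # unadjusted R-squared (hypothetical)

# With a constant term the denominator is n - p - 1;
# with no constant term it would be n - p.
adj_r_squared = 1.0 - (1.0 - r_squared) * (n - 1) / (n - p - 1)
print(adj_r_squared)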

TSS - RSS is the due-to-regression sum of squares, or DRS. The total variation TSS equals the variation due to regression, DRS, plus the unexplained residual variation, RSS.

The following equation, called the fundamental equation of regression analysis, is the same as saying TSS = DRS + RSS:

\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

The following F-test determines if the set of x variables explains a significant amount of variation in y:

F_{p,\,n-p-1} = \frac{\mathrm{meanDRS}}{\mathrm{meanRSS}}

The value meanDRS is DRS divided by its degrees of freedom, p.

The value meanRSS is RSS divided by its degrees of freedom, n - p - 1.
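
A minimal sketch of the F-test, assuming hypothetical sums of squares and using scipy.stats for the F-distribution probability:

from scipy import stats

n, p = 100, 4            # observations, independent variables (hypothetical)
tss, rss = 480.0, 120.0  # total and residual sums of squares (hypothetical)

drs = tss - rss               # due-to-regression sum of squares
mean_drs = drs / p            # DRS divided by its degrees of freedom
mean_rss = rss / (n - p - 1)  # RSS divided by its degrees of freedom
f_stat = mean_drs / mean_rss
p_value = stats.f.sf(f_stat, p, n - p - 1)  # upper-tail probability
print(f_stat, p_value)

A small p-value indicates that the x variables together explain a significant amount of the variation in y.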

Confidence Intervals and Standard Errors

The standard deviation of the sampling distribution of each b-coefficient determines its confidence interval. For example, if a b-coefficient has the value 6 and a 95% confidence interval of 5 to 7, the confidence interval contains the true population coefficient with a confidence coefficient of 95%. If you take repeated samples of the same size from the population, 95% of those samples contain the true value for the population coefficient.

The T-statistic (or Wald statistic) is the ratio of a b-coefficient value to its standard error. You can use the T-statistic with its associated t-distribution probability value to assess the statistical significance of this b-coefficient in the model.
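
A minimal sketch of both ideas, assuming a hypothetical coefficient value, standard error, and residual degrees of freedom, and using scipy.stats for the t-distribution:

from scipy import stats

b, se_b = 6.0, 0.51  # b-coefficient and its standard error (hypothetical)
df = 95              # residual degrees of freedom, n - p - 1 (hypothetical)

t_stat = b / se_b                            # T-statistic (Wald statistic)
p_value = 2 * stats.t.sf(abs(t_stat), df)    # two-sided probability value
t_crit = stats.t.ppf(0.975, df)              # critical value for a 95% interval
ci = (b - t_crit * se_b, b + t_crit * se_b)  # 95% confidence interval
print(t_stat, p_value, ci)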

Standardized Coefficients

Converting the least-squares estimates of the b-coefficients to standardized (or beta) coefficients recasts the model in terms of the Z-scores of the independent variables.

Standardized values cast each variable into units measuring the number of standard deviations away from the mean value for that variable. Because standardized values are scaled equivalently, it is easier to see their relative importance in the model. If you do not standardize values, it is difficult to compare the coefficient for a variable such as income to the coefficient for a variable such as age or the number of years an account has been open.
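
One common form of the conversion, shown here as a minimal sketch with hypothetical data: each beta coefficient scales the b-coefficient by the ratio of the standard deviation of its x variable to the standard deviation of y.

import numpy as np

X = np.array([[35, 52000.0],
              [42, 61000.0],
              [28, 39000.0],
              [55, 83000.0]])       # columns: age, income (hypothetical)
y = np.array([1.2, 1.9, 0.8, 2.6])  # dependent variable (hypothetical)
b = np.array([0.015, 0.00002])      # least-squares b-coefficients (hypothetical)

# beta_k = b_k * s_xk / s_y puts all coefficients on the same scale
beta = b * X.std(axis=0, ddof=1) / y.std(ddof=1)
print(beta)  # age and income are now directly comparable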

Incremental R-squared

You can calculate the value R² incrementally by considering the cumulative contributions of x variables added to the model one at a time: first x1, then x1 and x2, and so on through x1, x2, ..., xn. These incremental R² values measure how much the addition of each x variable contributes to explaining the variation in y in the observations and show the importance of the order in which the x variables are specified.
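
A minimal sketch of the incremental calculation, refitting with numpy.linalg.lstsq as each x variable is added; the data are synthetic and purely illustrative:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))  # three x variables (hypothetical)
y = 2 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=50)
tss = np.sum((y - y.mean()) ** 2)

r2 = []
for k in range(1, X.shape[1] + 1):
    Xk = np.column_stack([np.ones(len(y)), X[:, :k]])  # constant + first k x's
    coef, *_ = np.linalg.lstsq(Xk, y, rcond=None)
    rss = np.sum((y - Xk @ coef) ** 2)
    r2.append(1.0 - rss / tss)

print(r2)  # cumulative R-squared; successive differences are the increments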

Multiple Correlation Coefficients

Each independent variable in the model has a squared multiple correlation coefficient with respect to the other independent variables in the model taken together. These values range from 0 (lack of correlation) to 1 (maximum correlation).

A multiple correlation coefficient can be expressed as a variance inflation factor or tolerance. Where Rk² is the squared multiple correlation coefficient of the kth independent variable (see the sketch following this list):
  • This formula computes a variance inflation factor for the kth independent variable:

    VIFk = 1 / (1 - Rk²)
  • This formula computes tolerance for the kth independent variable:

    Tk = 1 - Rk²
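
A minimal sketch of both measures with hypothetical data: regress each x variable on the remaining x variables, and use that regression's R² as Rk².

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=60)  # make x3 nearly dependent on x1

for k in range(X.shape[1]):
    others = np.column_stack([np.ones(len(X)), np.delete(X, k, axis=1)])
    coef, *_ = np.linalg.lstsq(others, X[:, k], rcond=None)
    resid = X[:, k] - others @ coef
    rk2 = 1.0 - resid @ resid / np.sum((X[:, k] - X[:, k].mean()) ** 2)
    print(k + 1, 1.0 / (1.0 - rk2), 1.0 - rk2)  # variable, VIF, tolerance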

When correlation values are high, multiple correlation coefficients may be of limited value as indicators of possible collinearity or near dependencies among variables. However, lower correlation values do not necessarily indicate absence of collinearity. Also, if several near dependencies exist, multiple correlation coefficients cannot distinguish between them. For more information about collinearity diagnostics, see Data Quality Reports and [Belsley, Kuh and Welsch].