Results Data | Linear Regression | Vantage Analytics Library

Vantage Analytics Library User Guide

Deployment
VantageCloud
VantageCore
Edition
Enterprise
IntelliFlex
Lake
VMware
Product
Vantage Analytics Library
Release Number
2.2.0
Published
March 2023
Language
English (United States)
Last Update
2024-01-02
Product Category
Teradata Vantage
This function outputs one or more columns of XML. You can transform the XML to HTML, which is easier to view—see Reports.

Linear Regression Model Statistics

Table name = outputdatabase.outputtablename_rpt
Report Item Description
rid If the groupby option is used, rid is added as an index to the table and is incremented for each distinct value of the groupby column.
Groupby columns A column is generated for each groupby column. Within each column there are distinct values of the groupby columns for which a linear model was built.
Total Observations Number of rows originally summarized in the SSCP matrix that the linear regression analysis is based on. The number of observations reflects the row count after any rows were eliminated by listwise deletion (recommended) when the matrix was built.
Total Sums of squares Given by the equation TSS = ∑(y – ȳ)², where y is the dependent variable being predicted and ȳ is its mean value.
The Total Sums of squares is also called the "total sums of squares about the mean." Its relation to the “due-to-regression sums of squares” and the “residual sums of squares” is given by TSS = DRS + RSS. This is a shorthand form of the fundamental equation of regression analysis:

∑(y – ȳ)² = ∑(ŷ – ȳ)² + ∑(y – ŷ)²

where y is the dependent variable, ȳ is its mean value, and ŷ is its predicted value.
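The TSS = DRS + RSS decomposition can be checked numerically. The following sketch uses hypothetical toy data and plain Python (not anything the Analytics Library itself runs) to fit a one-variable least-squares line and verify the identity:

```python
# Hypothetical toy data -- not produced by the Analytics Library.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least squares for y_hat = b0 + b1*x.
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
     / sum((xi - mean_x) ** 2 for xi in x)
b0 = mean_y - b1 * mean_x
y_hat = [b0 + b1 * xi for xi in x]

tss = sum((yi - mean_y) ** 2 for yi in y)               # total SS about the mean
drs = sum((yh - mean_y) ** 2 for yh in y_hat)           # due-to-regression SS
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # residual SS

# The fundamental equation of regression analysis: TSS = DRS + RSS.
assert abs(tss - (drs + rss)) < 1e-9
```

The identity holds exactly (up to floating-point error) for any least-squares fit that includes a constant term.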

Multiple Correlation Coefficient (R) Correlation between the actual dependent variable y values and the values predicted from the independent x variables, written as Ry · x1x2...xn. Analytics Library calculates it as the positive square root of the Squared Multiple Correlation Coefficient (R2) value.
Squared Multiple Correlation Coefficient (R-squared) Measure of the fit improvement given by the linear regression model over estimating the dependent variable y naïvely with the mean value of y:

R² = (TSS – RSS) / TSS = 1 – RSS / TSS

where TSS is the Total Sums of squares and RSS is the Residual Sums of squares. It has a value between 0 and 1, with 1 indicating the maximum improvement in fit over estimating y naïvely with the mean value of y.

Adjusted R-squared Variation of the Squared Multiple Correlation Coefficient (R2) that has been adjusted for the number of observations and independent variables in the model:

Adjusted R² = 1 – (1 – R²)(n – 1) / (n – p – 1)

where n is the number of observations and p is the number of independent variables (substitute n-p in the denominator if there is no constant term).
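As a small numerical illustration (hypothetical sums of squares, plain Python, not Vantage output), R-squared and Adjusted R-squared follow directly from TSS and RSS:

```python
# Hypothetical values for a model with a constant term.
tss = 38.508   # Total Sums of squares
rss = 0.092    # Residual Sums of squares
n = 5          # number of observations
p = 1          # number of independent variables

# R-squared: improvement in fit over estimating y with its mean.
r_squared = (tss - rss) / tss   # equivalently 1 - rss / tss

# Adjusted R-squared: penalized for the number of observations and
# variables. (Use n - p in the denominator if there is no constant term.)
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

assert 0.0 <= r_squared <= 1.0
assert adj_r_squared <= r_squared
```

Adjusted R-squared is never larger than R-squared, which is why it is preferred when comparing models with different numbers of variables.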

Standard Error of Estimate Square root of the average squared residual value over all the observations:

s = √( ∑(y – ŷ)² / (n – p – 1) )

where y is the actual value of the dependent variable, ŷ is the predicted value, n is the number of observations, and p is the number of independent variables (substitute n-p in the denominator if there is no constant term).
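A quick sketch with hypothetical numbers (plain Python, not the Library's implementation) shows how the Standard Error of Estimate comes out of the residual sums of squares:

```python
import math

# Hypothetical values for a fitted model with a constant term.
rss = 0.092    # Residual Sums of squares, i.e. sum of (y - y_hat)**2
n = 5          # number of observations
p = 1          # number of independent variables

# Standard Error of Estimate: square root of the average squared
# residual, with degrees of freedom n - p - 1 (n - p if no constant term).
std_error_of_estimate = math.sqrt(rss / (n - p - 1))
assert std_error_of_estimate > 0
```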

Regression Sums of squares The “due-to-regression sums of squares” (DRS) referred to in the description of the Total Sums of squares, where it is pointed out that TSS = DRS + RSS. It is also the middle term in what is sometimes known as the fundamental equation of regression analysis:

DRS = ∑(ŷ – ȳ)²

where ȳ is the mean value of the dependent variable y and ŷ is its predicted value.

Regression Degrees of Freedom Number of independent variables in the linear regression model. Used in the calculation of the Regression Mean-Square.
Regression Mean-Square Regression Sums of squares divided by the Regression Degrees of Freedom. This value is also the numerator in the calculation of the Regression F Ratio.
Regression F Ratio A statistical test called an F-test is made to determine if all the independent x variables taken together explain a statistically significant amount of variation in the dependent variable y. This test is carried out on the F Ratio using the following formula:

F = meanDRS / meanRSS
where:
  • meanDRS is the Regression Mean-Square
  • meanRSS is the Residual Mean-Square

    A large value of the F-ratio means that the model as a whole is statistically significant.

    The easiest way to assess the significance of this term in the model is to check if the associated Regression P-Value is less than 0.05. However, the critical value of the F Ratio can be looked up in an F distribution table. This value is very roughly in the range of 1 to 3, depending on the number of observations and variables.

Regression P-value Probability or P-value associated with the statistical test on the Regression F Ratio. This statistical F-test determines if all the independent x variables taken together explain a statistically significant amount of variation in the dependent variable y. A value close to 0 indicates that they do.

The hypothesis being tested, or null hypothesis, is that the coefficients in the model are all zero except the constant term (that is, the corresponding independent variables together contribute nothing to the model). The P-value in this case is the probability of observing an F statistic at least as large as the one given, assuming the null hypothesis is true. A right tail test on the F distribution is performed with a 5% significance level used by convention. If the P-value is less than the significance level (for example, less than 0.05), the null hypothesis is rejected (that is, the coefficients taken together are significant and not all 0).
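The F Ratio calculation can be sketched as follows (hypothetical sums of squares and degrees of freedom, plain Python, not Vantage output). The exact P-value would come from the right tail of the F distribution with (regression df, residual df) degrees of freedom:

```python
# Hypothetical sums of squares and degrees of freedom.
drs = 38.416   # Regression Sums of squares
rss = 0.092    # Residual Sums of squares
reg_df = 1     # Regression Degrees of Freedom (number of x variables)
res_df = 3     # Residual Degrees of Freedom (n - p - 1)

mean_drs = drs / reg_df   # Regression Mean-Square
mean_rss = rss / res_df   # Residual Mean-Square

# Regression F Ratio: a large value suggests the model as a whole
# explains a statistically significant amount of variation in y.
f_ratio = mean_drs / mean_rss
assert f_ratio > 1.0
```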

Residual Sums of squares Sum of the squared differences between the dependent variable estimated by the model and the actual value of y, over all of the rows:

RSS = ∑(y – ŷ)²
Residual Degrees of Freedom Represented as n-p-1 where:
  • n is the number of observations
  • p is the number of independent variables (or n-p if there is no constant term)

Used in the calculation of the Residual Mean-Square.

Residual Mean-Square Residual Sums of squares divided by the Residual Degrees of Freedom. This value is also the denominator in the calculation of the Regression F Ratio.
Output Database Name of the output database.
Output Tablename Name of the output table.
Dependent Name of the dependent variable column.

Linear Regression Variables in Model Report

Table name = outputdatabase.outputtablename
Report Item Description
Groupby Value If groupby is specified, the distinct values for which a linear model was built are added as part of the index here.
Column Name Each independent variable in the model is listed along with accompanying measures. The first independent variable listed is CONSTANT, a fixed value representing the constant term in the linear regression model.
B Coefficient Linear regression attempts to find the b-coefficients in the equation ŷ = b0 + b1x1 + … + bnxn in order to best predict the value of the dependent variable y based on the independent variables x1 to xn. The best values of the coefficients are defined as those that minimize the sum of squared errors over all the observations:

∑(y – ŷ)²
Standard Error Standard error of the B Coefficient term of the linear regression model, measuring how accurate the B Coefficient term is over all the observations used to build the model. It is the basis for estimating a confidence interval for the B Coefficient value.
T Statistic Ratio of a B Coefficient value to its standard error (Std Error). Along with the associated t-distribution probability value, or P-value, it is used to assess the statistical significance of this term in the linear model.

The easiest way to assess the significance of this term in the model is to check if the P-value is less than 0.05. Alternatively, you can look up the critical T Stat value in a two-tailed T distribution table with a probability of 0.95 and degrees of freedom roughly equal to the number of observations minus the number of variables. As a rule of thumb, if the absolute value of T Stat is greater than about 2, the model term is statistically significant.

P-value T-distribution probability value associated with the T-statistic (T Stat), that is, the ratio of the B-coefficient value to its standard error (Std Error). Use this value to assess the statistical significance of this term in the linear model. A value close to 0 implies statistical significance and means this term in the model is important.

The hypothesis being tested, or null hypothesis, is that the coefficient in the model is actually zero (that is, the corresponding independent variable contributes nothing to the model). The P-value in this case is the probability of observing a T-statistic with an absolute value at least as large as the one given, assuming the null hypothesis is true. A two-tailed test on the T-distribution is performed with a 5% significance level used by convention. If the P-value is less than the significance level (for example, less than 0.05), the null hypothesis is rejected (the coefficient is statistically significant and not 0).
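The T Statistic itself is a simple ratio. The sketch below uses hypothetical coefficient values (plain Python, not Vantage output) to show the rule of thumb in action:

```python
# Hypothetical B Coefficient and its standard error.
b_coefficient = 1.96
std_error = 0.055

# T Statistic: ratio of the coefficient to its standard error.
t_stat = b_coefficient / std_error

# Rule of thumb: |t| greater than about 2 usually means the term is
# significant; the exact P-value comes from a two-tailed t test.
assert abs(t_stat) > 2
```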

Squared Multiple Correlation Coefficient (R-squared) Measure of the correlation of this (the kth) independent variable with the other independent variables in the model taken together. (Do not confuse this measure with the R2 measure of the same name that applies to the model as a whole.) The value ranges from 0 to 1, with 0 indicating a lack of correlation and 1 indicating the maximum correlation. It is not calculated for the constant term in the model.
Multiple correlation coefficients are presented in related forms such as variance inflation factors or tolerances. The variance inflation factor is given by the formula:

Vk = 1 / (1 – Rk²)

where Vk is the variance inflation factor and Rk2 is the squared multiple correlation coefficient for the kth independent variable. Tolerance is given by the formula Tk = 1 – Rk2 where Tk is the tolerance of the kth independent variable and Rk2 is as before.
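Both derived forms are one-line transformations of Rk². A sketch with a hypothetical Rk² value (plain Python, not Vantage output):

```python
# Hypothetical squared multiple correlation of the kth independent
# variable with the other independent variables in the model.
rk_squared = 0.8

tolerance = 1 - rk_squared    # Tk = 1 - Rk^2
vif = 1 / (1 - rk_squared)    # Vk = 1 / (1 - Rk^2)

# High Rk^2 (collinearity) drives the tolerance toward 0
# and the variance inflation factor upward.
assert tolerance < 0.5 and vif > 2
```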

See "Multiple Correlation Coefficients" in Model Diagnostics for details on the limitations of using this measure to detect collinearity problems in the data.

Lower Lower value in the confidence interval for this coefficient, based on the standard error value of the coefficient. For example, if the coefficient has a value of 6 and a confidence interval of 5 to 7, then according to the normal error distribution assumptions of the model, there is a 95% probability that the true population value of the coefficient is between 5 and 7.
Upper Upper value in the confidence interval for this coefficient based on the standard error value of the coefficient. For example, if the coefficient has a value of 6 and a confidence interval of 5 to 7, it means that according to the normal error distribution assumptions of the model, there is a 95% probability that the true population value of the coefficient is actually between 5 and 7.
Standard Coefficient Also referred to as beta-coefficient. Expresses the linear model in terms of the z-scores or standardized values of the independent variables. Standardized values cast each variable into units measuring the number of standard deviations away from the mean value for that variable. The advantage of examining standardized coefficients is that they are scaled equivalently so that their relative importance in the model can be more easily seen.
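The rescaling is a simple ratio of standard deviations. The sketch below (hypothetical toy data, plain Python, not the Library's implementation) converts an unstandardized coefficient into its beta form:

```python
import statistics

# Hypothetical toy data -- not produced by the Analytics Library.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
b1 = 1.96  # unstandardized B Coefficient for x, fitted previously

# Standard (beta) coefficient: express the term in standard-deviation
# units so coefficients of differently scaled variables are comparable.
beta = b1 * statistics.stdev(x) / statistics.stdev(y)
```

For a one-variable model, beta equals the correlation between x and y, so it cannot exceed 1 in absolute value.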
Incremental R-squared It is possible to calculate the Squared Multiple Correlation value of the model incrementally by considering the cumulative contributions of x variables added to the model one at a time, namely Ry · x1, Ry · x1x2, …, Ry · x1x2...xn. These are called Incremental R2 values, and they give a measure of how much the addition of each x variable contributes to explaining the variation in y in the observations.
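The idea can be demonstrated with a small least-squares helper. Everything below is illustrative (hypothetical data and a hand-rolled solver in plain Python, not the Library's algorithm): y is constructed as exactly 1 + 2·x1 + 3·x2, so adding x2 must raise R² to 1.

```python
# Hypothetical toy data with two predictors.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]

def r_squared(columns, y):
    """R^2 of an OLS fit of y on the given predictor columns (plus a
    constant term), solving the normal equations by Gaussian elimination."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented normal-equations matrix [X'X | X'y].
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Forward elimination with partial pivoting.
    for i in range(k):
        pivot = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[pivot] = A[pivot], A[i]
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            for c in range(i, k + 1):
                A[r][c] -= f * A[i][c]
    # Back substitution for the coefficients b.
    b = [0.0] * k
    for i in reversed(range(k)):
        b[i] = (A[i][k] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]
    y_hat = [sum(bj * xj for bj, xj in zip(b, row)) for row in X]
    mean_y = sum(y) / n
    tss = sum((yi - mean_y) ** 2 for yi in y)
    rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return 1 - rss / tss

# Incremental R^2: add x variables one at a time.
r2_x1 = r_squared([x1], y)        # Ry . x1 squared
r2_x1x2 = r_squared([x1, x2], y)  # Ry . x1x2 squared

assert r2_x1 < r2_x1x2   # x2 adds explanatory power
```

The difference r2_x1x2 - r2_x1 is the incremental contribution of x2 to explaining the variation in y.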

Linear Regression Model

Table name = outputdatabase.outputtablename_txt
Report Item Data Type Description
Groupby Variable User-Defined A column is generated for each groupby column. Within each column there are distinct values of the groupby columns for which a linear model was built.
partId INTEGER Incremented for each 31000-byte block of XML in the XmlModel column.
XmlModel VARCHAR(31000) A 31000 byte block of an XML representation of the model. This column is for scoring the model.

Any requested reports appear here.

To extract the XML into a viewable format, run the following query:

SELECT XMLSERIALIZE(Content X.Dot) as XMLText
FROM (SELECT * FROM "outputdatabase"."outputtablename_txt") AS C,
XMLTable (
'//*'
PASSING CREATEXML(C.XmlModel)
) AS X ("Dot");