Linear Regression Model Statistics
Report Item | Description |
---|---|
rid | If the groupby option is used, rid is added as an index to the table and is incremented for each distinct value of the groupby column. |
Groupby columns | A column is generated for each groupby column. Within each column there are distinct values of the groupby columns for which a linear model was built. |
Total Observations | Number of rows originally summarized in the SSCP matrix that the linear regression analysis is based on. The number of observations reflects the row count after any rows were eliminated by listwise deletion (recommended) when the matrix was built. |
Total Sums of squares | Given by the equation TSS = ∑(y – ȳ)², where y is the dependent variable that is being predicted and ȳ is its mean value. The Total Sums of squares is also called the "total sums of squares about the mean." The total sum of squares is related to the “due-to-regression sums of squares” and the “residual sums of squares” by TSS = DRS + RSS. This is a shorthand form of the fundamental equation of regression analysis: ∑(y – ȳ)² = ∑(ŷ – ȳ)² + ∑(y – ŷ)², where y is the dependent variable, ȳ is its mean value, and ŷ is its predicted value. |
Multiple Correlation Coefficient (R) | Correlation between the actual dependent variable y values and the values predicted from the independent x variables, written as Ry·x1x2...xn. It is calculated in Analytics Library as the positive square root of the Squared Multiple Correlation Coefficient (R²) value. |
Squared Multiple Correlation Coefficient (R-squared) | Measure of the fit improvement given by the linear regression model over estimating the dependent variable y naïvely with the mean value of y: R² = 1 – RSS/TSS, where TSS is the Total Sums of squares and RSS is the Residual Sums of squares. It has a value between 0 and 1, with 1 indicating the maximum improvement in fit over estimating y naïvely with the mean value of y. |
Adjusted R-squared | Variation of the Squared Multiple Correlation Coefficient (R²) that has been adjusted for the number of observations and independent variables in the model: adjusted R² = 1 – (1 – R²)(n – 1)/(n – p – 1), where n is the number of observations and p is the number of independent variables (substitute n – p in the denominator if there is no constant term). |
Standard Error of Estimate | Square root of the average squared residual value over all the observations: SEE = √( ∑(y – ŷ)² / (n – p – 1) ), where y is the actual value of the dependent variable, ŷ is the predicted value, n is the number of observations, and p is the number of independent variables (substitute n – p in the denominator if there is no constant term). |
Regression Sums of squares | The “due-to-regression sums of squares” (DRS) referred to in the description of the Total Sums of squares, where it is pointed out that TSS = DRS + RSS. It is given by DRS = ∑(ŷ – ȳ)² and is the middle term in what is sometimes known as the fundamental equation of regression analysis: ∑(y – ȳ)² = ∑(ŷ – ȳ)² + ∑(y – ŷ)², where y is the dependent variable, ȳ is its mean value, and ŷ is its predicted value. |
Regression Degrees of Freedom | Number of independent variables in the linear regression model. Used in the calculation of the Regression Mean-Square. |
Regression Mean-Square | Regression Sums of squares divided by the Regression Degrees of Freedom. This value is also the numerator in the calculation of the Regression F Ratio. |
Regression F Ratio | A statistical test called an F-test is made to determine if all the independent x variables taken together explain a statistically significant amount of variation in the dependent variable y. This test is carried out on the F Ratio, given by F = (DRS / p) / (RSS / (n – p – 1)), where DRS is the Regression Sums of squares, RSS is the Residual Sums of squares, p is the Regression Degrees of Freedom, and n – p – 1 is the Residual Degrees of Freedom. Equivalently, the F Ratio is the Regression Mean-Square divided by the Residual Mean-Square. |
Regression P-value | Probability or P-value associated with the statistical test on the Regression F Ratio. This statistical F-test determines if all the independent x variables taken together explain a statistically significant amount of variation in the dependent variable y. A value close to 0 indicates that they do. The hypothesis being tested, or null hypothesis, is that the coefficients in the model are all zero except the constant term (that is, all the corresponding independent variables together contribute nothing to the model). The P-value in this case is the probability, assuming the null hypothesis is true, of obtaining an F statistic as large as or larger than the one observed. A right tail test on the F distribution is performed with a 5% significance level used by convention. If the P-value is less than the significance level (for example, less than 0.05), the null hypothesis is rejected (that is, the coefficients taken together are significant and not all 0). |
Residual Sums of squares | Sum of the squared differences between the dependent variable value estimated by the model and the actual value of y, over all of the rows: RSS = ∑(y – ŷ)², where y is the actual value of the dependent variable and ŷ is the predicted value. |
Residual Degrees of Freedom | Represented as n – p – 1, where n is the number of observations and p is the number of independent variables (substitute n – p if there is no constant term). Used in the calculation of the Residual Mean-Square. |
Residual Mean-Square | Residual Sums of squares divided by the Residual Degrees of Freedom. This value is also the denominator in the calculation of the Regression F Ratio. |
Output Database | Name of the output database. |
Output Tablename | Name of the output table. |
Dependent | Name of the dependent variable column. |
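The model-level statistics in the table above can be reproduced on a small example. The following sketch is illustrative only (the data and a single-predictor model are assumptions, not Analytics Library output); it computes the sums of squares, R-squared, Adjusted R-squared, Standard Error of Estimate, and Regression F Ratio exactly as the formulas above define them:

```python
import math

# Illustrative one-predictor data (not from Analytics Library)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(y)          # Total Observations
p = 1               # Regression Degrees of Freedom (one independent variable)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares coefficients for y-hat = b0 + b1*x
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x
y_hat = [b0 + b1 * xi for xi in x]

# Sums of squares; note the identity TSS = DRS + RSS
tss = sum((yi - mean_y) ** 2 for yi in y)              # Total Sums of squares
drs = sum((yh - mean_y) ** 2 for yh in y_hat)          # Regression Sums of squares
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # Residual Sums of squares

r_squared = 1 - rss / tss
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
std_err_estimate = math.sqrt(rss / (n - p - 1))        # Standard Error of Estimate
f_ratio = (drs / p) / (rss / (n - p - 1))              # Regression F Ratio

print(round(tss, 4), round(drs, 4), round(rss, 4))     # 6.0 3.6 2.4
print(round(r_squared, 4))                             # 0.6
print(round(adj_r_squared, 4))                         # 0.4667
print(round(f_ratio, 4))                               # 4.5
```

Computing the Regression P-value additionally requires the right-tail probability of the F distribution, which is omitted here to keep the sketch dependency-free.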
Linear Regression Variables in Model Report
Report Item | Description |
---|---|
Groupby Value | If groupby is specified, the distinct values for which a linear model was built are added as part of the index here. |
Column Name | Each independent variable in the model is listed along with accompanying measures. The first independent variable listed is CONSTANT, a fixed value representing the constant term in the linear regression model. |
B Coefficient | Linear regression attempts to find the b-coefficients in the equation ŷ = b0 + b1x1 + … + bnxn in order to best predict the value of the dependent variable y based on the independent variables x1 to xn. The best values of the coefficients are defined as the values that minimize the sum of squared error values over all the observations. |
Standard Error | Standard error of the B Coefficient term of the linear regression model, measuring how accurate the B Coefficient term is over all the observations used to build the model. It is the basis for estimating a confidence interval for the B Coefficient value. |
T Statistic | Ratio of a B Coefficient value to its standard error (Std Error). Along with the associated t-distribution probability value, or P-value, it is used to assess the statistical significance of this term in the linear model. The easiest way to assess the significance of this term in the model is to check whether the P-value is less than 0.05. Alternatively, you can look up the critical T Stat value in a two-tailed t-distribution table at the 0.05 significance level, with degrees of freedom roughly equal to the number of observations minus the number of variables. For all but the smallest models this critical value is close to 2, so if the absolute value of T Stat is greater than about 2, the model term is statistically significant. |
P-value | T-distribution probability value associated with the T-statistic (T Stat), that is, the ratio of the B-coefficient value to its standard error (Std Error). Use this value to assess the statistical significance of this term in the linear model. A value close to 0 implies statistical significance and means this term in the model is important. The hypothesis being tested, or null hypothesis, is that the coefficient in the model is actually zero (that is, the corresponding independent variable contributes nothing to the model). The P-value in this case is the probability, assuming the null hypothesis is true, of obtaining a T-statistic with an absolute value as large as or larger than the one observed. A two-tailed test on the T-distribution is performed with a 5% significance level used by convention. If the P-value is less than the significance level (for example, less than 0.05), the null hypothesis is rejected (the coefficient is statistically significant and not 0). |
Squared Multiple Correlation Coefficient (R-squared) | Measure of the correlation of this, the kth variable, with respect to the other independent variables in the model taken together. (Do not confuse this measure with the R² measure of the same name that applies to the model taken as a whole.) The value ranges from 0 to 1, with 0 indicating a lack of correlation and 1 indicating the maximum correlation. It is not calculated for the constant term in the model. Multiple correlation coefficients are presented in related forms such as variance inflation factors or tolerances. The variance inflation factor is given by the formula Vk = 1 / (1 – Rk²), where Vk is the variance inflation factor and Rk² is the squared multiple correlation coefficient for the kth independent variable. Tolerance is given by the formula Tk = 1 – Rk², where Tk is the tolerance of the kth independent variable and Rk² is as before. See "Multiple Correlation Coefficients" in Model Diagnostics for details on the limitations of using this measure to detect collinearity problems in the data. |
Lower | Lower value in the confidence interval for this coefficient, based on the standard error value of the coefficient. For example, if the coefficient has a value of 6 and a confidence interval of 5 to 7, it means that according to the normal error distribution assumptions of the model, there is a 95% probability that the true population value of the coefficient is actually between 5 and 7. |
Upper | Upper value in the confidence interval for this coefficient based on the standard error value of the coefficient. For example, if the coefficient has a value of 6 and a confidence interval of 5 to 7, it means that according to the normal error distribution assumptions of the model, there is a 95% probability that the true population value of the coefficient is actually between 5 and 7. |
Standard Coefficient | Also referred to as beta-coefficient. Expresses the linear model in terms of the z-scores or standardized values of the independent variables. Standardized values cast each variable into units measuring the number of standard deviations away from the mean value for that variable. The advantage of examining standardized coefficients is that they are scaled equivalently so that their relative importance in the model can be more easily seen. |
Incremental R-squared | It is possible to calculate the Squared Multiple Correlation value of the model incrementally by considering the cumulative contributions of x variables added to the model one at a time, namely R²y·x1, R²y·x1x2, …, R²y·x1x2...xn. These are called Incremental R² values, and they give a measure of how much the addition of each x variable contributes to explaining the variation in y in the observations. |
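The per-coefficient measures in the table above can also be sketched on a small example. The data below and the single-predictor setting are assumptions for illustration; the critical t value is taken from a standard t table rather than computed:

```python
import math

# Illustrative one-predictor data (not from Analytics Library)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n, p = len(y), 1

mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
b0 = mean_y - b1 * mean_x
rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Standard Error of the b1 coefficient (one-predictor case)
see = math.sqrt(rss / (n - p - 1))   # Standard Error of Estimate
se_b1 = see / math.sqrt(sxx)

# T Statistic: coefficient divided by its standard error
t_stat = b1 / se_b1

# Lower/Upper of the 95% confidence interval; 3.182 is the two-tailed
# critical t value for n - p - 1 = 3 degrees of freedom (from a t table)
t_crit = 3.182
lower = b1 - t_crit * se_b1
upper = b1 + t_crit * se_b1

# Standard Coefficient (beta): b1 rescaled into z-score units
sd_x = math.sqrt(sxx / (n - 1))
sd_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
beta = b1 * sd_x / sd_y

print(round(t_stat, 4))              # 2.1213
print(round(lower, 4), round(upper, 4))
print(round(beta, 4))                # 0.7746
```

With one predictor, the standardized coefficient equals the correlation between x and y, which is why beta here matches the square root of the model's R².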
Linear Regression Model
Report Item | Data Type | Description |
---|---|---|
Groupby Variable | User-Defined | A column is generated for each groupby column. Within each column there are distinct values of the groupby columns for which a linear model was built. |
partId | INTEGER | Incremented for each successive 31000-byte block of the XML representation in the XmlModel column. |
XmlModel | VARCHAR(31000) | A block of up to 31000 bytes of an XML representation of the model. This column is used for scoring the model. Any requested reports also appear here. |
To extract the XML into a viewable format, run the following query:
SELECT XMLSERIALIZE(Content X.Dot) as XMLText
FROM (SELECT * FROM "outputdatabase"."outputtablename_txt") AS C,
     XMLTable ( '//*' PASSING CREATEXML(C.XmlModel) ) AS X ("Dot");
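Because XmlModel is split across rows in blocks of at most 31000 bytes, a client that wants the whole XML document must concatenate the blocks in partId order. A minimal sketch of that reassembly step (the row contents are invented, and the commented-out query mirrors the report layout above; adapt names as needed):

```python
# Rows as (partId, XmlModel) pairs, for example fetched with a query like:
#   SELECT partId, XmlModel FROM "outputdatabase"."outputtablename"
# The element names below are placeholders, not the real model schema.
rows = [
    (2, "</LinearModel>"),
    (1, "<LinearModel><Coefficients>...</Coefficients>"),
]

# Concatenate the blocks in partId order to rebuild the full XML document
xml_model = "".join(chunk for _, chunk in sorted(rows))

print(xml_model)
```

Sorting the (partId, chunk) tuples orders them by partId, so the blocks are joined in the order in which they were written.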