5.4.5 - Data Quality Reports - Teradata Warehouse Miner

Teradata Warehouse Miner User Guide - Volume 3Analytic Functions

Product
Teradata Warehouse Miner
Release Number
5.4.5
Published
February 2018
Language
English (United States)
Last Update
2018-05-04
dita:mapPath
yuy1504291362546.ditamap
dita:ditavalPath
ft:empty
A variety of data quality reports are available with the Teradata Warehouse Miner Linear Regression algorithm. Reports include:
  • Constant Variables
  • Variable Statistics
  • Detailed Collinearity Diagnostics
    • Eigenvalues of Unit Scaled X'X
    • Condition Indices
    • Variance Proportions
  • Near Dependency

Constant Variables

Before attempting to build a model the algorithm checks to see if any variables in the model have a constant value. This check is based on the standard deviation values derived from the SSCP matrix input to the algorithm. If a variable with a constant value (i.e., a standard deviation of zero) is detected, the algorithm stops and notifies the user while producing a Constant Variables Table report. After reading this report, the user may then remove the variables in the report from the model and execute the algorithm again.

It is possible that a variable may appear in the Constant Variables Table report that does not actually have a constant value in the data. This can happen when a column has extremely large values that are close together in value. In this case the standard deviation will appear to be zero due to precision loss and will be rejected as a constant column. The remedy for this is to re-scale the values in the column prior to building a matrix or doing the analysis. The Z-Score or the Rescale transformation functions may be used for this purpose.

Variable Statistics

The user may optionally request that a Variables Statistics Report be provided, giving the mean value and standard deviation of each variable in the model based on the SSCP matrix provided as input.

Detailed Collinearity Diagnostics

One of the conditions that can lead to a poor linear regression model is when the independent variables in the model are not independent of each other, that is, when they are collinear (highly correlated) with one another. Collinearity can be loosely defined as a condition where one variable is nearly a linear combination of one or more other variables, sometimes also called a near dependency. This leads to an ill conditioned matrix of variables.

Teradata Warehouse Miner provides an optional Detailed Collinearity Diagnostics report using a specialized technique described in [Belsley, Kuh and Welsch]. This technique involves performing a singular value decomposition of the independent x variables in the model in order to measure collinearity.

The analysis proceeds roughly as follows. In order to put all variables on an equal footing, the data is scaled so that each variable adds up to 1 when summed over all the observations or rows. In order to calculate the singular values of X (the rows of X are the observations), the mathematically equivalent square root of the eigenvalues of X T X are computed instead for practical reasons. The condition index of each eigenvalue is calculated as the square root of the ratio of the largest eigenvalue to the given eigenvalue, a value always 1 or greater. The variance decomposition of these eigenvalues is computed using the eigenvalues together with the eigenvectors associated with them. The result is a matrix giving, for each variable, the proportion of variance associated with each eigenvalue.

Large condition indices indicate a probable near dependency. A value of 10 may indicate a weak dependency, values of 15 to 30 may be considered a borderline dependency, above 30 worth investigating further, and above 100, a potentially damaging collinearity. As a rule of thumb, an eigenvalue with a condition index greater than 30 and an associated variance proportion of greater than 50% with two or more model variables implies that a collinearity problem exists. The somewhat subjective conclusions described here and the experiments they are based on are described in detail in [Belsley, Kuh and Welsch].

An example of the Detailed Collinearity Diagnostics report is given below.

Eigenvalues of Unit Scaled X'X
Factor 1 5.2029
Factor 2 .8393
Factor 3 .5754
Factor 4 .3764
Factor 5 4.1612E-03
Factor 6 1.8793E-03
Factor 7 2.3118E-08
Condition Indices
Factor 1 1
Factor 2 2.4898
Factor 3 3.007
Factor 4 3.718
Factor 5 35.3599
Factor 6 52.6169
Factor 7 15001.8594
Variance Proportions
Variable Name Factor 1 Factor 2 Factor 3 Factor 4 Factor 5 Factor 6 Factor 7
CONSTANT 1.3353E-09 1.0295E-08 1.3781E-09 1.6797E-08 1.1363E-11 2.1981E-07 1
cust_id 1.3354E-09 1.0296E-08 1.3782E-09 1.6799E-08 1.1666E-11 2.2068E-07 1
income 2.3079E-04 1.8209E-03 1.6879E-03 1.1292E-03 .9951 4.4773E-06 1.2957E-05
age 1.0691E-04 1.9339E-04 9.321E-05 1.7896E-03 1.56E-05 .9963 1.4515E-03
children 2.9943E-03 4.4958E-02 .2361 1.6499E-03 3.6043E-04 .713 9.1708E-04
combo1 2.3088E-04 1.8703E-03 1.6658E-03 1.1339E-03 .995 1.0973E-04 2.3525E-05
combo2 1.4002E-04 3.1477E-05 4.4942E-05 5.0407E-03 4.7784E-06 .9935 1.2583E-03

Near Dependency

In addition to or in place of the Detailed Collinearity Diagnostics report, the user may optionally request a Near Dependency report based on the automated application of the specialized criteria used in the aforementioned report. Requesting the Near Dependency report greatly simplifies the search for collinear variables or near dependencies in the data.

The user can specify the threshold value for the condition index (by default 30) and the variance proportion (by default, 0.5) such that a near dependency is reported. That is, if two or more variables have a variance proportion greater than the variance proportion threshold, for a condition index with value greater than the condition index threshold, the variables involved in the near dependency are reported along with their variance proportions, their means and their standard deviations. Near dependencies are reported in descending order based on their condition index value, and variables contributing to a near dependency are reported in descending order based on their variance proportion.

The following is an example of a Near Dependency report.

Near Dependency report (example)
Variable Name Factor Condition Index Variance Proportion Mean Standard Deviation
CONSTANT 7 15001.8594 1 * *
cust_id 7 15001.8594 1 1362987.891 293.5012
age 6 52.6169 .9963 33.744 22.3731
combo2 6 52.6169 .9935 25.733 23.4274
children 6 52.6169 .713 .534 1.0029
income 5 35.3599 .9951 16978.026 21586.8442
combo1 5 35.3599 .995 33654.602 43110.862