5.4.5 - Principal Components - Teradata Warehouse Miner

Teradata Warehouse Miner User Guide - Volume 3 Analytic Functions

Teradata Warehouse Miner
Release Number
February 2018
English (United States)
Last Update

As mentioned in the introduction, the goal of principal components analysis (PCA) is to account for as much of the variance in the original data as possible with the independent principal components it creates. Each of the original variables is expressed as a linear combination of the new principal components, and each principal component in turn, from the first to the last, accounts for the maximum possible share of the remaining sum of the variances of the original variables. These results are achieved by first finding the eigenvalues and eigenvectors of the covariance or correlation matrix of the input variables to be modeled. Although not ordinarily thought of in this way, analyzing v numeric columns in a table in a relational database means, in some sense, working in a v-dimensional vector space corresponding to those columns. When principal components analysis was developed at the beginning of the twentieth century, computing these eigenvalues and eigenvectors was no small task. Today, however, math library routines perform these computations very efficiently.

Although no attempt is made here to derive the mathematical solution for finding principal components, it may be helpful to state the following definition: a square matrix A has an eigenvalue λ and an eigenvector x if Ax = λx for some nonzero vector x. Further, a v x v symmetric matrix A has v pairs of eigenvalues and eigenvectors, (λ1, e1), (λ2, e2), …, (λv, ev). The eigenvectors can be chosen so that they have unit length and are mutually orthogonal (the corresponding principal components are then uncorrelated). With this normalization, and provided the eigenvalues are distinct, each eigenvector is unique up to its sign.
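The definition above can be checked numerically. The following is a minimal sketch, assuming Python with NumPy rather than anything provided by Teradata Warehouse Miner; the 3 x 3 matrix is a hypothetical correlation matrix chosen only for illustration.

```python
import numpy as np

# Hypothetical 3 x 3 symmetric matrix (for example, a correlation matrix).
A = np.array([
    [1.0, 0.6, 0.3],
    [0.6, 1.0, 0.5],
    [0.3, 0.5, 1.0],
])

# numpy.linalg.eigh is intended for symmetric matrices. It returns the
# eigenvalues in ascending order and the eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(A)

# Check the defining relation A e = lambda e for every pair.
for lam, e in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ e, lam * e)

# Unit length and mutual orthogonality together mean E^T E = I.
gram = eigenvectors.T @ eigenvectors
assert np.allclose(gram, np.eye(3))
```

Negating any column of `eigenvectors` gives an equally valid eigenvector, which is why the solution is determined only up to sign.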

To return to the point at hand, the principal component loadings being sought are the eigenvectors of the covariance or correlation matrix just described, each multiplied by the square root of its eigenvalue. The step left out so far is reducing the number of principal components to fewer than the number of original variables. This can be achieved by ordering the eigenvalues, together with their corresponding eigenvectors, from largest to smallest and then discarding the components whose eigenvalues fall below a minimum threshold value, such as 1.0. An alternative technique is to retain a desired number of the largest components regardless of the magnitude of their eigenvalues. Teradata Warehouse Miner provides both of these options. The user may also optionally request that the signs of the principal component loadings be inverted if there are more negative signs than positive ones. This is purely cosmetic and does not affect the solution in a substantive way. However, if signs are reversed, this must be kept in mind when interpreting the components or assigning conceptual meaning to them.
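The loading, retention, and sign-inversion steps described above can be sketched as follows. This is an illustration in Python with NumPy under stated assumptions, not Teradata Warehouse Miner's implementation: the function name `principal_component_loadings` and its parameters are hypothetical, and the sign rule is assumed to be applied separately to each component, which the text does not specify.

```python
import numpy as np

def principal_component_loadings(matrix, min_eigenvalue=None,
                                 n_components=None, invert_signs=False):
    """Return (eigenvalues, loadings) for a covariance or correlation matrix.

    Hypothetical helper for illustration only.
    """
    eigenvalues, eigenvectors = np.linalg.eigh(matrix)

    # Order eigenvalues and their eigenvectors from largest to smallest.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]

    # Retain either a fixed number of components or those whose
    # eigenvalue is not below the minimum threshold.
    if n_components is not None:
        keep = n_components
    elif min_eigenvalue is not None:
        keep = int(np.sum(eigenvalues >= min_eigenvalue))
    else:
        keep = len(eigenvalues)
    eigenvalues = eigenvalues[:keep]
    eigenvectors = eigenvectors[:, :keep]

    # Loadings: each eigenvector scaled by the square root of its eigenvalue.
    # Clipping guards against tiny negative values from rounding error.
    loadings = eigenvectors * np.sqrt(np.clip(eigenvalues, 0.0, None))

    if invert_signs:
        # Assumption: flip a component when it has more negative
        # than positive loadings.
        negatives = np.sum(loadings < 0, axis=0)
        positives = np.sum(loadings > 0, axis=0)
        flip = negatives > positives
        loadings[:, flip] *= -1.0

    return eigenvalues, loadings

# Hypothetical correlation matrix for three variables.
R = np.array([
    [1.0, 0.6, 0.3],
    [0.6, 1.0, 0.5],
    [0.3, 0.5, 1.0],
])

kept_values, kept_loadings = principal_component_loadings(
    R, min_eigenvalue=1.0, invert_signs=True)
top_two_values, top_two_loadings = principal_component_loadings(
    R, n_components=2)
all_values, all_loadings = principal_component_loadings(R)
```

When every component is retained, the product of the loading matrix with its transpose reproduces the input matrix, which is one way to sanity-check the scaling. Flipping the sign of a column leaves that product unchanged, consistent with the inversion being cosmetic.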

A final point worth noting is that each eigenvalue is the variance accounted for by its principal component. This makes it possible to compute several variance-related measures that give some indication of the effectiveness of the principal components model.
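The relationship between eigenvalues and variance can be illustrated with a short sketch, again assuming Python with NumPy and randomly generated, hypothetical data. The proportion and cumulative proportion of variance shown here are common examples of such measures; they are not necessarily the exact set that Teradata Warehouse Miner reports.

```python
import numpy as np

# Hypothetical data: 500 rows, 4 correlated numeric columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))

# Covariance matrix of the columns (sample covariance, ddof=1).
cov = np.cov(X, rowvar=False)

# Eigen-decomposition, ordered from largest to smallest eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Project the centered data onto the eigenvectors to get component scores.
scores = (X - X.mean(axis=0)) @ eigenvectors

# The sample variance of each component's scores equals its eigenvalue.
score_variances = scores.var(axis=0, ddof=1)
assert np.allclose(score_variances, eigenvalues)

# Variance-related measures derived from the eigenvalues.
proportion = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(proportion)
```

The sum of the eigenvalues equals the sum of the variances of the original columns (the trace of the covariance matrix), so `cumulative` reaches 1.0 once all components are included.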