Factor Analysis (PCA)

Vantage Analytics Library User Guide

Consider a dataset with a number of correlated numeric variables that are used in another type of analysis, such as linear regression or cluster analysis, or that are examined to understand customer behavior by discovering hidden structure and meaning in the data. Factor analysis reduces a number of correlated numeric variables to a smaller number of new variables called factors. If the goal of understanding hidden structure and meaning is to be achieved, the new variables, or factors, should be conceptually meaningful. Meaningful factors not only give insight into the dynamics of a business, but also make any models built from them more explainable, a requirement for a useful analytic model.

There are two fundamental types of factor analysis: principal components and common factors. Analytics Library offers principal components, though the components are generally referred to as factors. The Analytics Library also offers factor rotations, both orthogonal and oblique, as a post-processing step applied to the principal components. Finally, automatic factor model scoring is offered using dynamically generated SQL.

Factor analysis begins with a correlation or covariance matrix; that is, it works with either centered and unscaled data or centered and normalized (unit-variance) data. The choice affects the scaling of the resulting factor measures and scores. The algorithm first builds an Extended Sum of Squares and Cross-Products (ESSCP) matrix by calling the Teradata CALCMATRIX table operator and then derives the correlation or covariance matrix from it. Alternatively, you can build a table representation of the ESSCP matrix once and base processing on that table, reusing it so the matrix does not have to be rebuilt each time a factor analysis is performed.
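
For readers who want to see the relationship between these two starting points outside the database, the following Python sketch (using numpy, which is not part of Analytics Library) illustrates how centered, unscaled data leads to a covariance matrix and how centered, unit-variance data leads to a correlation matrix. The data and variable names are purely illustrative.

```python
import numpy as np

# Illustrative data standing in for the numeric input columns (names made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# Centered but unscaled data leads to the covariance matrix.
cov = np.cov(X, rowvar=False)

# Centered and normalized (unit-variance) data leads to the correlation matrix.
corr = np.corrcoef(X, rowvar=False)

# Equivalently, the correlation matrix is the covariance matrix of the
# standardized columns; the two starting points differ only in scaling.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
assert np.allclose(np.cov(Z, rowvar=False), corr)
```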

The primary source of information and formulae in this section is [Harman].

Principal Component Analysis

The goal of principal components analysis (PCA) is to account for the maximum amount of variance of the original data in the principal components created. Each of the original variables is expressed as a linear combination of the new principal components. Each principal component, from the first to the last, accounts for the maximum amount of remaining variance of the original variables. This allows some of the later components to be discarded, retaining only the reduced set of components that accounts for the desired amount of total variance. If all the components were retained, all of the variance would be explained.

A principal components solution has many desirable properties. First, the new components are independent of each other, that is, uncorrelated in statistical terminology or orthogonal in the terminology of linear algebra. Further, the principal components can be calculated directly, yielding a unique solution. This is true also of principal component scores, which can be calculated directly from the solution and are also inherently orthogonal or independent of each other.

Factor Loadings

The term factor loadings is sometimes used to refer to the coefficients of the linear combinations of factors that make up the original variables in a factor analysis model. The appropriate term for this, however, is the factor pattern. A factor loadings matrix is also sometimes assumed to indicate the correlations between the factors and the original variables, for which the appropriate term is factor structure. The two are related by the equation S = PQ, where P is the factor pattern, S is the factor structure, and Q is the matrix of correlations between factors. The good news is that whenever the factors are mutually orthogonal, or independent of each other, Q is the identity matrix and the factor pattern P and the factor structure S are the same.
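
The following minimal Python example (numpy, for illustration only; the matrices are made up) shows the S = PQ relationship: when Q is the identity matrix, that is, when the factors are uncorrelated, the structure matrix coincides with the pattern matrix.

```python
import numpy as np

# Hypothetical factor pattern P: 4 variables loading on 2 factors (made-up values).
P = np.array([[0.8, 0.1],
              [0.7, 0.2],
              [0.1, 0.9],
              [0.2, 0.6]])

# Q is the matrix of correlations between the factors.
Q_orthogonal = np.eye(2)                  # uncorrelated (orthogonal) factors
Q_oblique = np.array([[1.0, 0.3],
                      [0.3, 1.0]])        # correlated (oblique) factors

S_orthogonal = P @ Q_orthogonal           # S = PQ reduces to S = P
S_oblique = P @ Q_oblique                 # S and P now differ

print(np.allclose(S_orthogonal, P))       # True: pattern and structure coincide
```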

In the case of principal components analysis, the factor loadings are labeled as component loadings and represent both the factor pattern and structure. For other types of analysis, the loadings are labeled as the factor pattern but indicate the structure as well, unless a separate structure matrix is also given (as is the case after an oblique rotation, described later).

Keeping these caveats in mind, the component loadings, pattern, or structure matrix is interpreted for its structure properties in order to understand the meaning of each new factor variable. When the analysis is based on a correlation matrix, the loadings, pattern, or structure can be interpreted as a correlation matrix with the columns corresponding to the factors and the rows corresponding to the original variables. Like all correlations, the values range in absolute value from 0 to 1, with higher values representing a stronger correlation or relationship between the variables and factors. By looking at these values, you get an idea of the meaning represented by each factor. Analytics Library stores these so-called factor loadings and other related values in results tables to make them available for scoring.

Factor Scores

To use a factor as a variable, each row or observation in the data must be assigned a value called a factor score. A factor score is a linear combination of the original input variables (without a constant term), and the coefficients applied to the original variables are called factor weights. Analytics Library provides a scoring function that calculates the factor weights and creates a table of new factor score variables using dynamically generated SQL. The ability to automatically generate factor scores, regardless of the factor analysis or rotation options used, is one of the most powerful features of the Analytics Library Factor Analysis module.
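
The sketch below illustrates the idea in Python with numpy. It is not the SQL that Analytics Library generates, and the regression-method formula W = R^-1 L shown here is one standard way to derive factor weights from loadings, not necessarily the library's formula; it is used only to make the "linear combination of the inputs" idea concrete. All data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))                         # illustrative input: 50 rows, 4 variables
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)     # standardized inputs
R = np.corrcoef(X, rowvar=False)                     # correlation matrix of the inputs

# Take the two largest principal components as factors and form their loadings.
w, V = np.linalg.eigh(R)
order = np.argsort(w)[::-1][:2]
loadings = V[:, order] * np.sqrt(w[order])

# Regression-method factor weights: W = R^-1 L (one common choice, for illustration).
weights = np.linalg.solve(R, loadings)

# Each factor score is a linear combination of the input variables, with no constant term.
scores = Z @ weights                                 # shape: (observations, factors)
```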

Principal Components

As mentioned previously, the goal of PCA is to account for the maximum amount of variance of the original data in the independent principal components created. It was also stated that each of the original variables is expressed as a linear combination of the new principal components, and that each principal component in its turn, from the first to the last, accounts for the maximum amount of remaining variance of the original variables. These results are achieved by first finding the eigenvalues and eigenvectors of the covariance or correlation matrix of the input variables to be modeled. Although not ordinarily thought of in this way, when analyzing v numeric columns in a table in a relational database, one is in some sense working in a v-dimensional vector space corresponding to those columns. When principal components analysis was developed at the beginning of the previous century, finding these eigenvalues and eigenvectors was no small task. Today, however, math library routines are available to perform these computations very efficiently.

Although no attempt is made here to derive the mathematical solution for finding principal components, it is helpful to state the following definition: a square matrix A has an eigenvalue λ and an eigenvector x if Ax = λx. Further, a v x v square symmetric matrix A has v pairs of eigenvalues and eigenvectors, (λ1, e1), (λ2, e2), …, (λv, ev). The eigenvectors can also be chosen to have unit length and to be mutually orthogonal (that is, independent or uncorrelated), making them unique.

To return to the point at hand, the principal component loadings being sought are actually the covariance or correlation matrix eigenvectors multiplied by the square roots of their respective eigenvalues. The step left out so far is the reduction of these principal component loadings to a number smaller than the number of variables present at the start. This is achieved by first ordering the eigenvalues, and their corresponding eigenvectors, in descending order, and then discarding those eigenvalues below a minimum threshold value, such as 1.0. An alternative technique is to retain a desired number of the largest components regardless of the magnitude of the eigenvalues. Analytics Library provides both of these options through the eigenmin and numfactors parameters, respectively.

A final point worth noting is that the eigenvalues themselves turn out to be the variance accounted for by each principal component, allowing the computation of several variance-related measures and some indication of the effectiveness of the principal components model.
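
The following Python sketch (numpy; not the library's implementation) walks through the computation just described: eigen-decomposition of a correlation matrix, ordering of the eigenpairs, retention of components in the spirit of the eigenmin and numfactors parameters, and the loadings and variance-explained values derived from the eigenvalues. All data are illustrative.

```python
import numpy as np

# Illustrative correlation matrix built from made-up data with some correlation.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] += X[:, 0]                          # induce correlation between two columns
corr = np.corrcoef(X, rowvar=False)

# Eigen-decomposition of the symmetric correlation matrix.
eigenvalues, eigenvectors = np.linalg.eigh(corr)

# Order the eigenpairs from the largest eigenvalue to the smallest.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Retain components whose eigenvalue meets a minimum threshold (cf. eigenmin) ...
keep = eigenvalues >= 1.0
# ... or, alternatively, keep a fixed count of the largest components (cf. numfactors):
# keep = np.arange(eigenvalues.size) < 2

# Component loadings: eigenvectors scaled by the square roots of their eigenvalues.
loadings = eigenvectors[:, keep] * np.sqrt(eigenvalues[keep])

# Each eigenvalue is the variance accounted for by its component.
proportion_of_variance = eigenvalues[keep] / eigenvalues.sum()
```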

Factor Rotations

When computing principal components or factors, the new components or factors may not have recognizable meaning. Correlations are calculated between the new factors and the original input variables, which presumably have business meaning to the data analyst. But factor-variable correlations may not possess the subjective quality of simple structure. The idea behind simple structure is to express each component or factor in terms of fewer variables that are highly correlated with the factor (or vice versa), with the remaining variables largely uncorrelated with the factor. This makes it easier to understand the meaning of the components or factors in terms of the variables.

Factor rotations are offered to allow the data analyst to attempt to find simple structure and hence meaning in the new components or factors. Orthogonal rotations maintain the independence of the components or factors while aligning them differently with the data to achieve a particular simple structure goal. Oblique rotations relax the requirement for factor independence while more aggressively seeking better data alignment. Analytics Library offers several options for both orthogonal and oblique rotations.

Orthogonal Rotations

First consider orthogonal rotations, that is, rotations of a factor matrix A that result in a rotated factor matrix B by way of an orthogonal transformation matrix T (that is, B = AT). Remember that the nice thing about orthogonal rotations of a factor matrix is that the resulting factor scores remain uncorrelated, a desirable property when the factors are going to be used in a subsequent regression, cluster, or other type of analysis. But how is simple structure obtained?

As described earlier, the idea behind simple structure is to express each component or factor in terms of fewer variables that are highly correlated with the factor, with the remaining variables not so correlated with the factor. The two most famous mathematical criteria for simple factor structure are the quartimax and varimax criteria. Simply put, the varimax criterion seeks to simplify the structure of columns or factors in the factor loading matrix, whereas the quartimax criterion seeks to simplify the structure of the rows or variables in the factor loading matrix. Less simply put, the varimax criterion seeks to maximize the variance of the squared loadings across the variables for all factors. The quartimax criterion seeks to maximize the variance of the squared loadings across the factors for all variables. The solution to either optimization problem is mathematically quite involved, though in principle it is based on fundamental techniques of linear algebra, differential calculus, and the use of the popular Newton-Raphson iterative technique for finding the roots of equations.

Regardless of the criterion used, rotations are performed on normalized loadings; that is, prior to rotating, the rows of the factor loading matrix are scaled to unit length by dividing each element by the square root of the communality for that variable. The rows are rescaled back to their original length after the rotation is performed. This normalization has been found to improve results, particularly for the varimax method.

Fortunately, both the quartimax and varimax criteria can be expressed in terms of the same equation containing a constant value that is 0 for quartimax and 1 for varimax. The orthomax criterion is obtained by setting this constant, call it gamma, to any desired value. Equamax corresponds to setting gamma to half the number of factors, and parsimax corresponds to setting gamma to v(f-1) / (v+f-2), where v is the number of variables and f is the number of factors.
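
To make the gamma parameterization and the normalization step concrete, the following Python sketch implements a commonly published SVD-based orthomax rotation with Kaiser row normalization. It is not the Newton-Raphson procedure used by Analytics Library; the function name and defaults are illustrative, and the sketch is shown only so the roles of gamma and of the normalization are visible.

```python
import numpy as np

def orthomax_rotate(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Orthogonal rotation of a loading matrix under the orthomax criterion.

    gamma = 0 gives quartimax, gamma = 1 gives varimax, gamma = f/2 gives
    equamax, and gamma = v*(f-1)/(v+f-2) gives parsimax, for v variables
    and f factors.
    """
    A = np.asarray(loadings, dtype=float)
    v, f = A.shape

    # Kaiser normalization: scale each row to unit length before rotating.
    h = np.sqrt((A ** 2).sum(axis=1))
    A_norm = A / h[:, None]

    R = np.eye(f)
    obj = 0.0
    for _ in range(max_iter):
        B = A_norm @ R
        # Gradient of the orthomax criterion with respect to the rotation.
        grad = A_norm.T @ (B ** 3 - (gamma / v) * B @ np.diag((B ** 2).sum(axis=0)))
        u, s, vt = np.linalg.svd(grad)
        R = u @ vt
        new_obj = s.sum()
        if new_obj < obj * (1 + tol):    # stop when the criterion no longer improves
            break
        obj = new_obj

    # Undo the normalization so the rotated loadings return to the original scale.
    return (A_norm @ R) * h[:, None], R
```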

Oblique Rotations

As mentioned earlier, oblique rotations relax the requirement for factor independence that exists with orthogonal rotations, while more aggressively seeking better data alignment. Analytics Library uses a technique known as the indirect oblimin method. As with orthogonal rotations, there is a common equation for the oblique simple structure criterion that contains a constant that can be set for various effects. A value of 0 for this constant, call it gamma, yields the quartimin solution, which is the most oblique solution of those offered. A value of 1 yields the covarimin solution, the least oblique case. And a value of 0.5 yields the biquartimin solution, a compromise between the two. A solution known as orthomin can be achieved by setting the value of gamma to any desired positive value.

One of the distinctions of a factor solution that incorporates an oblique rotation is that the factor loadings must be thought of in terms of two different matrices: the factor pattern matrix P and the factor structure matrix S. These are related by the equation S = PQ, where Q is the matrix of correlations between factors. If the factors are not correlated, as in an unrotated solution or after an orthogonal rotation, then Q is the identity matrix and the structure and pattern matrices are the same. The result of an oblique rotation must therefore include both the pattern matrix that describes the common factors and the structure matrix of correlations between the factors and the original variables.

As with orthogonal rotations, oblique rotations are performed on normalized loadings that are restored to their original scale after rotation. A unique characteristic of the indirect oblimin method of rotation is that it is performed on a reference structure based on the normals of the original factor space. There is no inherent value in this; it is simply a side effect of the technique. It does mean, however, that an oblique rotation produces a reference factor pattern, structure, and rotation matrix that are then converted back into the original factor space as the final primary factor pattern, structure, and rotation matrix.