Background - Aster Analytics

Teradata Aster Analytics Foundation User Guide

Product: Aster Analytics
Release Number: 6.21
Published: November 2016
Language: English (United States)
Last Update: 2018-04-14
Document ID: B700-1021

When you have thousands of input variables, there is a high probability that some of them are linearly correlated. There are two main reasons to reduce thousands of potentially linearly correlated input variables to a few linearly uncorrelated variables, called principal components:

  • Some statistical analysis tools, such as linear regression, do not allow linearly correlated inputs.
  • High dimensionality causes many problems for statistical tools, such as increased computation time and a higher risk of overfitting.

Given a data set with N observations and M variables, represented by an NxM matrix, PCA generates an MxM rotation matrix. Each column of the rotation matrix represents an axis in M-dimensional space. The first k columns are the k dimensions along which the data varies most (and thus in some cases are considered the most important). Discarding the remaining M-k columns leaves an Mxk rotation matrix. Multiplying the original NxM matrix by the Mxk rotation matrix produces an Nxk matrix that represents the data set with a reduced dimensionality of k, where k is less than or equal to M.
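As a minimal sketch of this rotate-and-project step (the data set and variable names here are synthetic, purely for illustration), the rotation matrix can be obtained by eigendecomposing the covariance matrix of the centered data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data set: N = 200 observations, M = 5 variables (illustrative only).
N, M = 200, 5
X = rng.normal(size=(N, M))
X[:, 3] = 2 * X[:, 0] + 0.01 * rng.normal(size=N)  # make one column nearly collinear

# Center the data, then eigendecompose the covariance matrix.
Xc = X - X.mean(axis=0)
cov = (Xc.T @ Xc) / (N - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # sort descending by variance
rotation = eigvecs[:, order]             # the M x M rotation matrix

# Keep the first k axes and project: (N x M) times (M x k) gives (N x k).
k = 2
reduced = Xc @ rotation[:, :k]
print(reduced.shape)  # (200, 2)
```

Each column of `rotation` is one axis in M-dimensional space; truncating to the first k columns and multiplying reduces the data set from M dimensions to k.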

Each eigenvector (output row, less the last standard deviation column) is a weighting scheme over the original input variables; therefore, the linear combination of the original variables using this eigenvector is a principal component. The multiplication works because the length of the eigenvector equals the number of original input variables. Selecting the first k eigenvectors produces the k principal components with the highest standard deviations (due to the eigenvector computation). These principal components are linearly uncorrelated and can be used as input variables in further analysis.
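These two properties can be verified numerically on synthetic data (a sketch, not Aster's implementation): the covariance matrix of the principal components is diagonal, so the components are uncorrelated, and each component's standard deviation is the square root of the corresponding eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 300, 4
X = rng.normal(size=(N, M)) @ rng.normal(size=(M, M))  # deliberately correlated variables

Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each principal component is a linear combination of the original variables
# weighted by one eigenvector; its standard deviation is sqrt(eigenvalue).
pcs = Xc @ eigvecs
pc_cov = np.cov(pcs, rowvar=False)

print(np.allclose(pc_cov, np.diag(eigvals), atol=1e-8))           # True: uncorrelated
print(np.allclose(np.std(pcs, axis=0, ddof=1), np.sqrt(eigvals))) # True
```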

Principal components are ranked by decreasing standard deviation, and thus by decreasing significance. The first several principal components usually explain 80%–90% of the total variance, which is sufficient for most applications.
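The fraction of total variance explained by the first k components is the cumulative sum of the first k eigenvalues divided by their total. A small sketch on synthetic data (constructed so that most variance lies in the first few directions):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 500
# Illustrative data: column scales chosen so variance concentrates up front.
X = rng.normal(size=(N, 6)) * np.array([10.0, 5.0, 2.0, 0.5, 0.2, 0.1])

Xc = X - X.mean(axis=0)
eigvals = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]

# Fraction of total variance explained by the first k components.
explained = np.cumsum(eigvals) / eigvals.sum()
for k, frac in enumerate(explained, start=1):
    print(f"first {k} components explain {frac:.1%} of the variance")
```

On data like this, the first two components already account for well over 90% of the variance, so discarding the remaining dimensions loses little information.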