This SQL statement creates a table that contains the principal components that correspond to the three top-ranked eigenvectors—that is, the principal components that explain ~98% of the total variance. The statement calculates these components by linearly combining the input attributes for each patient (from Input) with the three top-ranked eigenvectors (shown in the table pca_health_ev).
This method of generating the principal components uses uncentered input, whereas the function PCAPlot_stub uses centered input that it computes internally. The two methods generate principal components that are different but equivalent. PCAPlot, which is more compact, is recommended when the data has many dimensions and requires many principal components.
CREATE TABLE pca_health_pc DISTRIBUTE BY REPLICATION AS SELECT pid, pca_1, pca_2, pca_3 FROM (SELECT pid, (a.age * b.age) + (a.bmi * b.bmi) + (a.bloodpressure * b.bloodpressure) + (a.glucose * b.glucose) + (a.strokes * b.strokes) + (a.cigarettes * b.cigarettes) + (a.insulin * b.insulin) + (a.hdl * b.hdl) as pca_1 FROM pca_health_ev a, patient_pca_input b WHERE a.component_rank = 1) a JOIN (SELECT pid, (a.age * b.age) + (a.bmi * b.bmi) + (a.bloodpressure * b.bloodpressure) + (a.glucose * b.glucose) + (a.strokes * b.strokes) + (a.cigarettes * b.cigarettes) + (a.insulin * b.insulin) + (a.hdl * b.hdl) as pca_2 FROM pca_health_ev a, patient_pca_input b WHERE a.component_rank = 2) b USING(pid) JOIN (SELECT pid, (a.age * b.age) + (a.bmi * b.bmi) + (a.bloodpressure * b.bloodpressure) + (a.glucose * b.glucose) + (a.strokes * b.strokes) + (a.cigarettes * b.cigarettes) + (a.insulin * b.insulin) + (a.hdl * b.hdl) as pca_3 FROM pca_health_ev a, patient_pca_input b WHERE a.component_rank = 3) c USING(pid) order by pid;