PCA Example | Teradata Vantage - PCA Example - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product
Teradata Vantage
Release Number
9.02
9.01
2.0
1.3
Published
February 2022
Language
English (United States)
Last Update
2022-02-10
dita:mapPath
rnn1580259159235.ditamap
dita:ditavalPath
ybt1582220416951.ditaval
dita:id
B700-4003
lifecycle
previous
Product Category
Teradata Vantageā„¢
This example uses the PCAMap and PCAReduce functions to calculate the principal components of a data set. The output displays the identified principal components, ranked according to standard deviation.

Input

The input table, patient_pca_input, contains medical data for 25 patients, identified by patient ID (pid).

Data attributes:

  • Age (years)
  • Body mass index (BMI) (kg/m2)
  • Blood pressure (mm Hg)
  • Blood glucose level (mg/dL)
  • Strokes (number experienced)
  • Cigarettes (number smoked/month)
  • Insulin (mg/dL)
  • High-density lipoproteins (HDL) (mg/dL)
patient_pca_input
pid age bmi bloodpressure glucose strokes cigarettes insulin hdl
1 50 33.6 72 148 6 35 0 62.7
2 31 26.6 66 85 1 29 0 35.1
3 32 23.3 64 183 8 0 0 67.2
4 21 28.1 66 89 1 23 94 16.7
5 33 43.1 40 137 0 35 168 228.8
6 30 25.6 74 116 5 0 0 20.1
7 26 31 50 78 3 32 88 24.8
8 29 35.3 0 115 10 0 0 13.4
9 53 30.5 70 197 2 45 543 15.8
10 54 0 96 125 8 0 0 23.2
11 30 37.6 92 110 4 0 0 19.1
12 34 38 74 168 10 0 0 53.7
13 57 27.1 80 139 10 0 0 144.1
14 59 30.1 60 189 1 23 846 39.8
15 51 25.8 72 166 5 19 175 58.7
16 32 30 0 100 7 0 0 48.4
17 31 45.8 84 118 0 47 230 55.1
18 31 29.6 74 107 7 0 0 25.4
19 33 43.3 30 103 1 38 83 18.3
20 32 34.6 70 115 1 30 96 52.9
21 27 39.3 88 126 3 41 235 70.4
22 50 35.4 84 99 8 0 0 38.8
23 41 39.8 90 196 7 0 0 45.1
24 29 29 80 119 9 35 0 26.3
25 51 36.6 94 143 11 33 146 25.4

Normalized Input

The reason to normalize the input variables before inputting them to the PCA function is that if some variables have much larger variance than others, they dominate in the first few principal components.

The following commands use the Scale and ScaleMap functions to create a table of normalized input variables, pca_scaled, from the patient_pca_input data set:

CREATE MULTISET TABLE tmpScale AS (
  SELECT * FROM ScaleMap (                                                                      ON patient_pca_input
    USING
    TargetColumns ('[1:8]')
    MissValue ('omit')
  ) AS dt
) WITH DATA;
CREATE MULTISET TABLE pca_scaled AS (
  SELECT * FROM Scale (
    ON patient_pca_input AS InputTable PARTITION BY ANY
    ON tmpScale AS statistic DIMENSION
    USING
    ScaleMethod ('std')
    Accumulate ('pid')
  ) AS dt
) WITH DATA;

SQL Calls

This query conducts a principal component analysis on the normalized input data:

SELECT * FROM PCAReduce (
  ON (SELECT * FROM  PCAMap (
  ON pca_scaled
  USING
  TargetColumns('[1:8]')
) AS dt1 ) PARTITION BY 1
) AS dt;

Output

 component_rank age                  bmi                 bloodpressure        glucose              strokes              cigarettes            insulin              hdl                   sd                  var_proportion       cumulative_var      mean                                                                                                                                                                          
 -------------- -------------------- ------------------- -------------------- -------------------- -------------------- --------------------- -------------------- --------------------- ------------------- -------------------- ------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
              1  0.08209456624137079 -0.3870046090920526  0.09345439689066688 -0.04216593563916513   0.5531712742637565   -0.5556501224007842 -0.43685302464081466   -0.1654996736437919   1.472109758037739  0.28217540881639736 0.28217540881639736 [-2.486899575160351E-16, -6.661338147750939E-16, 9.769962616701378E-17, -1.0658141036401502E-16, 1.7763568394002505E-17, -5.329070518200751E-17, 0.0, -5.329070518200751E-17]
              2  -0.6028733018225988 0.17861430770359402 -0.31903125250345443  -0.5700201930411378 -0.11893913442095008 -0.005196208048840971  -0.3943619845210589  -0.09078467587773241  1.4241934069709463   0.2641050599556657   0.546280468772063 NULL                                                                                                                                                                         
              3  0.02374359397742567 0.39186196324224787 -0.30591120403327365  0.23782383985742445   0.2173850060727078   -0.1967429451346994  -0.1922943077460684    0.7568338469137748   1.049099321858834   0.1433085139485242  0.6895889827205872 NULL                                                                                                                                                                         
              4 -0.09998801486200944 0.21435051698676735   0.8444287615246641  -0.0918928665696382 0.011305174477644377    0.2111352882378259  -0.3579033088614018    0.2230500507326031   0.922502599277917   0.1108087299055356  0.8003977126261228 NULL                                                                                                                                                                         
              5 -0.14085382468973354  0.6716168063561156 0.015501081352144553   0.3351077534037482  0.34696752296785255 -0.038650465429363866 0.043038964135395014   -0.5411292846166164  0.8339617788814034  0.09055888654102022   0.890956599167143 NULL                                                                                                                                                                         
              6   0.5273985574183678  0.1083454629714406 -0.19613416785828364 -0.41491662766966964   0.4324034630774445    0.5429592283215245 -0.12781293754536002  -0.03706719659550343  0.6386214541768617 0.053103823142574176  0.9440604223097172 NULL                                                                                                                                                                         
              7   0.4246146518729586  0.3900261133262609  0.13557114512538138 -0.47994711699092446  -0.2574558219181479    -0.557165978164135  0.20322505229037413 -0.028903434023275044  0.5350610892169184   0.0372773918221348   0.981337814131852 NULL                                                                                                                                                                         
              8  0.37544142271498704  0.0624033352765424 -0.16048460010093996  0.30576508343845227   -0.508751667006649  0.034749606866136186  -0.6550398914761275  -0.21630745584000433 0.37858365979975517 0.018662185868148023                 1.0 NULL                                                                    

Output for Unnormalized Input

If you run the model without first normalizing the input, you get the output in the following table.

The loadings for each principal component are different. For example, for the first principal component, which captures the most variance among input space, insulin has a much larger loading than every other variable. This is because the raw insulin values, ranging from 70 to 190, are much larger in magnitude than the values for most of the other variables (for example, BMI is usually 18 to 30).

Teradata usually recommends normalization when variables have different variances. However, in some data sets, the magnitudes of input variables do matter, and normalization is not recommended.

 component_rank age                   bmi                    bloodpressure         glucose              strokes                cigarettes            insulin               hdl                    sd                 var_proportion        cumulative_var     mean                                                                  
 -------------- --------------------- ---------------------- --------------------- -------------------- ---------------------- --------------------- --------------------- ---------------------- ------------------ --------------------- ------------------ --------------------------------------------------------------------- 
              1 -0.023353196034411227 -0.0050495648788435234 -0.001586693912675234 -0.08420322566619312    0.00859257029825338  -0.04133292194971656   -0.9952634091735024 -0.0022210309439499357 194.71738506207055    0.9035607310846709 0.9035607310846709 [37.88, 31.964000000000002, 66.8, 130.84, 5.12, 18.6, 108.16, 49.172]
              2   0.04018597618816406    0.04041268203719046  -0.02234490883189587  0.21768528557138409 -0.0034452736055330785   0.01718483933772813 -0.022445681193031374     0.9736810189820199  46.33698444340015   0.05116858707616014  0.954729318160831 NULL                                                                 
              3   0.16820726416678253   -0.06380548610918492   0.47282409285264465   0.8207662481650843    0.05397534805226847  -0.18045788796064954  -0.06546598054699179   -0.17507466173441805  32.12651862361091  0.024596608174712386 0.9793259263355433 NULL                                                                 
              4 -0.012965205864271578  0.0037804534935903873   -0.8583798540505494  0.40689903096324975    0.03913361827590789  -0.29046477452424657 -0.020135596705586265   -0.10549004121778502 23.059456330930484   0.01267202492569593 0.9919979512612392 NULL                                                                 
              5  -0.23359815217156288    0.32078777092180283  -0.12435451993064969   0.3099450633949961  -0.056895698485650424    0.8461243985086968  -0.05759515026265747   -0.09228380566106692 14.473486862384386  0.004992225949626681 0.9969901772108658 NULL                                                                 
              6    0.7433370943722172    -0.5151426717933871  -0.15319563481636883 -0.02823965187653383    0.07056464778179963   0.38970912809735386 -0.027739402093030042   -0.01376818417563162  8.892127729665095 0.0018843400370318058 0.9988745172478976 NULL                                                                 
              7     0.585637286534736     0.7907071454804706 -0.005530068854870849 -0.09254912263578012    0.10792172202547658  -0.10181406223812656 -0.004677778713769677    -0.0343536603012951  6.538193488247754 0.0010187401634787858 0.9998932574113765 NULL                                                                 
              8   0.13888402112454176    0.02759219001059113 -0.012492398266161555 0.029497751534728455    -0.9877276605628905 -0.053811065310627085  -0.01213027272971143   -0.01658362623945359   2.11638619445855  1.067425886235164E-4                1.0 NULL                                                                 

Download a zip file of all examples and a SQL script file that creates their input tables.