Input
The input table, patient_pca_input, contains medical data for 25 patients, identified by patient ID (pid).
Data attributes:
- Age (years)
- Body mass index (BMI) (kg/m2)
- Blood pressure (mm Hg)
- Blood glucose level (mg/dL)
- Strokes (number experienced)
- Cigarettes (number smoked/month)
- Insulin (mg/dL)
- High-density lipoproteins (HDL) (mg/dL)
pid | age | bmi | bloodpressure | glucose | strokes | cigarettes | insulin | hdl |
---|---|---|---|---|---|---|---|---|
1 | 50 | 33.6 | 72 | 148 | 6 | 35 | 0 | 62.7 |
2 | 31 | 26.6 | 66 | 85 | 1 | 29 | 0 | 35.1 |
3 | 32 | 23.3 | 64 | 183 | 8 | 0 | 0 | 67.2 |
4 | 21 | 28.1 | 66 | 89 | 1 | 23 | 94 | 16.7 |
5 | 33 | 43.1 | 40 | 137 | 0 | 35 | 168 | 228.8 |
6 | 30 | 25.6 | 74 | 116 | 5 | 0 | 0 | 20.1 |
7 | 26 | 31 | 50 | 78 | 3 | 32 | 88 | 24.8 |
8 | 29 | 35.3 | 0 | 115 | 10 | 0 | 0 | 13.4 |
9 | 53 | 30.5 | 70 | 197 | 2 | 45 | 543 | 15.8 |
10 | 54 | 0 | 96 | 125 | 8 | 0 | 0 | 23.2 |
11 | 30 | 37.6 | 92 | 110 | 4 | 0 | 0 | 19.1 |
12 | 34 | 38 | 74 | 168 | 10 | 0 | 0 | 53.7 |
13 | 57 | 27.1 | 80 | 139 | 10 | 0 | 0 | 144.1 |
14 | 59 | 30.1 | 60 | 189 | 1 | 23 | 846 | 39.8 |
15 | 51 | 25.8 | 72 | 166 | 5 | 19 | 175 | 58.7 |
16 | 32 | 30 | 0 | 100 | 7 | 0 | 0 | 48.4 |
17 | 31 | 45.8 | 84 | 118 | 0 | 47 | 230 | 55.1 |
18 | 31 | 29.6 | 74 | 107 | 7 | 0 | 0 | 25.4 |
19 | 33 | 43.3 | 30 | 103 | 1 | 38 | 83 | 18.3 |
20 | 32 | 34.6 | 70 | 115 | 1 | 30 | 96 | 52.9 |
21 | 27 | 39.3 | 88 | 126 | 3 | 41 | 235 | 70.4 |
22 | 50 | 35.4 | 84 | 99 | 8 | 0 | 0 | 38.8 |
23 | 41 | 39.8 | 90 | 196 | 7 | 0 | 0 | 45.1 |
24 | 29 | 29 | 80 | 119 | 9 | 35 | 0 | 26.3 |
25 | 51 | 36.6 | 94 | 143 | 11 | 33 | 146 | 25.4 |
Normalized Input
The reason to normalize the input variables before inputting them to the PCA function is that if some variables have much larger variance than others, they dominate in the first few principal components.
The following commands use the Scale and ScaleMap functions to create a table of normalized input variables, pca_scaled, from the patient_pca_input data set:
CREATE MULTISET TABLE tmpScale AS ( SELECT * FROM ScaleMap ( ON patient_pca_input USING InputColumns ('[1:8]') MissValue ('omit') ) AS dt ) WITH DATA;
CREATE MULTISET TABLE pca_scaled AS ( SELECT * FROM Scale ( ON patient_pca_input AS "input" PARTITION BY ANY ON tmpScale AS statistic DIMENSION USING "Method" ('std') Accumulate ('pid') ) AS dt ) WITH DATA;
SQL Calls
These queries conduct a principal component analysis on the normalized input data:
CREATE MULTISET TABLE tmp1 AS ( SELECT * FROM PCAMap ( ON pca_scaled USING TargetColumns ('[1:8]') ) AS dt ) WITH DATA;
CREATE MULTISET TABLE pca_health_ev_scaled AS ( SELECT * FROM PCAReduce ( ON tmp1 PARTITION BY 1 ) AS dt ) WITH DATA;
Output
component_rank | age | bmi | bloodpressure | glucose | strokes | cigarettes | insulin | hdl | sd | var_proportion | cumulative_var | mean |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 |
-0.0820945759812973 |
0.38700460531683 | -0.0934544029874236 | 0.0421659209573654 | -0.553171280035692 | 0.555650122648362 | 0.436853023913463 | 0.165499659734682 | 1.47210975433238 | 0.282175407395904 | 0.28217540739590 | [-2.753353101070388E-16, 8.881784197001253E-17, 1.0658141036401502E-16, -7.993605777301127E-17, 8.881784197001253E-18, -1.3322676295501878E-17, 7.771561172376097E-18, -1.7763568394002505E-17] |
2 | 0.60287329949617 | -0.178614306617327 | 0.319031250574667 | 0.570020193646336 | 0.11893912303648 | 0.00519622025475551 | 0.394361991280982 | 0.0907846812939654 | 1.42419340817409 | 0.26410506040189 | 0.546280467797795 | |
3 | -0.0237436026670202 | -0.391861948508353 | 0.305911231189858 | -0.237823829752043 | -0.217384991305528 | 0.196742944639624 | 0.19229428636829 | -0.756833856270543 | 1.04909931627422 | 0.143308512422791 | 0.689588980220586 | |
4 | 0.0999880092973573 | -0.214350502528417 | -0.84442875131615 | 0.0918928751867792 | -0.0113051638263718 | -0.211135286383886 | 0.35790331144032 | -0.223050100375743 | 0.92250259655368 | 0.110808729251078 | 0.800397709471664 | |
5 | 0.140853818736655 | -0.671616834838167 | -0.015501119425109 | -0.335107745512634 | -0.346967521387127 | 0.0386504810638121 | -0.0430389572065949 | 0.541129255060056 | 0.833961785351462 | 0.0905588879461716 | 0.890956597417836 | |
6 | 0.527398559607415 | 0.108345471463309 | -0.196134163484707 | -0.414916634531223 | 0.432403465056164 | 0.542959220893432 | -0.127812929085142 | -0.0370672018602792 | 0.638621453866185 | 0.0531038230909063 | 0.944060420508742 | |
7 | 0.424614643879208 | 0.390026086413646 | 0.135571152866786 | -0.479947126632426 | -0.257455827830989 | -0.557165985271577 | 0.20322506790708 | -0.0289034187287952 | 0.535061102539381 | 0.037277393678471 | 0.981337814187213 | |
8 | 0.375441433452479 | 0.062403350907142 | -0.160484597462421 | 0.30576507380862 | -0.50875166634292 | 0.0347495998643894 | -0.655039890019023 | -0.216307455362209 | 0.378583659238226 | 0.0186621858127871 | 1 |
Output for Unnormalized Input
If you run the model without first normalizing the input, you get the output in the following table.
The loadings for each principal component are different. For example, for the first principal component, which captures the most variance among input space, insulin has a much larger loading than every other variable. This is because the raw insulin values, ranging from 70 to 190, are much larger in magnitude than the values for most of the other variables (for example, BMI is usually 18 to 30).
Teradata usually recommends normalization when variables have different variances. However, in some datasets, the magnitudes of input variables do matter, and normalization is not recommended.
component_rank | age | bmi | bloodpressure | glucose | strokes | cigarettes | insulin | hdl | sd | var_proportion | cumulative_var | mean |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0.0233531960338227 | 0.00504956503997912 | 0.00158669391367759 | 0.0842032256610642 | -0.00859257029795621 | 0.0413329219482391 | 0.99526340917443 | 0.00222103039038422 | 194.717385062003 | 0.903560729160365 | 0.903560729160365 | 194.717385062003 |
2 | 0.0401859758008148 | 0.0404126752942246 | -0.0223449106172612 | 0.217685269075731 | -0.00344527374716509 | 0.0171848390743044 | -0.022445679214246 | 0.973681022974589 | 46.3369853980967 | 0.0511685890757095 | 0.954729318236074 | |
3 | -0.16820726382671 | 0.0638054862550787 | -0.472824089141384 | -0.820766253634683 | -0.0539753477374444 | 0.180457886400629 | 0.0654659811769922 | 0.175074647858464 | 32.126518678328 | 0.0245966082061308 | 0.979325926442205 | |
4 | -0.0129652073386729 | 0.00378045426074681 | -0.858379856202759 | 0.406899030303581 | 0.0391336176479417 | -0.290464771196449 | -0.0201355968289739 | -0.105490035413321 | 23.0594563502538 | 0.0126720249199549 | 0.99199795136216 | |
5 | -0.233598151299198 | 0.320787735507453 | -0.124354522206257 | 0.309945063989336 | -0.0568956998524478 | 0.846124412456638 | -0.0575951508612638 | -0.0922837968094822 | 14.4734867683913 | 0.00499222587415763 | 0.996990177236318 | |
6 | -0.743337103992336 | 0.515142674213825 | 0.153195632015632 | 0.0282396614323736 | -0.070564649365954 | -0.389709106729591 | 0.0277394005213322 | 0.0137681856551496 | 8.8921277044859 | 0.00188434002234855 | 0.998874517258666 | |
7 | -0.585637274413613 | -0.7907071584919 | 0.00553007147741568 | 0.0925491110066536 | -0.107921723264969 | 0.101814040327821 | 0.00467778052427058 | 0.0343536591559544 | 6.5381934611736 | 0.00101874015287284 | 0.999893257411539 | |
8 | -0.13888402260094 | -0.0275921930861039 | 0.0124923981893404 | -0.0294977512854948 | 0.987727660277147 | 0.0538110654780203 | 0.0121302727639756 | 0.016583625709794 | 2.11638619509984 | 0.00010674258846095 | 1 |