PCA Example - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product
Teradata Vantage
Release Number
8.00
1.0
Published
May 2019
Language
English (United States)
Last Update
2019-11-22
dita:mapPath
blj1506016597986.ditamap
dita:ditavalPath
blj1506016597986.ditaval
dita:id
B700-4003
lifecycle
previous
Product Category
Teradata Vantageā„¢
This example uses the PCAMap and PCAReduce functions to calculate the principal components of a data set. The output displays the identified principal components, ranked according to standard deviation.

Input

The input table, patient_pca_input, contains medical data for 25 patients, identified by patient ID (pid).

Data attributes:

  • Age (years)
  • Body mass index (BMI) (kg/m2)
  • Blood pressure (mm Hg)
  • Blood glucose level (mg/dL)
  • Strokes (number experienced)
  • Cigarettes (number smoked/month)
  • Insulin (mg/dL)
  • High-density lipoproteins (HDL) (mg/dL)
patient_pca_input
pid age bmi bloodpressure glucose strokes cigarettes insulin hdl
1 50 33.6 72 148 6 35 0 62.7
2 31 26.6 66 85 1 29 0 35.1
3 32 23.3 64 183 8 0 0 67.2
4 21 28.1 66 89 1 23 94 16.7
5 33 43.1 40 137 0 35 168 228.8
6 30 25.6 74 116 5 0 0 20.1
7 26 31 50 78 3 32 88 24.8
8 29 35.3 0 115 10 0 0 13.4
9 53 30.5 70 197 2 45 543 15.8
10 54 0 96 125 8 0 0 23.2
11 30 37.6 92 110 4 0 0 19.1
12 34 38 74 168 10 0 0 53.7
13 57 27.1 80 139 10 0 0 144.1
14 59 30.1 60 189 1 23 846 39.8
15 51 25.8 72 166 5 19 175 58.7
16 32 30 0 100 7 0 0 48.4
17 31 45.8 84 118 0 47 230 55.1
18 31 29.6 74 107 7 0 0 25.4
19 33 43.3 30 103 1 38 83 18.3
20 32 34.6 70 115 1 30 96 52.9
21 27 39.3 88 126 3 41 235 70.4
22 50 35.4 84 99 8 0 0 38.8
23 41 39.8 90 196 7 0 0 45.1
24 29 29 80 119 9 35 0 26.3
25 51 36.6 94 143 11 33 146 25.4

Normalized Input

The reason to normalize the input variables before inputting them to the PCA function is that if some variables have much larger variance than others, they dominate in the first few principal components.

The following commands use the Scale and ScaleMap functions to create a table of normalized input variables, pca_scaled, from the patient_pca_input data set:

CREATE MULTISET TABLE tmpScale AS (
  SELECT * FROM ScaleMap (                                                                      ON patient_pca_input
    USING
    InputColumns ('[1:8]')
    MissValue ('omit')
  ) AS dt
) WITH DATA;
CREATE MULTISET TABLE pca_scaled AS (
  SELECT * FROM Scale (
    ON patient_pca_input AS "input" PARTITION BY ANY
    ON tmpScale AS statistic DIMENSION
    USING
    "Method" ('std')
    Accumulate ('pid')
  ) AS dt
) WITH DATA;

SQL Calls

These queries conduct a principal component analysis on the normalized input data:

CREATE MULTISET TABLE tmp1 AS (
  SELECT * FROM PCAMap (
    ON pca_scaled
    USING
    TargetColumns ('[1:8]')
  ) AS dt
) WITH DATA;
CREATE MULTISET TABLE pca_health_ev_scaled AS (
  SELECT * FROM PCAReduce (
    ON tmp1 PARTITION BY 1
  ) AS dt
) WITH DATA;

Output

pca_health_ev_scaled
component_rank age bmi bloodpressure glucose strokes cigarettes insulin hdl sd var_proportion cumulative_var mean
1

-0.0820945759812973

0.38700460531683 -0.0934544029874236 0.0421659209573654 -0.553171280035692 0.555650122648362 0.436853023913463 0.165499659734682 1.47210975433238 0.282175407395904 0.28217540739590 [-2.753353101070388E-16, 8.881784197001253E-17, 1.0658141036401502E-16, -7.993605777301127E-17, 8.881784197001253E-18, -1.3322676295501878E-17, 7.771561172376097E-18, -1.7763568394002505E-17]
2 0.60287329949617 -0.178614306617327 0.319031250574667 0.570020193646336 0.11893912303648 0.00519622025475551 0.394361991280982 0.0907846812939654 1.42419340817409 0.26410506040189 0.546280467797795  
3 -0.0237436026670202 -0.391861948508353 0.305911231189858 -0.237823829752043 -0.217384991305528 0.196742944639624 0.19229428636829 -0.756833856270543 1.04909931627422 0.143308512422791 0.689588980220586  
4 0.0999880092973573 -0.214350502528417 -0.84442875131615 0.0918928751867792 -0.0113051638263718 -0.211135286383886 0.35790331144032 -0.223050100375743 0.92250259655368 0.110808729251078 0.800397709471664  
5 0.140853818736655 -0.671616834838167 -0.015501119425109 -0.335107745512634 -0.346967521387127 0.0386504810638121 -0.0430389572065949 0.541129255060056 0.833961785351462 0.0905588879461716 0.890956597417836  
6 0.527398559607415 0.108345471463309 -0.196134163484707 -0.414916634531223 0.432403465056164 0.542959220893432 -0.127812929085142 -0.0370672018602792 0.638621453866185 0.0531038230909063 0.944060420508742  
7 0.424614643879208 0.390026086413646 0.135571152866786 -0.479947126632426 -0.257455827830989 -0.557165985271577 0.20322506790708 -0.0289034187287952 0.535061102539381 0.037277393678471 0.981337814187213  
8 0.375441433452479 0.062403350907142 -0.160484597462421 0.30576507380862 -0.50875166634292 0.0347495998643894 -0.655039890019023 -0.216307455362209 0.378583659238226 0.0186621858127871 1  

Output for Unnormalized Input

If you run the model without first normalizing the input, you get the output in the following table.

The loadings for each principal component are different. For example, for the first principal component, which captures the most variance among input space, insulin has a much larger loading than every other variable. This is because the raw insulin values, ranging from 70 to 190, are much larger in magnitude than the values for most of the other variables (for example, BMI is usually 18 to 30).

Teradata usually recommends normalization when variables have different variances. However, in some datasets, the magnitudes of input variables do matter, and normalization is not recommended.

component_rank age bmi bloodpressure glucose strokes cigarettes insulin hdl sd var_proportion cumulative_var mean
1 0.0233531960338227 0.00504956503997912 0.00158669391367759 0.0842032256610642 -0.00859257029795621 0.0413329219482391 0.99526340917443 0.00222103039038422 194.717385062003 0.903560729160365 0.903560729160365 194.717385062003
2 0.0401859758008148 0.0404126752942246 -0.0223449106172612 0.217685269075731 -0.00344527374716509 0.0171848390743044 -0.022445679214246 0.973681022974589 46.3369853980967 0.0511685890757095 0.954729318236074  
3 -0.16820726382671 0.0638054862550787 -0.472824089141384 -0.820766253634683 -0.0539753477374444 0.180457886400629 0.0654659811769922 0.175074647858464 32.126518678328 0.0245966082061308 0.979325926442205  
4 -0.0129652073386729 0.00378045426074681 -0.858379856202759 0.406899030303581 0.0391336176479417 -0.290464771196449 -0.0201355968289739 -0.105490035413321 23.0594563502538 0.0126720249199549 0.99199795136216  
5 -0.233598151299198 0.320787735507453 -0.124354522206257 0.309945063989336 -0.0568956998524478 0.846124412456638 -0.0575951508612638 -0.0922837968094822 14.4734867683913 0.00499222587415763 0.996990177236318  
6 -0.743337103992336 0.515142674213825 0.153195632015632 0.0282396614323736 -0.070564649365954 -0.389709106729591 0.0277394005213322 0.0137681856551496 8.8921277044859 0.00188434002234855 0.998874517258666  
7 -0.585637274413613 -0.7907071584919 0.00553007147741568 0.0925491110066536 -0.107921723264969 0.101814040327821 0.00467778052427058 0.0343536591559544 6.5381934611736 0.00101874015287284 0.999893257411539  
8 -0.13888402260094 -0.0275921930861039 0.0124923981893404 -0.0294977512854948 0.987727660277147 0.0538110654780203 0.0121302727639756 0.016583625709794 2.11638619509984 0.00010674258846095 1