Step 1: Create Smaller Data Set - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product
Teradata Vantage
Release Number
8.00
1.0
Published
May 2019
Language
English (United States)
Last Update
2019-11-22
dita:mapPath
blj1506016597986.ditamap
dita:ditavalPath
blj1506016597986.ditaval
dita:id
B700-4003
lifecycle
previous
Product Category
Teradata Vantageā„¢

Input

The following statement creates the input table, pca_health_ev2, from the table pca_scaled (created in PCA Example), using only the variables age, bmi, bloodpressure, glucose, cigarettes, insulin, and hdl:

CREATE MULTISET TABLE pca_health_ev2 AS (
  SELECT * FROM PCAReduce (
    ON PCAMap (
      ON pca_scaled
      USING
      TargetColumns ('age','bmi','bloodpressure','glucose','cigarettes', 'insulin','hdl')
    ) PARTITION BY 1
  ) AS dt2
) WITH DATA;

This query returns the following table:

SELECT * FROM pca_health_ev2;
pca_health_ev2
component_rank age bmi bloodpressure glucose cigarettes insulin hdl sd var_proportion cumulative_var mean
1 0.558383429424359 -0.0605952043265538 0.279191393011342 0.562811446671035 0.156485831353259 0.497004566613813 0.135389267935288 1.42680503704651 0.302942353235311 0.302942353235311 [-2.753353101070388E-16, 8.881784197001253E-17, 1.0658141036401502E-16, -7.993605777301127E-17, -1.3322676295501878E-17, 7.771561172376097E-18, -1.7763568394002505E-17]
2 -0.250537483690061 0.614686329749836 -0.258383085318548 -0.0474790977051386 0.579594554667163 0.305498725264089 0.247217076371136 1.27431560304133 0.241648847642051 0.544591200877363  
3 -0.0911563605166551 -0.141748695621042 0.251924555111245 -0.220162314985357 0.392249448405349 0.284736376332311 -0.790396081112643 1.01834410187873 0.154319153248689 0.698910354126052  
4 -0.0965156342595097 0.203200180598669 0.843333872842729 -0.0927285011581401 0.201301142840861 -0.36378506450842 0.238164018417497 0.92244700110317 0.126623284203011 0.825533638329062  
5 -0.180835348784396 0.599972991222924 0.079755498943645 0.485460834090224 -0.419465995079716 -0.0691208326400593 -0.42961426285989 0.780266259616559 0.0905975351035737 0.916131173432636  
6 0.741419465051456 0.431342026006021 -0.0814633949974965 -0.452698251899937 -0.0606741267486036 -0.16172687758833 -0.151103636140504 0.573696298386201 0.0489772980330401 0.965108471465676  
7 -0.159404889398745 0.105272966756617 0.260029076069335 -0.428148411420131 -0.51533240361692 0.644444986031744 0.178058065989532 0.484222130587456 0.0348915285343237 1  

SQL Call

This query uses the PCAScore function to compute the first three score variables, which correspond to the first three principal components, accumulating the response variable from the original input data table patient_pca_input:

CREATE MULTISET TABLE pcr1 AS (
  SELECT * FROM PCAScore (
    ON pca_scaled AS InputTable PARTITION BY ANY
    ON pca_health_ev2 AS pca_table DIMENSION ORDER BY component_rank
    USING
    Accumulate ('pid', 'strokes')
    Components (3)
  ) AS dt
) WITH DATA;

Output

pcr1
pid strokes principal_component_1 principal_component_2 principal_component_3
3 0.795410174399195 0.195110284571174 -1.17825238356938 -1.04290585569861
6 -0.0331420905999665 -1.02505176519817 -1.2375999732885 0.257874309966306
8 1.3477783510653 -1.98240922334734 0.159007920294922 -0.494890984515023
11 -0.30932617893302 -1.00884248343362 -0.597930103709463 0.301060051568942
13 1.3477783510653 1.091766978011 -1.16460787259922 -2.17662153529429
16 0.519226086066141 -1.93840477808707 -0.0623801090526926 -0.936238178663301
19 -1.13787844393218 -1.14924574475721 1.7169286389402 0.588543776698959
21 -0.585510267266074 0.144749031593665 1.56905010871958 0.52484021423205
24 1.07159426273225 -0.65879037722031 0.120392373582135 0.915771821734939
2 -1.13787844393218 -1.27086507768138 -0.0479281470754135 0.732287038132982
5 -1.41406253226524 0.308542873080395 2.70938975527651 -3.04594531776213
7 -0.585510267266074 -1.61658595260732 0.71226267926454 0.957505372598163
10 0.795410174399195 0.723400299704323 -3.73227341951059 0.574740342890179
15 -0.0331420905999665 1.50953494759911 -0.643862090689209 -0.235674114998759
18 0.519226086066141 -1.13206482033264 -0.946451923900274 0.153459109295282
20 -1.13787844393218 -0.447229210945602 0.669928830364427 0.306483821831369
23 0.519226086066141 0.956038577607332 -0.649328546941474 -0.825039826251437
1 0.243041997733087 0.830295227878681 0.204892485389732 -0.205590257682169
4 -1.13787844393218 -1.57924450867301 0.122683543836927 1.080304476172
9 -0.861694355599128 3.08093421824054 0.794495651675079 1.2894482577175
12 1.3477783510653 0.0232126330286096 -0.372098909138089 -0.867874315202764
14 -1.13787844393218 3.77988804114414 0.62813244997315 0.744457517669959
17 -1.41406253226524 0.121200199618048 2.08086374773324 0.786060284561275
22 0.795410174399195 -0.215566553625099 -0.989679133035365 -0.169609921563112
25 1.62396243939836 1.25962718383174 0.134364427459238 0.787553912561683

This step has reduced the original data set from 25 observations with 7 predictors to 25 observations with 3 predictors (the 3 score variables principal_component_1, principal_component_2, and principal_component_3).