Step 1: Create Smaller Data Set

Step 1: Create Smaller Data Set - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product

Teradata Vantage

Release Number

8.00

1.0

Published

May 2019

Language

English (United States)

Last Update

2019-11-22

dita:mapPath

blj1506016597986.ditamap

dita:ditavalPath

blj1506016597986.ditaval

dita:id

B700-4003

lifecycle

Product Category

Teradata Vantage™

Input

The following statement creates the input table, pca_health_ev2, from the table pca_scaled (created in PCA Example), using only the variables age, bmi, bloodpressure, glucose, cigarettes, insulin, and hdl:

CREATE MULTISET TABLE pca_health_ev2 AS (
  SELECT * FROM PCAReduce (
    ON PCAMap (
      ON pca_scaled
      USING
      TargetColumns ('age','bmi','bloodpressure','glucose','cigarettes', 'insulin','hdl')
    ) PARTITION BY 1
  ) AS dt2
) WITH DATA;

This query returns the following table:

SELECT * FROM pca_health_ev2;

pca_health_ev2
component_rank	age	bmi	bloodpressure	glucose	cigarettes	insulin	hdl	sd	var_proportion	cumulative_var	mean
1	0.558383429424359	-0.0605952043265538	0.279191393011342	0.562811446671035	0.156485831353259	0.497004566613813	0.135389267935288	1.42680503704651	0.302942353235311	0.302942353235311	[-2.753353101070388E-16, 8.881784197001253E-17, 1.0658141036401502E-16, -7.993605777301127E-17, -1.3322676295501878E-17, 7.771561172376097E-18, -1.7763568394002505E-17]
2	-0.250537483690061	0.614686329749836	-0.258383085318548	-0.0474790977051386	0.579594554667163	0.305498725264089	0.247217076371136	1.27431560304133	0.241648847642051	0.544591200877363
3	-0.0911563605166551	-0.141748695621042	0.251924555111245	-0.220162314985357	0.392249448405349	0.284736376332311	-0.790396081112643	1.01834410187873	0.154319153248689	0.698910354126052
4	-0.0965156342595097	0.203200180598669	0.843333872842729	-0.0927285011581401	0.201301142840861	-0.36378506450842	0.238164018417497	0.92244700110317	0.126623284203011	0.825533638329062
5	-0.180835348784396	0.599972991222924	0.079755498943645	0.485460834090224	-0.419465995079716	-0.0691208326400593	-0.42961426285989	0.780266259616559	0.0905975351035737	0.916131173432636
6	0.741419465051456	0.431342026006021	-0.0814633949974965	-0.452698251899937	-0.0606741267486036	-0.16172687758833	-0.151103636140504	0.573696298386201	0.0489772980330401	0.965108471465676
7	-0.159404889398745	0.105272966756617	0.260029076069335	-0.428148411420131	-0.51533240361692	0.644444986031744	0.178058065989532	0.484222130587456	0.0348915285343237	1

SQL Call

This query uses the PCAScore function to compute the first three score variables, which correspond to the first three principal components, accumulating the response variable from the original input data table patient_pca_input:

CREATE MULTISET TABLE pcr1 AS (
  SELECT * FROM PCAScore (
    ON pca_scaled AS InputTable PARTITION BY ANY
    ON pca_health_ev2 AS pca_table DIMENSION ORDER BY component_rank
    USING
    Accumulate ('pid', 'strokes')
    Components (3)
  ) AS dt
) WITH DATA;

Output

pcr1
pid	strokes	principal_component_1	principal_component_2	principal_component_3
3	0.795410174399195	0.195110284571174	-1.17825238356938	-1.04290585569861
6	-0.0331420905999665	-1.02505176519817	-1.2375999732885	0.257874309966306
8	1.3477783510653	-1.98240922334734	0.159007920294922	-0.494890984515023
11	-0.30932617893302	-1.00884248343362	-0.597930103709463	0.301060051568942
13	1.3477783510653	1.091766978011	-1.16460787259922	-2.17662153529429
16	0.519226086066141	-1.93840477808707	-0.0623801090526926	-0.936238178663301
19	-1.13787844393218	-1.14924574475721	1.7169286389402	0.588543776698959
21	-0.585510267266074	0.144749031593665	1.56905010871958	0.52484021423205
24	1.07159426273225	-0.65879037722031	0.120392373582135	0.915771821734939
2	-1.13787844393218	-1.27086507768138	-0.0479281470754135	0.732287038132982
5	-1.41406253226524	0.308542873080395	2.70938975527651	-3.04594531776213
7	-0.585510267266074	-1.61658595260732	0.71226267926454	0.957505372598163
10	0.795410174399195	0.723400299704323	-3.73227341951059	0.574740342890179
15	-0.0331420905999665	1.50953494759911	-0.643862090689209	-0.235674114998759
18	0.519226086066141	-1.13206482033264	-0.946451923900274	0.153459109295282
20	-1.13787844393218	-0.447229210945602	0.669928830364427	0.306483821831369
23	0.519226086066141	0.956038577607332	-0.649328546941474	-0.825039826251437
1	0.243041997733087	0.830295227878681	0.204892485389732	-0.205590257682169
4	-1.13787844393218	-1.57924450867301	0.122683543836927	1.080304476172
9	-0.861694355599128	3.08093421824054	0.794495651675079	1.2894482577175
12	1.3477783510653	0.0232126330286096	-0.372098909138089	-0.867874315202764
14	-1.13787844393218	3.77988804114414	0.62813244997315	0.744457517669959
17	-1.41406253226524	0.121200199618048	2.08086374773324	0.786060284561275
22	0.795410174399195	-0.215566553625099	-0.989679133035365	-0.169609921563112
25	1.62396243939836	1.25962718383174	0.134364427459238	0.787553912561683

This step has reduced the original data set from 25 observations with 7 predictors to 25 observations with 3 predictors (the 3 score variables principal_component_1, principal_component_2, and principal_component_3).