DecisionTree Example: Create Model - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product: Teradata Vantage
Release Number: 8.10; 1.1
Published: October 2019
Language: English (United States)
Last Update: 2019-12-31
dita:mapPath: ima1540829771750.ditamap
dita:ditavalPath: jsj1481748799576.ditaval
dita:id: B700-4003
lifecycle: previous
Product Category: Teradata Vantage™

The function DecisionTree acts on the training set to create the model. The DecisionTreePredict_MLE function uses that model and a test set to predict the output. The prediction accuracy is determined based on the original and prediction results.

Input

This example uses the iris data set (iris_input). The data has values for four attributes (sepal_length, sepal_width, petal_length and petal_width), which are grouped into three categories (setosa (1), versicolor (2), virginica (3)). From the raw data, the example creates training and test data sets.

Raw Data: iris_input
id	sepal_length	sepal_width	petal_length	petal_width	species
1	5.1	3.5	1.4	0.2	1
2	4.9	3	1.4	0.2	1
3	4.7	3.2	1.3	0.2	1
4	4.6	3.1	1.5	0.2	1
5	5	3.6	1.4	0.2	1
6	5.4	3.9	1.7	0.4	1
7	4.6	3.4	1.4	0.3	1
8	5	3.4	1.5	0.2	1
9	4.4	2.9	1.4	0.2	1
10	4.9	3.1	1.5	0.1	1
...	...	...	...	...	...

This code divides the 150 data rows into a training data set (80%) and a test data set (20%):

DROP TABLE iris_train;
DROP TABLE iris_test;

CREATE MULTISET TABLE iris_train AS (
  SELECT * FROM iris_input WHERE id MOD 5 <> 0
) WITH DATA;

CREATE MULTISET TABLE iris_test AS (
  SELECT * FROM iris_input WHERE id MOD 5 = 0
) WITH DATA;

Alternatively, you can do the preceding task with the Sampling or RandomSample function.

This query returns the following table:

SELECT * FROM iris_train ORDER BY id;

Training Data Set: iris_train
id	sepal_length	sepal_width	petal_length	petal_width	species
1	5.1	3.5	1.4	0.2	1
2	4.9	3	1.4	0.2	1
3	4.7	3.2	1.3	0.2	1
4	4.6	3.1	1.5	0.2	1
6	5.4	3.9	1.7	0.4	1
7	4.6	3.4	1.4	0.3	1
8	5	3.4	1.5	0.2	1
9	4.4	2.9	1.4	0.2	1
11	5.4	3.7	1.5	0.2	1
12	4.8	3.4	1.6	0.2	1
13	4.8	3	1.4	0.1	1
14	4.3	3	1.1	0.1	1
16	5.7	4.4	1.5	0.4	1
...	...	...	...	...	...

This query returns the following table:

SELECT * FROM iris_test ORDER BY id;

Test Data Set: iris_test
id	sepal_length	sepal_width	petal_length	petal_width	species
5	5	3.6	1.4	0.2	1
10	4.9	3.1	1.5	0.1	1
15	5.8	4	1.2	0.2	1
20	5.1	3.8	1.5	0.3	1
25	4.8	3.4	1.9	0.2	1
30	4.7	3.2	1.6	0.2	1
35	4.9	3.1	1.5	0.2	1
40	5.1	3.4	1.5	0.2	1
45	5.1	3.8	1.9	0.4	1
50	5	3.3	1.4	0.2	1
55	6.5	2.8	4.6	1.5	2
60	5.2	2.7	3.9	1.4	2
65	5.6	2.9	3.6	1.3	2
70	5.6	2.5	3.9	1.1	2
75	6.4	2.9	4.3	1.3	2
80	5.7	2.6	3.5	1	2
85	5.4	3	4.5	1.5	2
90	5.5	2.5	4	1.3	2
95	5.6	2.7	4.2	1.3	2
100	5.7	2.8	4.1	1.3	2
105	6.5	3	5.8	2.2	3
110	7.2	3.6	6.1	2.5	3
115	5.8	2.8	5.1	2.4	3
120	6	2.2	5	1.5	3
125	6.7	3.3	5.7	2.1	3
130	7.2	3	5.8	1.6	3
135	6.1	2.6	5.6	1.4	3
140	6.9	3.1	5.4	2.1	3
145	6.7	3.3	5.7	2.5	3
150	5.9	3	5.1	1.8	3

The attribute tables, iris_attribute_train and iris_attribute_test, are created from the raw train and test data, iris_train and iris_test. The recommended way to create the attribute tables is to use the Unpivoting function:

CREATE MULTISET TABLE singletree1 AS ( 
  SELECT * FROM Unpivoting (
    ON iris_train
    USING
    TargetColumns ('sepal_length', 'sepal_width', 'petal_length', 'petal_width')
    AttributeColumn ('attribute')
    ValueColumn ('value_col')
    InputTypes ('false')
    Accumulate ('id')
  ) AS dt
) WITH DATA;

CREATE MULTISET TABLE iris_attribute_train AS (
  SELECT id AS pid, attribute, value_col AS attrvalue FROM singletree1
) WITH DATA;

CREATE MULTISET TABLE singletree2 AS (
  SELECT * FROM Unpivoting (
    ON iris_test
    USING
    TargetColumns ('sepal_length', 'sepal_width', 'petal_length', 'petal_width')
    AttributeColumn ('attribute')
    ValueColumn ('value_col')
    InputTypes ('false')
    Accumulate ('id')
  ) AS dt
) WITH DATA;

CREATE MULTISET TABLE iris_attribute_test AS (
  SELECT id AS pid, attribute, value_col AS attrvalue FROM singletree2
) WITH DATA;

This query returns the following table:

SELECT * FROM iris_attribute_train ORDER BY pid, attribute;

Attribute Table: iris_attribute_train
pid	attribute	attrvalue
1	petal_length	1.4
1	petal_width	0.2
1	sepal_length	5.1
1	sepal_width	3.5
2	petal_length	1.4
2	petal_width	0.2
2	sepal_length	4.9
2	sepal_width	3
3	petal_length	1.3
3	petal_width	0.2
3	sepal_length	4.7
3	sepal_width	3.2
...	...	...

This query returns the following table:

SELECT * FROM iris_attribute_test ORDER BY pid, attribute;

Attribute Table: iris_attribute_test
pid	attribute	attrvalue
5	petal_length	1.4
5	petal_width	0.2
5	sepal_length	5
5	sepal_width	3.6
10	petal_length	1.5
10	petal_width	0.1
10	sepal_length	4.9
10	sepal_width	3.1
15	petal_length	1.2
15	petal_width	0.2
15	sepal_length	5.8
15	sepal_width	4
...	...	...

Response tables, created from the raw train and test data, are used as inputs.

DROP TABLE iris_response_train;
CREATE MULTISET TABLE iris_response_train (pid INTEGER, response VARCHAR);
INSERT INTO iris_response_train SELECT id, species FROM iris_train;

This query returns the following table:

SELECT * FROM iris_response_train ORDER BY pid;

Response Table: iris_response_train
pid	response
1	1
2	1
3	1
4	1
6	1
7	1
8	1
9	1
11	1
12	1
13	1
14	1
16	1
...	...

DROP TABLE iris_response_test;
CREATE MULTISET TABLE iris_response_test (pid INTEGER, response VARCHAR);
INSERT INTO iris_response_test SELECT id, species FROM iris_test;

This query returns the following table:

SELECT * FROM iris_response_test ORDER BY pid;

Response Table: iris_response_test
pid	response
5	1
10	1
15	1
20	1
25	1
30	1
35	1
40	1
45	1
50	1
55	2
60	2
65	2
70	2
75	2
80	2
85	2
90	2
95	2
100	2
105	3
110	3
115	3
120	3
125	3
130	3
135	3
140	3
145	3
150	3

SQL Call

DROP TABLE iris_attribute_output;
DROP TABLE splits_small;

SELECT * FROM DecisionTree (
  ON iris_attribute_train AS AttributeTable
  ON iris_response_train AS ResponseTable
  OUT TABLE OutputTable (iris_attribute_output)
  OUT TABLE IntermediateSplitsTable (splits_small)
  USING
  NumSplits (3)
  SplitMeasure ('gini')
  MaxDepth (10)
  IDColumns ('pid')
  AttributeNameColumns ('attribute')
  AttributeValueColumn ('attrvalue')
  ResponseColumn ('response')
  MinNodeSize (10)
  ApproxSplits ('false')
) AS dt;

Output

 message                                                                                   
 ----------------------------------------------------------------------------------------- 
 Output model table is successfully stored in the table specified in OutputTable argument.
 Depth of the tree is:5

SELECT * FROM iris_attribute_output;

 node_id node_size node_gini_p         node_entropy        node_chisq_pv node_label  node_majorvotes split_value        split_gini_p        split_entropy       split_chisq_pv         left_id left_size left_label  left_majorvotes right_id right_size right_label right_majorvotes left_bucket right_bucket attribute      
 ------- --------- ------------------- ------------------- ------------- ----------- --------------- ------------------ ------------------- ------------------- ---------------------- ------- --------- ----------- --------------- -------- ---------- ----------- ---------------- ----------- ------------ -------------- 
       2        90  0.5977777777777777  1.4900684346901674           1.0           2              40 1.7999999523162842  0.2459045406780598  0.6980234505326226 1.1102230246251565E-16       5        41           2              39        6         49           3               38                          petal_width   
       6        49  0.3740108288213244  1.0467992835555564           1.0           3              38  6.300000190734863  0.2924397031539888  0.7112635941634022  1.4851330869192214E-4      13        16           3               8       14         33           3               30                          sepal_length  
      11        38 0.05124653739612184 0.17556502585750336           1.0           2              37                5.5 0.04210526315789473 0.09499053880096872   0.009226867937210503      23         5           2               4       24         33           2               33                          sepal_length  
      30        24             0.21875  0.5435644431995974           1.0           3              21  3.200000047683716 0.19999999999999996  0.4512050593046013    0.15149399240422035      61        15           3              12       62          9           3                9                          sepal_width   
       0       130  0.7125443786982248  1.9220774796203766           1.0           1              40                1.0  0.4138461538461538   1.031585839400885                    0.0       1        40           1              40        2         90           2               40                          petal_width   
       5        41 0.09280190362879237  0.2811937964320433           1.0           2              39                5.0 0.08001711596063324 0.22991118255484383    0.01747194813144859      11        38           2              37       12          3           2                2                          petal_length  
      13        16           0.5546875  1.2717822215997983           1.0           3               8  3.200000047683716 0.20833333333333331  0.5548650349110991  0.0019848295804182348      27         7           3               7       28          9           5                7                          sepal_width   
      14        33  0.1652892561983471  0.4394969869215134           1.0           3              30 2.9000000953674316  0.1590909090909091  0.3953195950542527     0.2659534093128423      29         9           3               9       30         24           3               21                          sepal_width

SELECT * FROM splits_small;

 attribute      percentile         attrvalue          
 -------------- ------------------ ------------------ 
 sepal_length    66.66666666666667  6.300000190734863
 petal_length    66.66666666666667                5.0
 petal_width    33.333333333333336                1.0
 sepal_width    33.333333333333336 2.9000000953674316
 petal_width                 100.0                2.5
 sepal_width                 100.0  4.400000095367432
 sepal_length   33.333333333333336                5.5
 petal_length   33.333333333333336                3.5
 sepal_length                100.0  7.900000095367432
 petal_length                100.0  6.900000095367432
 petal_width     66.66666666666667 1.7999999523162842
 sepal_width     66.66666666666667  3.200000047683716

Download a zip file of all examples and a SQL script file that creates their input tables from the attachment in the left sidebar.