DecisionTree Example: Create Model - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product
Teradata Vantage
Release Number
8.10
1.1
Published
October 2019
Language
English (United States)
Last Update
2019-12-31
dita:mapPath
ima1540829771750.ditamap
dita:ditavalPath
jsj1481748799576.ditaval
dita:id
B700-4003
lifecycle
previous
Product Category
Teradata Vantageā„¢

The function DecisionTree acts on the training set to create the model. The DecisionTreePredict_MLE function uses that model and a test set to predict the output. The prediction accuracy is determined based on the original and prediction results.

Input

This example uses the iris data set (iris_input). The data has values for four attributes (sepal_length, sepal_width, petal_length and petal_width), which are grouped into three categories (setosa (1), versicolor (2), virginica (3)). From the raw data, the example creates training and test data sets.

Raw Data: iris_input
id sepal_length sepal_width petal_length petal_width species
1 5.1 3.5 1.4 0.2 1
2 4.9 3 1.4 0.2 1
3 4.7 3.2 1.3 0.2 1
4 4.6 3.1 1.5 0.2 1
5 5 3.6 1.4 0.2 1
6 5.4 3.9 1.7 0.4 1
7 4.6 3.4 1.4 0.3 1
8 5 3.4 1.5 0.2 1
9 4.4 2.9 1.4 0.2 1
10 4.9 3.1 1.5 0.1 1
... ... ... ... ... ...

This code divides the 150 data rows into a training data set (80%) and a test data set (20%):

DROP TABLE iris_train;
DROP TABLE iris_test;

CREATE MULTISET TABLE iris_train AS (
  SELECT * FROM iris_input WHERE id MOD 5 <> 0
) WITH DATA;

CREATE MULTISET TABLE iris_test AS (
  SELECT * FROM iris_input WHERE id MOD 5 = 0
) WITH DATA;

Alternatively, you can do the preceding task with the Sampling or RandomSample function.

This query returns the following table:

SELECT * FROM iris_train ORDER BY id;
Training Data Set: iris_train
id sepal_length sepal_width petal_length petal_width species
1 5.1 3.5 1.4 0.2 1
2 4.9 3 1.4 0.2 1
3 4.7 3.2 1.3 0.2 1
4 4.6 3.1 1.5 0.2 1
6 5.4 3.9 1.7 0.4 1
7 4.6 3.4 1.4 0.3 1
8 5 3.4 1.5 0.2 1
9 4.4 2.9 1.4 0.2 1
11 5.4 3.7 1.5 0.2 1
12 4.8 3.4 1.6 0.2 1
13 4.8 3 1.4 0.1 1
14 4.3 3 1.1 0.1 1
16 5.7 4.4 1.5 0.4 1
... ... ... ... ... ...

This query returns the following table:

SELECT * FROM iris_test ORDER BY id;
Test Data Set: iris_test
id sepal_length sepal_width petal_length petal_width species
5 5 3.6 1.4 0.2 1
10 4.9 3.1 1.5 0.1 1
15 5.8 4 1.2 0.2 1
20 5.1 3.8 1.5 0.3 1
25 4.8 3.4 1.9 0.2 1
30 4.7 3.2 1.6 0.2 1
35 4.9 3.1 1.5 0.2 1
40 5.1 3.4 1.5 0.2 1
45 5.1 3.8 1.9 0.4 1
50 5 3.3 1.4 0.2 1
55 6.5 2.8 4.6 1.5 2
60 5.2 2.7 3.9 1.4 2
65 5.6 2.9 3.6 1.3 2
70 5.6 2.5 3.9 1.1 2
75 6.4 2.9 4.3 1.3 2
80 5.7 2.6 3.5 1 2
85 5.4 3 4.5 1.5 2
90 5.5 2.5 4 1.3 2
95 5.6 2.7 4.2 1.3 2
100 5.7 2.8 4.1 1.3 2
105 6.5 3 5.8 2.2 3
110 7.2 3.6 6.1 2.5 3
115 5.8 2.8 5.1 2.4 3
120 6 2.2 5 1.5 3
125 6.7 3.3 5.7 2.1 3
130 7.2 3 5.8 1.6 3
135 6.1 2.6 5.6 1.4 3
140 6.9 3.1 5.4 2.1 3
145 6.7 3.3 5.7 2.5 3
150 5.9 3 5.1 1.8 3

The attribute tables, iris_attribute_train and iris_attribute_test, are created from the raw train and test data, iris_train and iris_test. The recommended way to create the attribute tables is to use the Unpivoting function:

CREATE MULTISET TABLE singletree1 AS ( 
  SELECT * FROM Unpivoting (
    ON iris_train
    USING
    TargetColumns ('sepal_length', 'sepal_width', 'petal_length', 'petal_width')
    AttributeColumn ('attribute')
    ValueColumn ('value_col')
    InputTypes ('false')
    Accumulate ('id')
  ) AS dt
) WITH DATA;

CREATE MULTISET TABLE iris_attribute_train AS (
  SELECT id AS pid, attribute, value_col AS attrvalue FROM singletree1
) WITH DATA;
CREATE MULTISET TABLE singletree2 AS (
  SELECT * FROM Unpivoting (
    ON iris_test
    USING
    TargetColumns ('sepal_length', 'sepal_width', 'petal_length', 'petal_width')
    AttributeColumn ('attribute')
    ValueColumn ('value_col')
    InputTypes ('false')
    Accumulate ('id')
  ) AS dt
) WITH DATA;

CREATE MULTISET TABLE iris_attribute_test AS (
  SELECT id AS pid, attribute, value_col AS attrvalue FROM singletree2
) WITH DATA;

This query returns the following table:

SELECT * FROM iris_attribute_train ORDER BY pid, attribute;
Attribute Table: iris_attribute_train
pid attribute attrvalue
1 petal_length 1.4
1 petal_width 0.2
1 sepal_length 5.1
1 sepal_width 3.5
2 petal_length 1.4
2 petal_width 0.2
2 sepal_length 4.9
2 sepal_width 3
3 petal_length 1.3
3 petal_width 0.2
3 sepal_length 4.7
3 sepal_width 3.2
... ... ...

This query returns the following table:

SELECT * FROM iris_attribute_test ORDER BY pid, attribute;
Attribute Table: iris_attribute_test
pid attribute attrvalue
5 petal_length 1.4
5 petal_width 0.2
5 sepal_length 5
5 sepal_width 3.6
10 petal_length 1.5
10 petal_width 0.1
10 sepal_length 4.9
10 sepal_width 3.1
15 petal_length 1.2
15 petal_width 0.2
15 sepal_length 5.8
15 sepal_width 4
... ... ...

Response tables, created from the raw train and test data, are used as inputs.

DROP TABLE iris_response_train;
CREATE MULTISET TABLE iris_response_train (pid INTEGER, response VARCHAR);
INSERT INTO iris_response_train SELECT id, species FROM iris_train;

This query returns the following table:

SELECT * FROM iris_response_train ORDER BY pid;
Response Table: iris_response_train
pid response
1 1
2 1
3 1
4 1
6 1
7 1
8 1
9 1
11 1
12 1
13 1
14 1
16 1
... ...
DROP TABLE iris_response_test;
CREATE MULTISET TABLE iris_response_test (pid INTEGER, response VARCHAR);
INSERT INTO iris_response_test SELECT id, species FROM iris_test;

This query returns the following table:

SELECT * FROM iris_response_test ORDER BY pid;
Response Table: iris_response_test
pid response
5 1
10 1
15 1
20 1
25 1
30 1
35 1
40 1
45 1
50 1
55 2
60 2
65 2
70 2
75 2
80 2
85 2
90 2
95 2
100 2
105 3
110 3
115 3
120 3
125 3
130 3
135 3
140 3
145 3
150 3

SQL Call

DROP TABLE iris_attribute_output;
DROP TABLE splits_small;

SELECT * FROM DecisionTree (
  ON iris_attribute_train AS AttributeTable
  ON iris_response_train AS ResponseTable
  OUT TABLE OutputTable (iris_attribute_output)
  OUT TABLE IntermediateSplitsTable (splits_small)
  USING
  NumSplits (3)
  SplitMeasure ('gini')
  MaxDepth (10)
  IDColumns ('pid')
  AttributeNameColumns ('attribute')
  AttributeValueColumn ('attrvalue')
  ResponseColumn ('response')
  MinNodeSize (10)
  ApproxSplits ('false')
) AS dt;

Output

 message                                                                                   
 ----------------------------------------------------------------------------------------- 
 Output model table is successfully stored in the table specified in OutputTable argument.
 Depth of the tree is:5
SELECT * FROM iris_attribute_output;
 node_id node_size node_gini_p         node_entropy        node_chisq_pv node_label  node_majorvotes split_value        split_gini_p        split_entropy       split_chisq_pv         left_id left_size left_label  left_majorvotes right_id right_size right_label right_majorvotes left_bucket right_bucket attribute      
 ------- --------- ------------------- ------------------- ------------- ----------- --------------- ------------------ ------------------- ------------------- ---------------------- ------- --------- ----------- --------------- -------- ---------- ----------- ---------------- ----------- ------------ -------------- 
       2        90  0.5977777777777777  1.4900684346901674           1.0           2              40 1.7999999523162842  0.2459045406780598  0.6980234505326226 1.1102230246251565E-16       5        41           2              39        6         49           3               38                          petal_width   
       6        49  0.3740108288213244  1.0467992835555564           1.0           3              38  6.300000190734863  0.2924397031539888  0.7112635941634022  1.4851330869192214E-4      13        16           3               8       14         33           3               30                          sepal_length  
      11        38 0.05124653739612184 0.17556502585750336           1.0           2              37                5.5 0.04210526315789473 0.09499053880096872   0.009226867937210503      23         5           2               4       24         33           2               33                          sepal_length  
      30        24             0.21875  0.5435644431995974           1.0           3              21  3.200000047683716 0.19999999999999996  0.4512050593046013    0.15149399240422035      61        15           3              12       62          9           3                9                          sepal_width   
       0       130  0.7125443786982248  1.9220774796203766           1.0           1              40                1.0  0.4138461538461538   1.031585839400885                    0.0       1        40           1              40        2         90           2               40                          petal_width   
       5        41 0.09280190362879237  0.2811937964320433           1.0           2              39                5.0 0.08001711596063324 0.22991118255484383    0.01747194813144859      11        38           2              37       12          3           2                2                          petal_length  
      13        16           0.5546875  1.2717822215997983           1.0           3               8  3.200000047683716 0.20833333333333331  0.5548650349110991  0.0019848295804182348      27         7           3               7       28          9           5                7                          sepal_width   
      14        33  0.1652892561983471  0.4394969869215134           1.0           3              30 2.9000000953674316  0.1590909090909091  0.3953195950542527     0.2659534093128423      29         9           3               9       30         24           3               21                          sepal_width
SELECT * FROM splits_small;
 attribute      percentile         attrvalue          
 -------------- ------------------ ------------------ 
 sepal_length    66.66666666666667  6.300000190734863
 petal_length    66.66666666666667                5.0
 petal_width    33.333333333333336                1.0
 sepal_width    33.333333333333336 2.9000000953674316
 petal_width                 100.0                2.5
 sepal_width                 100.0  4.400000095367432
 sepal_length   33.333333333333336                5.5
 petal_length   33.333333333333336                3.5
 sepal_length                100.0  7.900000095367432
 petal_length                100.0  6.900000095367432
 petal_width     66.66666666666667 1.7999999523162842
 sepal_width     66.66666666666667  3.200000047683716

Download a zip file of all examples and a SQL script file that creates their input tables from the attachment in the left sidebar.