The function DecisionTree acts on the training set to create the model. The DecisionTreePredict_MLE function uses that model and a test set to predict the output. The prediction accuracy is determined based on the original and prediction results.
Input
This example uses the iris data set (iris_input). The data has values for four attributes (sepal_length, sepal_width, petal_length and petal_width), which are grouped into three categories (setosa (1), versicolor (2), virginica (3)). From the raw data, the example creates training and test data sets.
id | sepal_length | sepal_width | petal_length | petal_width | species |
---|---|---|---|---|---|
1 | 5.1 | 3.5 | 1.4 | 0.2 | 1 |
2 | 4.9 | 3 | 1.4 | 0.2 | 1 |
3 | 4.7 | 3.2 | 1.3 | 0.2 | 1 |
4 | 4.6 | 3.1 | 1.5 | 0.2 | 1 |
5 | 5 | 3.6 | 1.4 | 0.2 | 1 |
6 | 5.4 | 3.9 | 1.7 | 0.4 | 1 |
7 | 4.6 | 3.4 | 1.4 | 0.3 | 1 |
8 | 5 | 3.4 | 1.5 | 0.2 | 1 |
9 | 4.4 | 2.9 | 1.4 | 0.2 | 1 |
10 | 4.9 | 3.1 | 1.5 | 0.1 | 1 |
... | ... | ... | ... | ... | ... |
This code divides the 150 data rows into a training data set (80%) and a test data set (20%):
DROP TABLE iris_train; DROP TABLE iris_test; CREATE MULTISET TABLE iris_train AS ( SELECT * FROM iris_input WHERE id MOD 5 <> 0 ) WITH DATA; CREATE MULTISET TABLE iris_test AS ( SELECT * FROM iris_input WHERE id MOD 5 = 0 ) WITH DATA;
Alternatively, you can do the preceding task with the Sampling or RandomSample function.
This query returns the following table:
SELECT * FROM iris_train ORDER BY id;
id | sepal_length | sepal_width | petal_length | petal_width | species |
---|---|---|---|---|---|
1 | 5.1 | 3.5 | 1.4 | 0.2 | 1 |
2 | 4.9 | 3 | 1.4 | 0.2 | 1 |
3 | 4.7 | 3.2 | 1.3 | 0.2 | 1 |
4 | 4.6 | 3.1 | 1.5 | 0.2 | 1 |
6 | 5.4 | 3.9 | 1.7 | 0.4 | 1 |
7 | 4.6 | 3.4 | 1.4 | 0.3 | 1 |
8 | 5 | 3.4 | 1.5 | 0.2 | 1 |
9 | 4.4 | 2.9 | 1.4 | 0.2 | 1 |
11 | 5.4 | 3.7 | 1.5 | 0.2 | 1 |
12 | 4.8 | 3.4 | 1.6 | 0.2 | 1 |
13 | 4.8 | 3 | 1.4 | 0.1 | 1 |
14 | 4.3 | 3 | 1.1 | 0.1 | 1 |
16 | 5.7 | 4.4 | 1.5 | 0.4 | 1 |
... | ... | ... | ... | ... | ... |
This query returns the following table:
SELECT * FROM iris_test ORDER BY id;
id | sepal_length | sepal_width | petal_length | petal_width | species |
---|---|---|---|---|---|
5 | 5 | 3.6 | 1.4 | 0.2 | 1 |
10 | 4.9 | 3.1 | 1.5 | 0.1 | 1 |
15 | 5.8 | 4 | 1.2 | 0.2 | 1 |
20 | 5.1 | 3.8 | 1.5 | 0.3 | 1 |
25 | 4.8 | 3.4 | 1.9 | 0.2 | 1 |
30 | 4.7 | 3.2 | 1.6 | 0.2 | 1 |
35 | 4.9 | 3.1 | 1.5 | 0.2 | 1 |
40 | 5.1 | 3.4 | 1.5 | 0.2 | 1 |
45 | 5.1 | 3.8 | 1.9 | 0.4 | 1 |
50 | 5 | 3.3 | 1.4 | 0.2 | 1 |
55 | 6.5 | 2.8 | 4.6 | 1.5 | 2 |
60 | 5.2 | 2.7 | 3.9 | 1.4 | 2 |
65 | 5.6 | 2.9 | 3.6 | 1.3 | 2 |
70 | 5.6 | 2.5 | 3.9 | 1.1 | 2 |
75 | 6.4 | 2.9 | 4.3 | 1.3 | 2 |
80 | 5.7 | 2.6 | 3.5 | 1 | 2 |
85 | 5.4 | 3 | 4.5 | 1.5 | 2 |
90 | 5.5 | 2.5 | 4 | 1.3 | 2 |
95 | 5.6 | 2.7 | 4.2 | 1.3 | 2 |
100 | 5.7 | 2.8 | 4.1 | 1.3 | 2 |
105 | 6.5 | 3 | 5.8 | 2.2 | 3 |
110 | 7.2 | 3.6 | 6.1 | 2.5 | 3 |
115 | 5.8 | 2.8 | 5.1 | 2.4 | 3 |
120 | 6 | 2.2 | 5 | 1.5 | 3 |
125 | 6.7 | 3.3 | 5.7 | 2.1 | 3 |
130 | 7.2 | 3 | 5.8 | 1.6 | 3 |
135 | 6.1 | 2.6 | 5.6 | 1.4 | 3 |
140 | 6.9 | 3.1 | 5.4 | 2.1 | 3 |
145 | 6.7 | 3.3 | 5.7 | 2.5 | 3 |
150 | 5.9 | 3 | 5.1 | 1.8 | 3 |
The attribute tables, iris_attribute_train and iris_attribute_test, are created from the raw train and test data, iris_train and iris_test. The recommended way to create the attribute tables is to use the Unpivoting function:
CREATE MULTISET TABLE singletree1 AS ( SELECT * FROM Unpivoting ( ON iris_train USING TargetColumns ('sepal_length', 'sepal_width', 'petal_length', 'petal_width') AttributeColumn ('attribute') ValueColumn ('value_col') InputTypes ('false') Accumulate ('id') ) AS dt ) WITH DATA; CREATE MULTISET TABLE iris_attribute_train AS ( SELECT id AS pid, attribute, value_col AS attrvalue FROM singletree1 ) WITH DATA;
CREATE MULTISET TABLE singletree2 AS ( SELECT * FROM Unpivoting ( ON iris_test USING TargetColumns ('sepal_length', 'sepal_width', 'petal_length', 'petal_width') AttributeColumn ('attribute') ValueColumn ('value_col') InputTypes ('false') Accumulate ('id') ) AS dt ) WITH DATA; CREATE MULTISET TABLE iris_attribute_test AS ( SELECT id AS pid, attribute, value_col AS attrvalue FROM singletree2 ) WITH DATA;
This query returns the following table:
SELECT * FROM iris_attribute_train ORDER BY pid, attribute;
pid | attribute | attrvalue |
---|---|---|
1 | petal_length | 1.4 |
1 | petal_width | 0.2 |
1 | sepal_length | 5.1 |
1 | sepal_width | 3.5 |
2 | petal_length | 1.4 |
2 | petal_width | 0.2 |
2 | sepal_length | 4.9 |
2 | sepal_width | 3 |
3 | petal_length | 1.3 |
3 | petal_width | 0.2 |
3 | sepal_length | 4.7 |
3 | sepal_width | 3.2 |
... | ... | ... |
This query returns the following table:
SELECT * FROM iris_attribute_test ORDER BY pid, attribute;
pid | attribute | attrvalue |
---|---|---|
5 | petal_length | 1.4 |
5 | petal_width | 0.2 |
5 | sepal_length | 5 |
5 | sepal_width | 3.6 |
10 | petal_length | 1.5 |
10 | petal_width | 0.1 |
10 | sepal_length | 4.9 |
10 | sepal_width | 3.1 |
15 | petal_length | 1.2 |
15 | petal_width | 0.2 |
15 | sepal_length | 5.8 |
15 | sepal_width | 4 |
... | ... | ... |
Response tables, created from the raw train and test data, are used as inputs.
DROP TABLE iris_response_train; CREATE MULTISET TABLE iris_response_train (pid INTEGER, response VARCHAR); INSERT INTO iris_response_train SELECT id, species FROM iris_train;
This query returns the following table:
SELECT * FROM iris_response_train ORDER BY pid;
pid | response |
---|---|
1 | 1 |
2 | 1 |
3 | 1 |
4 | 1 |
6 | 1 |
7 | 1 |
8 | 1 |
9 | 1 |
11 | 1 |
12 | 1 |
13 | 1 |
14 | 1 |
16 | 1 |
... | ... |
DROP TABLE iris_response_test; CREATE MULTISET TABLE iris_response_test (pid INTEGER, response VARCHAR); INSERT INTO iris_response_test SELECT id, species FROM iris_test;
This query returns the following table:
SELECT * FROM iris_response_test ORDER BY pid;
pid | response |
---|---|
5 | 1 |
10 | 1 |
15 | 1 |
20 | 1 |
25 | 1 |
30 | 1 |
35 | 1 |
40 | 1 |
45 | 1 |
50 | 1 |
55 | 2 |
60 | 2 |
65 | 2 |
70 | 2 |
75 | 2 |
80 | 2 |
85 | 2 |
90 | 2 |
95 | 2 |
100 | 2 |
105 | 3 |
110 | 3 |
115 | 3 |
120 | 3 |
125 | 3 |
130 | 3 |
135 | 3 |
140 | 3 |
145 | 3 |
150 | 3 |
SQL Call
DROP TABLE iris_attribute_output; DROP TABLE splits_small; SELECT * FROM DecisionTree ( ON iris_attribute_train AS AttributeTable ON iris_response_train AS ResponseTable OUT TABLE OutputTable (iris_attribute_output) OUT TABLE IntermediateSplitsTable (splits_small) USING NumSplits (3) SplitMeasure ('gini') MaxDepth (10) IDColumns ('pid') AttributeNameColumns ('attribute') AttributeValueColumn ('attrvalue') ResponseColumn ('response') MinNodeSize (10) ApproxSplits ('false') ) AS dt;
Output
message ----------------------------------------------------------------------------------------- Output model table is successfully stored in the table specified in OutputTable argument. Depth of the tree is:5
SELECT * FROM iris_attribute_output;
node_id node_size node_gini_p node_entropy node_chisq_pv node_label node_majorvotes split_value split_gini_p split_entropy split_chisq_pv left_id left_size left_label left_majorvotes right_id right_size right_label right_majorvotes left_bucket right_bucket attribute ------- --------- ------------------- ------------------- ------------- ----------- --------------- ------------------ ------------------- ------------------- ---------------------- ------- --------- ----------- --------------- -------- ---------- ----------- ---------------- ----------- ------------ -------------- 2 90 0.5977777777777777 1.4900684346901674 1.0 2 40 1.7999999523162842 0.2459045406780598 0.6980234505326226 1.1102230246251565E-16 5 41 2 39 6 49 3 38 petal_width 6 49 0.3740108288213244 1.0467992835555564 1.0 3 38 6.300000190734863 0.2924397031539888 0.7112635941634022 1.4851330869192214E-4 13 16 3 8 14 33 3 30 sepal_length 11 38 0.05124653739612184 0.17556502585750336 1.0 2 37 5.5 0.04210526315789473 0.09499053880096872 0.009226867937210503 23 5 2 4 24 33 2 33 sepal_length 30 24 0.21875 0.5435644431995974 1.0 3 21 3.200000047683716 0.19999999999999996 0.4512050593046013 0.15149399240422035 61 15 3 12 62 9 3 9 sepal_width 0 130 0.7125443786982248 1.9220774796203766 1.0 1 40 1.0 0.4138461538461538 1.031585839400885 0.0 1 40 1 40 2 90 2 40 petal_width 5 41 0.09280190362879237 0.2811937964320433 1.0 2 39 5.0 0.08001711596063324 0.22991118255484383 0.01747194813144859 11 38 2 37 12 3 2 2 petal_length 13 16 0.5546875 1.2717822215997983 1.0 3 8 3.200000047683716 0.20833333333333331 0.5548650349110991 0.0019848295804182348 27 7 3 7 28 9 5 7 sepal_width 14 33 0.1652892561983471 0.4394969869215134 1.0 3 30 2.9000000953674316 0.1590909090909091 0.3953195950542527 0.2659534093128423 29 9 3 9 30 24 3 21 sepal_width
SELECT * FROM splits_small;
attribute percentile attrvalue -------------- ------------------ ------------------ sepal_length 66.66666666666667 6.300000190734863 petal_length 66.66666666666667 5.0 petal_width 33.333333333333336 1.0 sepal_width 33.333333333333336 2.9000000953674316 petal_width 100.0 2.5 sepal_width 100.0 4.400000095367432 sepal_length 33.333333333333336 5.5 petal_length 33.333333333333336 3.5 sepal_length 100.0 7.900000095367432 petal_length 100.0 6.900000095367432 petal_width 66.66666666666667 1.7999999523162842 sepal_width 66.66666666666667 3.200000047683716
Download a zip file of all examples and a SQL script file that creates their input tables from the attachment in the left sidebar.