The function DecisionTree acts on the train set to create the model. The Single_Tree_Predict function uses that model and a test set to predict the output. The prediction accuracy is determined based on the original and prediction results.
Input
This example uses the iris data set (iris_input). The data has values for four attributes (sepal_length, sepal_width, petal_length and petal_width), which are grouped into three categories (setosa (1), versicolor (2), virginica (3)). From the raw data, the example creates training and test data sets.
id | sepal_length | sepal_width | petal_length | petal_width | species |
---|---|---|---|---|---|
1 | 5.1 | 3.5 | 1.4 | 0.2 | 1 |
2 | 4.9 | 3 | 1.4 | 0.2 | 1 |
3 | 4.7 | 3.2 | 1.3 | 0.2 | 1 |
4 | 4.6 | 3.1 | 1.5 | 0.2 | 1 |
5 | 5 | 3.6 | 1.4 | 0.2 | 1 |
6 | 5.4 | 3.9 | 1.7 | 0.4 | 1 |
7 | 4.6 | 3.4 | 1.4 | 0.3 | 1 |
8 | 5 | 3.4 | 1.5 | 0.2 | 1 |
9 | 4.4 | 2.9 | 1.4 | 0.2 | 1 |
10 | 4.9 | 3.1 | 1.5 | 0.1 | 1 |
... | ... | ... | ... | ... | ... |
This code divides the 150 data rows into a training data set (80%) and a test data set (20%):
DROP TABLE iris_train; DROP TABLE iris_test; CREATE MULTISET TABLE iris_train AS ( SELECT * FROM iris_input WHERE id%5!=0 ) WITH DATA; CREATE MULTISET TABLE iris_test AS ( SELECT * FROM iris_input WHERE id%5=0 ) WITH DATA;
This query returns the following table:
SELECT * FROM iris_train ORDER BY id;
id | sepal_length | sepal_width | petal_length | petal_width | species |
---|---|---|---|---|---|
1 | 5.1 | 3.5 | 1.4 | 0.2 | 1 |
2 | 4.9 | 3 | 1.4 | 0.2 | 1 |
3 | 4.7 | 3.2 | 1.3 | 0.2 | 1 |
4 | 4.6 | 3.1 | 1.5 | 0.2 | 1 |
6 | 5.4 | 3.9 | 1.7 | 0.4 | 1 |
7 | 4.6 | 3.4 | 1.4 | 0.3 | 1 |
8 | 5 | 3.4 | 1.5 | 0.2 | 1 |
9 | 4.4 | 2.9 | 1.4 | 0.2 | 1 |
11 | 5.4 | 3.7 | 1.5 | 0.2 | 1 |
12 | 4.8 | 3.4 | 1.6 | 0.2 | 1 |
13 | 4.8 | 3 | 1.4 | 0.1 | 1 |
14 | 4.3 | 3 | 1.1 | 0.1 | 1 |
16 | 5.7 | 4.4 | 1.5 | 0.4 | 1 |
... | ... | ... | ... | ... | ... |
This query returns the following table:
SELECT * FROM iris_test ORDER BY id;
id | sepal_length | sepal_width | petal_length | petal_width | species |
---|---|---|---|---|---|
5 | 5 | 3.6 | 1.4 | 0.2 | 1 |
10 | 4.9 | 3.1 | 1.5 | 0.1 | 1 |
15 | 5.8 | 4 | 1.2 | 0.2 | 1 |
20 | 5.1 | 3.8 | 1.5 | 0.3 | 1 |
25 | 4.8 | 3.4 | 1.9 | 0.2 | 1 |
30 | 4.7 | 3.2 | 1.6 | 0.2 | 1 |
35 | 4.9 | 3.1 | 1.5 | 0.2 | 1 |
40 | 5.1 | 3.4 | 1.5 | 0.2 | 1 |
45 | 5.1 | 3.8 | 1.9 | 0.4 | 1 |
50 | 5 | 3.3 | 1.4 | 0.2 | 1 |
55 | 6.5 | 2.8 | 4.6 | 1.5 | 2 |
60 | 5.2 | 2.7 | 3.9 | 1.4 | 2 |
65 | 5.6 | 2.9 | 3.6 | 1.3 | 2 |
70 | 5.6 | 2.5 | 3.9 | 1.1 | 2 |
75 | 6.4 | 2.9 | 4.3 | 1.3 | 2 |
80 | 5.7 | 2.6 | 3.5 | 1 | 2 |
85 | 5.4 | 3 | 4.5 | 1.5 | 2 |
90 | 5.5 | 2.5 | 4 | 1.3 | 2 |
95 | 5.6 | 2.7 | 4.2 | 1.3 | 2 |
100 | 5.7 | 2.8 | 4.1 | 1.3 | 2 |
105 | 6.5 | 3 | 5.8 | 2.2 | 3 |
110 | 7.2 | 3.6 | 6.1 | 2.5 | 3 |
115 | 5.8 | 2.8 | 5.1 | 2.4 | 3 |
120 | 6 | 2.2 | 5 | 1.5 | 3 |
125 | 6.7 | 3.3 | 5.7 | 2.1 | 3 |
130 | 7.2 | 3 | 5.8 | 1.6 | 3 |
135 | 6.1 | 2.6 | 5.6 | 1.4 | 3 |
140 | 6.9 | 3.1 | 5.4 | 2.1 | 3 |
145 | 6.7 | 3.3 | 5.7 | 2.5 | 3 |
150 | 5.9 | 3 | 5.1 | 1.8 | 3 |
The attribute tables, iris_attribute_train and iris_attribute_test, are created from the raw train and test data, iris_train and iris_test. The recommended way to create the attribute tables is to use the Unpivoting function:
CREATE MULTISET TABLE singletree1 AS ( SELECT * FROM Unpivoting ( ON iris_train USING TargetColumns ('sepal_length', 'sepal_width', 'petal_length', 'petal_width') AttributeColumn ('attribute') ValueColumn ('value_col') InputTypes ('false') Accumulate ('id') ) AS dt ) WITH DATA; CREATE MULTISET TABLE iris_attribute_train AS ( SELECT id AS pid, attribute, value_col AS attrvalue FROM singletree1 ) WITH DATA;
CREATE MULTISET TABLE singletree2 AS ( SELECT * FROM Unpivoting ( ON iris_test USING TargetColumns ('sepal_length', 'sepal_width', 'petal_length', 'petal_width') AttributeColumn ('attribute') ValueColumn ('value_col') InputTypes ('false') Accumulate ('id') ) AS dt ) WITH DATA; CREATE MULTISET TABLE iris_attribute_test AS ( SELECT id AS pid, attribute, value_col AS attrvalue FROM singletree2 ) WITH DATA;
This query returns the following table:
SELECT * FROM iris_attribute_train ORDER BY pid, attribute;
pid | attribute | attrvalue |
---|---|---|
1 | petal_length | 1.4 |
1 | petal_width | 0.2 |
1 | sepal_length | 5.1 |
1 | sepal_width | 3.5 |
2 | petal_length | 1.4 |
2 | petal_width | 0.2 |
2 | sepal_length | 4.9 |
2 | sepal_width | 3 |
3 | petal_length | 1.3 |
3 | petal_width | 0.2 |
3 | sepal_length | 4.7 |
3 | sepal_width | 3.2 |
... | ... | ... |
This query returns the following table:
SELECT * FROM iris_attribute_test ORDER BY pid, attribute;
pid | attribute | attrvalue |
---|---|---|
5 | petal_length | 1.4 |
5 | petal_width | 0.2 |
5 | sepal_length | 5 |
5 | sepal_width | 3.6 |
10 | petal_length | 1.5 |
10 | petal_width | 0.1 |
10 | sepal_length | 4.9 |
10 | sepal_width | 3.1 |
15 | petal_length | 1.2 |
15 | petal_width | 0.2 |
15 | sepal_length | 5.8 |
15 | sepal_width | 4 |
... | ... | ... |
Response tables, created from the raw train and test data, are used as inputs.
DROP TABLE iris_response_train; CREATE MULTISET TABLE iris_response_train (pid INTEGER, response VARCHAR); INSERT INTO iris_response_train SELECT id, species FROM iris_train;
This query returns the following table:
SELECT * FROM iris_response_train ORDER BY pid;
pid | response |
---|---|
1 | 1 |
2 | 1 |
3 | 1 |
4 | 1 |
6 | 1 |
7 | 1 |
8 | 1 |
9 | 1 |
11 | 1 |
12 | 1 |
13 | 1 |
14 | 1 |
16 | 1 |
... | ... |
DROP TABLE iris_response_test; CREATE MULTISET TABLE iris_response_test (pid INTEGER, response VARCHAR); INSERT INTO iris_response_test SELECT id, species FROM iris_test;
This query returns the following table:
SELECT * FROM iris_response_test ORDER BY pid;
pid | response |
---|---|
5 | 1 |
10 | 1 |
15 | 1 |
20 | 1 |
25 | 1 |
30 | 1 |
35 | 1 |
40 | 1 |
45 | 1 |
50 | 1 |
55 | 2 |
60 | 2 |
65 | 2 |
70 | 2 |
75 | 2 |
80 | 2 |
85 | 2 |
90 | 2 |
95 | 2 |
100 | 2 |
105 | 3 |
110 | 3 |
115 | 3 |
120 | 3 |
125 | 3 |
130 | 3 |
135 | 3 |
140 | 3 |
145 | 3 |
150 | 3 |
SQL Call
DROP TABLE iris_attribute_output; DROP TABLE splits_small; SELECT * FROM DecisionTree ( ON iris_attribute_train AS AttributeTableName ON iris_response_train AS ResponseTableName OUT TABLE OutputTable (iris_attribute_output) OUT TABLE IntermediateSplitsTable (splits_small) USING NumSplits (3) SplitMeasure ('gini') MaxDepth (10) IDColumns ('pid') AttributeNameColumns ('attribute') AttributeValueColumn ('attrvalue') ResponseColumn ('response') MinNodeSize (10) ApproxSplits ('false') ) AS dt;
Output
message |
---|
Output model table is successfully stored in the table specified in OutputTable argument. Depth of the tree is:6 |
This query returns the following table:
SELECT * FROM iris_attribute_output ORDER BY 1;
node_id | node_size | node_gini_p | node_entropy | node_chisq_pv | node_label | node_majorvotes | split_value | split_gini_p | split_entropy | split_chisq_pv | left_id | left_size | left_label | left_majorvotes | right_id | right_size | right_label | right_majorvotes | left_bucket | right_bucket | attribute |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 120 | 0.666666666666667 | 1.58496250072116 | 1 | 1 | 40 | 3 | 0.333333333333333 | 0.666666666666666 | 0 | 1 | 40 | 1 | 40 | 2 | 80 | 2 | 40 | petal_length | ||
2 | 80 | 0.5 | 1 | 1 | 2 | 40 | 1.70000004768372 | 0.0719199499687304 | 0.227979833481065 | 1.11022302462516e-16 | 5 | 39 | 2 | 38 | 6 | 41 | 3 | 39 | petal_width | ||
5 | 39 | 0.0499671268902038 | 0.172036949353113 | 1 | 2 | 38 | 4.90000009536743 | 0.0384615384615385 | 0.0832080127650393 | 0.00272911340977045 | 11 | 35 | 2 | 35 | 12 | 4 | 2 | 3 | petal_length | ||
6 | 41 | 0.0928019036287924 | 0.281193796432043 | 1 | 3 | 39 | 4.90000009536743 | 0.0840474620962426 | 0.240916755467913 | 0.0492232443463754 | 13 | 4 | 3 | 3 | 14 | 37 | 3 | 36 | petal_length | ||
14 | 37 | 0.0525931336742148 | 0.179256066928321 | 1 | 3 | 36 | 2.90000009536743 | 0.0518018018018018 | 0.162085811567472 | 0.455587679837851 | 29 | 13 | 3 | 13 | 30 | 24 | 3 | 23 | sepal_width | ||
30 | 24 | 0.0798611111111112 | 0.249882292833186 | 1 | 3 | 23 | 3.20000004768372 | 0.0773809523809524 | 0.216552190540511 | 0.387955122826146 | 61 | 14 | 3 | 13 | 62 | 10 | 3 | 10 | sepal_width | ||
61 | 14 | 0.13265306122449 | 0.371232326640875 | 1 | 3 | 13 | 6.30000019073486 | 0.131868131868132 | 0.363297594798595 | 0.773484680980946 | 123 | 1 | 3 | 1 | 124 | 13 | 3 | 12 | sepal_length |