Before building the decision tree, the decisiontree function generates SQL statements that return statistics about the attributes and the predicted variable. From these statistics, the algorithm does the following:
- Determines the cardinality of each attribute
- Gets all possible values of the predicted variable and the count of each value across all observations
- Initializes structures in memory for later use in the building process
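The statistics step above can be sketched as follows. This is a minimal illustration, not the SQL that the decisiontree function actually generates: it uses Python's built-in sqlite3 module as a stand-in database, and the table `weather`, its columns, and the helper `collect_statistics` are all hypothetical names chosen for the example.

```python
import sqlite3


def collect_statistics(conn, table, attributes, target):
    """Return (cardinality per attribute, count per target value).

    Hypothetical helper; identifiers are interpolated directly, so it
    is only suitable for trusted table and column names.
    """
    cur = conn.cursor()
    cardinality = {}
    for attr in attributes:
        # Cardinality = number of distinct values of the attribute.
        cur.execute(f'SELECT COUNT(DISTINCT "{attr}") FROM "{table}"')
        cardinality[attr] = cur.fetchone()[0]
    # Every value of the predicted variable with its count over all rows.
    cur.execute(
        f'SELECT "{target}", COUNT(*) FROM "{table}" GROUP BY "{target}"'
    )
    class_counts = dict(cur.fetchall())
    return cardinality, class_counts


# Assumed sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (outlook TEXT, windy TEXT, play TEXT)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?)",
    [
        ("sunny", "no", "no"),
        ("sunny", "yes", "no"),
        ("overcast", "no", "yes"),
        ("rainy", "no", "yes"),
        ("rainy", "yes", "no"),
        ("overcast", "yes", "yes"),
    ],
)

cardinality, class_counts = collect_statistics(
    conn, "weather", ["outlook", "windy"], "play"
)
# cardinality  == {"outlook": 3, "windy": 2}
# class_counts == {"no": 3, "yes": 3}
```

The two dictionaries stand in for the in-memory structures that are initialized for the later building steps.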
The SQL statement that drives the tree-building process builds a contingency table from the data. The contingency table is an m × n matrix. Its m rows correspond to the distinct values of an attribute. Its n columns correspond to the distinct values of the predicted variable.
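One common way to build such a contingency table in SQL is to group by the attribute and the predicted variable and count the rows in each group. The sketch below assumes that approach; the source does not show the actual statement. It again uses sqlite3 with a hypothetical `weather` table and a hypothetical helper, `contingency_table`.

```python
import sqlite3


def contingency_table(conn, table, attribute, target):
    """Return {attribute value: {target value: count}}.

    Hypothetical helper. Combinations that never occur in the data
    are absent from the result, which corresponds to a zero cell.
    """
    cur = conn.execute(
        f'SELECT "{attribute}", "{target}", COUNT(*) FROM "{table}" '
        f'GROUP BY "{attribute}", "{target}"'
    )
    counts = {}
    for attr_value, target_value, n in cur:
        counts.setdefault(attr_value, {})[target_value] = n
    return counts


# Assumed sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (outlook TEXT, windy TEXT, play TEXT)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?)",
    [
        ("sunny", "no", "no"),
        ("sunny", "yes", "no"),
        ("overcast", "no", "yes"),
        ("rainy", "no", "yes"),
        ("rainy", "yes", "no"),
        ("overcast", "yes", "yes"),
    ],
)

# Rows: the 3 values of "outlook". Columns: the 2 values of "play".
table = contingency_table(conn, "weather", "outlook", "play")
```

Here m = 3 and n = 2, so the full matrix has six cells, two of which are zero because "sunny" never occurs with "yes" and "overcast" never occurs with "no".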
The algorithm queries the contingency table to get the statistics it needs for its calculations. The query result consists of the counts of the n distinct values of the predicted variable. When building a contingency table on a subset of the data in the input table, the SQL statement includes a WHERE clause that defines the subset. The subset corresponds to the path down the tree that leads to the node that is a candidate for splitting.
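To illustrate how a path down the tree can become a WHERE clause, the sketch below represents the path as a list of (column, value) tests and joins them with AND. The path representation, the helper `subset_contingency_query`, and the `weather` table are assumptions made for this example, not details taken from the function itself.

```python
import sqlite3


def subset_contingency_query(table, attribute, target, path):
    """Build a contingency-table query restricted to one tree node.

    `path` is a list of (column, value) equality tests leading from
    the root to the candidate node (hypothetical representation).
    Values are passed as bound parameters; identifiers are
    interpolated, so only trusted names should be used.
    """
    conditions = " AND ".join(f'"{col}" = ?' for col, _ in path)
    where = f" WHERE {conditions}" if conditions else ""
    sql = (
        f'SELECT "{attribute}", "{target}", COUNT(*) FROM "{table}"{where} '
        f'GROUP BY "{attribute}", "{target}"'
    )
    params = [value for _, value in path]
    return sql, params


# Assumed sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (outlook TEXT, windy TEXT, play TEXT)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?)",
    [
        ("sunny", "no", "no"),
        ("sunny", "yes", "no"),
        ("overcast", "no", "yes"),
        ("rainy", "no", "yes"),
        ("rainy", "yes", "no"),
        ("overcast", "yes", "yes"),
    ],
)

# Candidate node reached by the path outlook = 'rainy'; evaluate "windy" there.
sql, params = subset_contingency_query(
    "weather", "windy", "play", [("outlook", "rainy")]
)
rows = conn.execute(sql, params).fetchall()
```

An empty path produces no WHERE clause, which corresponds to the root node, where the contingency table covers the whole input table.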