Output Table Schema
Column | Data Type | Description |
---|---|---|
worker_ip | VARCHAR | IP address of worker that produced decision tree. |
task_index | INTEGER | Identifier of worker that produced decision tree. |
tree_num | INTEGER | Decision tree identifier. |
variable_col | VARCHAR | Variable name. |
level | INTEGER | Highest level of decision tree at which variable appears. |
cnt | INTEGER | Number of times variable is used as split node in decision tree. |
importance | DOUBLE PRECISION | Importance statistics for each decision tree. To calculate overall importance of each variable, you must group by variable and then take average over all trees. Use this query, where n is number of trees: SELECT variable_col, sum(importance)/n FROM DecisionForestEvaluator ( ON { table | view | (query) } [ NumLevels (number_of_levels) ] ) GROUP BY variable_col; For classification tree: Function measures importance by Gini impurity decrease. For each split, this is the formula for decrease in Gini impurity: parent_node_Gini - left_node_Gini - right_node_Gini. Function records decrease in Gini impurity resulting from each split and accumulates these values for all nodes in all trees in forest, separately for each variable. Specific algorithm for calculating importance is described in "A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data," Bjoern H Menze, B Michael Kelm, Ralf Masuch, Uwe Himmelreich, Peter Bachert, Wolfgang Petrich and Fred A Hamprecht, 2009 (http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-213). For regression tree: Function calculates importance using mean squared error, described in "Variable Importance Assessment in Regression: Linear Regression versus Random Forest," Ulrike Grömping, 2009 (https://prof.beuth-hochschule.de/fileadmin/prof/groemp/downloads/tast_2E2009_2E08199.pdf). |
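
The aggregation described in the importance row can be sketched outside the database. This is a minimal illustration, not the function's implementation: it assumes the output rows have been fetched into Python dictionaries keyed by the column names in the table above, and the sample rows, the variable names age and income, and the helper names overall_importance and gini_decrease are hypothetical. The Gini helper applies the formula exactly as stated above; how the function computes each node's Gini term (for example, whether it is weighted by node size) is not specified here.

```python
from collections import defaultdict


def overall_importance(rows, n_trees):
    """Sum per-tree importance for each variable, then divide by the tree count.

    Mirrors sum(importance)/n ... GROUP BY on the variable column.
    """
    totals = defaultdict(float)
    for row in rows:
        totals[row["variable_col"]] += row["importance"]
    return {variable: total / n_trees for variable, total in totals.items()}


def gini_decrease(parent_node_gini, left_node_gini, right_node_gini):
    """Decrease in Gini impurity for one split, using the formula as stated."""
    return parent_node_gini - left_node_gini - right_node_gini


# Hypothetical output rows for a forest of two trees.
# Tree 1 never splits on "income", so it has no row for that variable.
rows = [
    {"tree_num": 0, "variable_col": "age", "importance": 0.5},
    {"tree_num": 0, "variable_col": "income", "importance": 0.25},
    {"tree_num": 1, "variable_col": "age", "importance": 0.25},
]

result = overall_importance(rows, n_trees=2)
print(result)
```

Dividing the sum by n, rather than averaging only the rows present, means a tree that never uses a variable contributes zero to that variable's overall importance.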