1.1 - 8.10 - DecisionForestEvaluator Output - Teradata Vantage

Teradata Vantage™ - Machine Learning Engine Analytic Function Reference

Product: Teradata Vantage
Release Number: 1.1, 8.10
Release Date: October 2019
Content Type: Programming Reference
Publication ID: B700-4003-079K
Language: English (United States)

Output Table Schema

| Column | Data Type | Description |
|---|---|---|
| worker_ip | VARCHAR | IP address of worker that produced decision tree. |
| task_index | INTEGER | Identifier of worker that produced decision tree. |
| tree_num | INTEGER | Decision tree identifier. |
| variable_col | VARCHAR | Variable name. |
| level | INTEGER | Highest level of decision tree at which variable appears. |
| cnt | INTEGER | Number of times variable is used as split node in decision tree. |
| importance | DOUBLE PRECISION | Importance statistics for each decision tree. |

To calculate overall importance of each variable, you must group by variable and then take average over all trees. Use this query, where n is number of trees:
SELECT variable_col, sum(importance)/n
  FROM DecisionForestEvaluator (
    ON { table | view | (query) }
    [ NumLevels (number_of_levels) ]
  ) GROUP BY variable_col;
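
The aggregation that the query performs can be sketched outside the database. The following Python sketch assumes the evaluator output has already been fetched as a list of rows. The sample rows, the variable names `age` and `income`, and the helper `overall_importance` are hypothetical and only illustrate the arithmetic (sum of importance per variable, divided by number of trees).

```python
from collections import defaultdict

def overall_importance(rows, n_trees):
    """Sum per-tree importance for each variable, then divide by number of trees."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["variable_col"]] += row["importance"]
    # Dividing by n_trees (not by the row count) treats a tree that never
    # uses a variable as contributing zero importance for that variable.
    return {var: total / n_trees for var, total in totals.items()}

# Hypothetical output rows for a forest of 2 trees (values are made up).
rows = [
    {"tree_num": 0, "variable_col": "age", "importance": 0.4},
    {"tree_num": 1, "variable_col": "age", "importance": 0.2},
    {"tree_num": 0, "variable_col": "income", "importance": 0.1},
]
result = overall_importance(rows, n_trees=2)
```

Dividing by n rather than taking a plain average over rows matters when a variable does not appear in every tree, as with `income` in the sample rows.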

For classification tree:

Function measures importance by Gini impurity decrease. For each split, this is the formula for decrease in Gini impurity:

parent_node_Gini - left_node_Gini - right_node_Gini

Function records decrease in Gini impurity resulting from each split and accumulates these values over all nodes in all trees in forest, separately for each variable. Specific algorithm for calculating importance is described in "A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data," Bjoern H. Menze, B. Michael Kelm, Ralf Masuch, Uwe Himmelreich, Peter Bachert, Wolfgang Petrich, and Fred A. Hamprecht, 2009 (http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-213).
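
As a rough illustration of the accumulation described above, the following Python sketch computes the Gini decrease for each split and sums it per variable. It takes the formula literally as written on this page. Many implementations weight each child node by its share of the parent's samples, and this page does not say whether the function does so. The helper names and the sample splits are hypothetical.

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Decrease for one split, using the unweighted formula as written on this page."""
    return gini(parent) - gini(left) - gini(right)

def accumulate_importance(splits):
    """Sum Gini decrease over all splits, separately for each split variable."""
    totals = defaultdict(float)
    for variable, parent, left, right in splits:
        totals[variable] += gini_decrease(parent, left, right)
    return dict(totals)

# Hypothetical splits: (split variable, parent labels, left labels, right labels).
splits = [
    ("age", ["a", "a", "b", "b"], ["a", "a"], ["b", "b"]),
    ("income", ["a", "b", "b"], ["a"], ["b", "b"]),
    ("age", ["a", "b"], ["a"], ["b"]),
]
importance = accumulate_importance(splits)
```

In this sample every split yields pure child nodes, so each decrease equals the parent's Gini impurity.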

For regression tree:

Function calculates importance using mean squared error, described in "Variable Importance Assessment in Regression: Linear Regression versus Random Forest," Ulrike Grömping, 2009 (https://prof.beuth-hochschule.de/fileadmin/prof/groemp/downloads/tast_2E2009_2E08199.pdf).
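
This page does not give the regression formula. One common reading of an importance measure based on mean squared error is the reduction in MSE achieved by a split, with each child node weighted by its share of the parent's samples. The Python sketch below shows that reading only. It is an assumption, not the function's documented method, and the helper names and sample values are hypothetical.

```python
def mse(values):
    """Mean squared error of values around their own mean."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def mse_decrease(parent, left, right):
    """Sample-weighted MSE reduction for one split (assumed form, see lead-in)."""
    n = len(parent)
    weighted_children = (len(left) * mse(left) + len(right) * mse(right)) / n
    return mse(parent) - weighted_children

# Hypothetical response values reaching a node and its two child nodes.
parent = [1.0, 1.0, 3.0, 3.0]
left = [1.0, 1.0]
right = [3.0, 3.0]
decrease = mse_decrease(parent, left, right)
```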