Output - Aster Analytics

Teradata Aster Analytics Foundation User Guide

Product
Aster Analytics
Release Number
6.21
Published
November 2016
Language
English (United States)
Last Update
2018-04-14
dita:id
B700-1021
lifecycle
previous
Product Category
Software

The output of the Forest_Analyze function is a table of model analysis data. The following table shows its schema.

Forest_Analyze Output Table Schema
Column      Data Type         Description
worker_ip   VARCHAR           The IP address of the worker that produced the decision tree.
task_index  INTEGER           The ID of the worker that produced the decision tree.
tree_num    INTEGER           The ID of the decision tree.
variable    VARCHAR           The name of the variable (predictor) that the row describes.
level       INTEGER           The highest level of the decision tree at which the variable appears.
cnt         INTEGER           The number of times that the variable is used as a split node in the decision tree.
importance  DOUBLE PRECISION  The importance statistic of the variable in the decision tree.

To find the overall importance of each variable, use this query, where n is the number of trees:
SELECT variable, sum(importance)/n
  FROM Forest_Analyze (
    ON { table | view | (query) }
    [ NumLevels (number_of_levels) ]
  ) GROUP BY variable;
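
The aggregation that this query performs can be sketched outside the database. The following Python example is only an illustration: the rows, the variable names (age, income), and the helper overall_importance are hypothetical and are not part of Aster Analytics. It sums the per-tree importance of each variable and divides by the number of trees, as the query does.

```python
from collections import defaultdict


def overall_importance(rows, n_trees):
    """Sum each variable's per-tree importance and divide by the tree count.

    Mirrors SELECT variable, sum(importance)/n ... GROUP BY variable.
    """
    totals = defaultdict(float)
    for row in rows:
        totals[row["variable"]] += row["importance"]
    return {variable: total / n_trees for variable, total in totals.items()}


# Hypothetical Forest_Analyze output rows for a forest of 2 trees.
rows = [
    {"tree_num": 0, "variable": "age", "importance": 0.5},
    {"tree_num": 0, "variable": "income", "importance": 0.25},
    {"tree_num": 1, "variable": "age", "importance": 0.25},
]

result = overall_importance(rows, n_trees=2)
print(result)
```

Dividing by n rather than averaging over the rows present means that a tree in which a variable never appears contributes zero to that variable's overall importance.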

For a classification tree, the function measures importance by the decrease in Gini impurity. For each split, the decrease in Gini impurity is: parent_node_Gini - left_node_Gini - right_node_Gini. The function records the decrease resulting from each split and accumulates these values over all nodes in all trees in the forest, separately for each variable. The algorithm for calculating importance is described in the paper "A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data," Bjoern H Menze, B Michael Kelm, Ralf Masuch, Uwe Himmelreich, Peter Bachert, Wolfgang Petrich and Fred A Hamprecht, 2009 (http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-213).
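
The per-split calculation can be sketched as follows. This Python example is a minimal illustration, not the function's implementation. It assumes that each node's Gini term is weighted by the number of rows in the node, which the text above does not state; the class labels, variable names, and helper names are hypothetical.

```python
from collections import Counter, defaultdict


def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(labels):
    """Assumption: weight the node's impurity by its row count."""
    return len(labels) * gini_impurity(labels)


def gini_decrease(parent, left, right):
    """parent_node_Gini - left_node_Gini - right_node_Gini for one split."""
    return weighted_gini(parent) - weighted_gini(left) - weighted_gini(right)


# Hypothetical splits: (split variable, parent labels, left labels, right labels).
splits = [
    ("age", ["a", "a", "b", "b"], ["a", "a"], ["b", "b"]),
    ("income", ["a", "b"], ["a"], ["b"]),
]

# Accumulate the decrease separately for each variable.
importance = defaultdict(float)
for variable, parent, left, right in splits:
    importance[variable] += gini_decrease(parent, left, right)

print(dict(importance))
```

With row-count weighting, a split can never increase the total, so the accumulated values are non-negative. Without weighting, the same expression can be negative, which is why the sketch makes the assumption explicit.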

For a regression tree, the function calculates importance using the mean squared error, as described in the paper "Variable Importance Assessment in Regression: Linear Regression versus Random Forest," Ulrike Grömping, 2009 (https://prof.beuth-hochschule.de/fileadmin/user/groemping/downloads/tast_2E2009_2E08199.pdf).
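
The text does not say how the mean squared error enters the calculation. One possible reading, by analogy with the Gini case, is that a split's contribution is the reduction in squared error that it produces. The Python sketch below assumes that reading; the response values and helper names are hypothetical, and the actual function may compute the statistic differently.

```python
def sum_squared_error(values):
    """Sum of squared deviations from the node mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((value - mean) ** 2 for value in values)


def mse_decrease(parent, left, right):
    """Assumed analogue of the Gini decrease for a regression split.

    Reduction in squared error from the split, divided by the number
    of rows in the parent node.
    """
    reduction = (
        sum_squared_error(parent)
        - sum_squared_error(left)
        - sum_squared_error(right)
    )
    return reduction / len(parent)


# Hypothetical response values before and after one split.
parent = [1.0, 1.0, 3.0, 3.0]
left = [1.0, 1.0]
right = [3.0, 3.0]

decrease = mse_decrease(parent, left, right)
print(decrease)
```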