To calculate the overall importance of each variable, group the importance values reported in the output table by variable, sum them over all trees, and divide by the number of trees. The query in this example shows how to make that calculation.
- Input table: rft_model, output by any of the DecisionForest Examples
The DecisionForest call that output rft_model set NumTrees to 50; therefore, the query divides the summed importance by 50 to calculate the average importance over all trees.
SELECT variable_col, SUM(importance)/50
FROM DecisionForestEvaluator (
  ON rft_model
) AS dt
GROUP BY variable_col
ORDER BY 2 DESC;
The output lists the variables in descending order of importance. The top three variables for modeling and prediction are price, lotsize, and stories.
variable_col IMPORTANCE
------------ ---------------------
price        1.1017219750593588
lotsize      0.19450014830967055
stories      0.07803449640626983
garagepl     0.0707099673003008
bathrms      0.05317959159771956
bedrooms     0.03751879285236206
fullbase     0.016919484930740327
recroom      0.013867229822064479
prefarea     0.013447400310153186
gashw        0.006672653654901577
airco        -0.016537859271588122
driveway     -0.048091065686165085
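The aggregation in the query above can be tried without an Aster cluster. The following is a minimal sketch, assuming a toy table named tree_importance (hypothetical) that stands in for the DecisionForestEvaluator output, with made-up importance values and NUM_TREES set to 4 instead of 50. It runs the same GROUP BY, SUM, and ORDER BY logic in SQLite through Python's standard sqlite3 module; it does not call DecisionForestEvaluator itself.

```python
import sqlite3

# Hypothetical stand-in for NumTrees (50 in the example above).
NUM_TREES = 4

# Toy per-tree importance rows; these values are invented for illustration
# and are not real DecisionForestEvaluator output.
rows = [
    ("price", 1.0), ("price", 1.5), ("price", 0.5), ("price", 1.0),
    ("lotsize", 0.25), ("lotsize", 0.5), ("lotsize", 0.25),
    ("airco", -0.5), ("airco", 0.25),
]

conn = sqlite3.connect(":memory:")
# tree_importance is a hypothetical table name used only in this sketch.
conn.execute("CREATE TABLE tree_importance (variable_col TEXT, importance REAL)")
conn.executemany("INSERT INTO tree_importance VALUES (?, ?)", rows)

# Same shape as the documented query: sum per variable, divide by the
# number of trees, and sort by the second column in descending order.
query = """
    SELECT variable_col, SUM(importance) / ? AS avg_importance
    FROM tree_importance
    GROUP BY variable_col
    ORDER BY 2 DESC
"""
result = conn.execute(query, (NUM_TREES,)).fetchall()

for variable, avg_importance in result:
    print(variable, avg_importance)
```

In this sketch, lotsize and airco have fewer rows than there are trees. Because the sum is divided by the tree count rather than by the number of rows, a tree that contributes no row for a variable effectively counts as zero importance for that variable.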
Download a zip file of all examples and a SQL script file that creates their input tables from the attachment in the left sidebar.