Splitting on Gini Diversity Index - Teradata Warehouse Miner

Teradata Warehouse Miner User Guide - Volume 3 Analytic Functions

Product
Teradata Warehouse Miner
Release Number
5.4.5
Published
February 2018
Language
English (United States)
Last Update
2018-05-04
dita:id
B035-2302
Product Category
Software
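Node impurity is the idea behind Gini diversity index split selection. In the notation of Breiman et al., the impurity of a node t is measured as a function of the class proportions in that node:

$$ i(t) = \phi\big(p(1 \mid t), \ldots, p(J \mid t)\big) $$

where p(j | t) is the proportion of cases in node t that belong to class j, for classes j = 1, ..., J.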

Impurity is at its maximum when the classes to be predicted are equally distributed in the node. As in the heads and tails example, impurity is highest if half the total is heads and the other half is tails. Conversely, if a sample contains only tails, the impurity is 0.

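The Gini index uses the following formula for its calculation of impurity:

$$ i(t) = 1 - \sum_{j} p(j \mid t)^2 $$

where the sum runs over the classes j, and p(j | t) is the proportion of cases in node t that belong to class j.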

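As a quick check of the heads and tails case, assuming the standard Gini form i(t) = 1 - \sum_j p(j | t)^2 with two classes (heads and tails), the arithmetic works out as follows:

$$ i(t) = 1 - (0.5^2 + 0.5^2) = 0.5 \quad \text{(half heads, half tails)} $$

$$ i(t) = 1 - (0^2 + 1^2) = 0 \quad \text{(tails only)} $$

With two classes, 0.5 is the largest value this index can take.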

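To determine the goodness of a split, the decrease in impurity produced by the split is used:

$$ \Delta i(s, t) = i(t) - p_L \, i(t_L) - p_R \, i(t_R) $$

where s is a candidate split of node t, t_L and t_R are the left and right subnodes of t, and p_L and p_R are the probabilities of being in those subnodes.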
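
The calculation can be illustrated with a minimal Python sketch. It assumes the standard forms described by Breiman et al., namely Gini impurity i(t) = 1 - sum_j p(j|t)^2 and split goodness i(t) - pL * i(tL) - pR * i(tR), and it estimates pL and pR as the fractions of the node's cases sent to each subnode. The function names `gini_impurity` and `split_goodness` are hypothetical and are not part of Teradata Warehouse Miner.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_j p(j|t)^2 for the class labels in one node."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention for this sketch: an empty node is treated as pure
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_goodness(left, right):
    """Decrease in impurity i(t) - pL * i(tL) - pR * i(tR).

    The parent node t is taken to be the union of the two subnodes,
    and it must contain at least one case.
    """
    parent = list(left) + list(right)
    n = len(parent)
    p_left = len(left) / n    # estimated probability of going left
    p_right = len(right) / n  # estimated probability of going right
    return (gini_impurity(parent)
            - p_left * gini_impurity(left)
            - p_right * gini_impurity(right))


# Illustrative node with 4 heads ("H") and 4 tails ("T").
perfect = split_goodness(["H"] * 4, ["T"] * 4)
useless = split_goodness(["H", "H", "T", "T"], ["H", "H", "T", "T"])
print(perfect, useless)
```

In this sketch, a split that separates heads from tails completely recovers all of the parent's impurity, while a split that leaves both subnodes as mixed as the parent yields no improvement. A tree builder would evaluate this quantity for each candidate split and prefer the largest value.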

For a detailed description of this type of tree, see [Breiman, Friedman, Olshen and Stone].