Splitting on Information Gain Ratio | Vantage Analytics Library

Vantage Analytics Library User Guide

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, Lake, VMware
Product: Vantage Analytics Library
Release Number: 2.2.0
Published: March 2023
Language: English (United States)
Last Update: 2024-01-02
Product Category: Teradata Vantage

The formulas in this topic use these terms:

Term    Description
t       Node
j       Learning class
J       Number of classes
s       Split
N(t)    Number of cases within node t
p(j|t)  Proportion of class j learning samples in node t
ϕ       Impurity function: a symmetric function with maximum value ϕ(1/J, 1/J, …, 1/J) and minimum values ϕ(1, 0, …, 0) = ϕ(0, 1, …, 0) = … = ϕ(0, 0, …, 1) = 0
tᵢ      Subnode i of node t
i(t)    Impurity measure of node t
tL      Left-split subnode of node t
tR      Right-split subnode of node t
X       Predictor variable

An information gain ratio tree splits categorical variables on each individual value and continuous variables at one point in an ordered list of actual values (that is, it introduces a binary split on a particular value).
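The candidate splits described above can be sketched as follows. This is an illustrative helper, not the Vantage Analytics Library implementation; the function name and signature are assumptions.

```python
def candidate_splits(values, categorical):
    """Enumerate candidate binary split points for one predictor.

    Illustrative sketch only (not the Vantage Analytics Library code).
    For a categorical variable, each individual value defines a binary
    split (value == v goes left, everything else goes right).  For a
    continuous variable, each point in the ordered list of actual
    values defines a binary split (value <= v goes left).
    """
    ordered = sorted(set(values))
    if categorical:
        return ordered
    # The largest value would send every case left, so it is skipped.
    return ordered[:-1]
```

For example, a categorical predictor with values a, b, c yields three candidate splits, while a continuous predictor with observed values 1, 2, 3 yields binary splits at 1 and at 2.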

The tree uses this procedure for splitting:
  1. Define info(t) at node t as the entropy:

    info(t) = −Σⱼ p(j|t) log₂ p(j|t), where the sum runs over the J classes
  2. Define the following, where split s on predictor variable X divides node t into subnodes tᵢ:

    infoX(t) = Σᵢ (N(tᵢ) / N(t)) · info(tᵢ)

    gain(X) = info(t) − infoX(t)

    split info(X) = −Σᵢ (N(tᵢ) / N(t)) log₂ (N(tᵢ) / N(t))

    gain ratio(X) = gain(X) / split info(X)
  3. Use the attribute with the highest gain ratio to split the data.
  4. Repeat this procedure on each subset until the observations are all of one class or a stopping criterion is true (for example, "each node must contain at least two observations").
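The quantities in steps 1 and 2 can be computed as in the following sketch. This is a minimal illustration of Quinlan's gain ratio under the assumption that class labels are given as Python lists; the function names are hypothetical and do not correspond to the Vantage Analytics Library API.

```python
import math
from collections import Counter

def entropy(labels):
    """info(t): entropy of the class labels within a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(parent_labels, subnode_labels):
    """Gain ratio of a split of a parent node into subnodes.

    Illustrative only: parent_labels is the list of class labels at
    node t; subnode_labels is a list of label lists, one per subnode.
    """
    n = len(parent_labels)
    # infoX(t): entropy of the subnodes, weighted by N(t_i) / N(t)
    info_x = sum(len(s) / n * entropy(s) for s in subnode_labels)
    gain = entropy(parent_labels) - info_x
    # split info(X): entropy of the split proportions themselves
    split_info = -sum((len(s) / n) * math.log2(len(s) / n)
                      for s in subnode_labels)
    return gain / split_info
```

A perfectly separating binary split of a balanced two-class node has gain 1 and split info 1, so its gain ratio is 1.0; dividing by split info penalizes splits that fragment the node into many small subnodes.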

For a detailed description of an information gain ratio tree, see [Quinlan].