Splitting on Information Gain Ratio | Vantage Analytics Library

Vantage Analytics Library User Guide

Product
Vantage Analytics Library
Release Number
2.2.0
Published
June 2025
Product Category
Teradata Vantage

The formulas in this topic use these terms:

Term Description
t Node
j Learning class
J Number of classes
s Split
N(t) Number of cases within node t
p(j|t) Proportion of class j learning samples in node t
ϕ Impurity function: a symmetric function whose maximum value is ϕ(1/J, 1/J, …, 1/J) and whose minimum value is ϕ(1, 0, …, 0) = ϕ(0, 1, …, 0) = … = ϕ(0, 0, …, 1) = 0
ti Subnode i of node t
i(t) Impurity measure of node t
tL Left-split subnode of node t
tR Right-split subnode of node t
X Predictor variable
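As a concrete instance of the impurity function ϕ defined above, the entropy of the class proportions (the measure this topic uses) is zero for a pure node and maximal for the uniform distribution over the J classes. A minimal Python sketch (illustrative only, not the Vantage Analytics Library implementation):

```python
import math

def entropy_impurity(p):
    """Entropy as an impurity function phi over class proportions p.

    Zero when one proportion is 1 (pure node); maximal, log2(J),
    when all J proportions equal 1/J (uniform distribution).
    """
    return sum(-q * math.log2(q) for q in p if q > 0)

print(entropy_impurity([1.0, 0.0, 0.0]))        # pure node -> 0.0
print(entropy_impurity([1/3, 1/3, 1/3]))        # uniform, J=3 -> log2(3) ~ 1.585
```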

An information gain ratio tree splits a categorical variable on each of its individual values, and splits a continuous variable at one point in an ordered list of its actual values (that is, it introduces a binary split at a particular value).
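For a continuous variable, this means each actual value in the ordered list is a candidate threshold for a binary split. A small sketch of this enumeration, with hypothetical helper names (not the Vantage Analytics Library API):

```python
def binary_split_candidates(values):
    """Candidate thresholds for a continuous predictor: each distinct
    actual value except the largest, so both subnodes are nonempty."""
    return sorted(set(values))[:-1]

def apply_split(rows, col, threshold):
    """Binary split: rows with value <= threshold go left, the rest right."""
    left = [r for r in rows if r[col] <= threshold]
    right = [r for r in rows if r[col] > threshold]
    return left, right

rows = [{"x": 3, "y": "b"}, {"x": 1, "y": "a"}, {"x": 2, "y": "a"}]
print(binary_split_candidates([r["x"] for r in rows]))  # -> [1, 2]
left, right = apply_split(rows, "x", 2)                 # 2 rows left, 1 right
```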

The tree uses this procedure for splitting:
  1. Define info(t) at node t as the entropy:

     info(t) = -Σ (j=1..J) p(j|t) log2 p(j|t)
  2. Define the following, where node t is split into subnodes ti by predictor variable X:

     infoX(t) = Σi (N(ti)/N(t)) info(ti)

     gain(X) = info(t) - infoX(t)

     split info(X) = -Σi (N(ti)/N(t)) log2 (N(ti)/N(t))

     gain ratio(X) = gain(X) / split info(X)
  3. Use the attribute with the highest gain ratio to split the data.
  4. Repeat this procedure on each subset until the observations are all of one class or a stopping criterion is met (for example, "each node must contain at least two observations").
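The gain ratio computation in steps 1 and 2 can be sketched in Python (hypothetical helper names, not the Vantage Analytics Library implementation):

```python
import math
from collections import Counter

def entropy(labels):
    """info(t): entropy of the class labels at a node, in bits."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(parent_labels, subsets):
    """Gain ratio of splitting parent_labels into the given label subsets."""
    n = len(parent_labels)
    # infoX(t): entropy of the subnodes, weighted by their share of cases
    info_x = sum(len(s) / n * entropy(s) for s in subsets)
    gain = entropy(parent_labels) - info_x
    # split info(X): entropy of the split proportions themselves
    split_info = sum(-(len(s) / n) * math.log2(len(s) / n) for s in subsets)
    return gain / split_info if split_info > 0 else 0.0

# Toy example: 8 cases split into two equal-sized subnodes
parent = ["a"] * 4 + ["b"] * 4
left, right = ["a", "a", "a", "b"], ["a", "b", "b", "b"]
print(round(gain_ratio(parent, [left, right]), 4))  # -> 0.1887
```

Here the parent entropy is 1 bit, each subnode has entropy ≈ 0.8113 bits, and the split info of a 50/50 split is 1 bit, giving a gain ratio of ≈ 0.1887.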

For a detailed description of an information gain ratio tree, see [Quinlan].