The formulas in this topic use these terms:
Term | Description |
---|---|
t | Node |
j | Learning class |
J | Number of classes |
s | Split |
N(t) | Number of cases within node t |
p(j|t) | Proportion of class j learning samples in node t |
ϕ | Impurity function: a symmetric function that takes its maximum value when all classes are equally likely, ϕ(1/J, 1/J, …, 1/J), and that is zero when the node is pure: ϕ(1, 0, …, 0) = ϕ(0, 1, …, 0) = … = ϕ(0, 0, …, 1) = 0
ti | Subnode i of node t |
i(t) | Impurity measure of node t |
tL | Left-split subnode of node t
tR | Right-split subnode of node t |
X | Predictor variable |
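The terms N(t), p(j|t), and the entropy impurity measure i(t) can be sketched in a few lines. This is an illustrative example, not part of the original topic; the function names `class_proportions` and `entropy_impurity` are chosen here for clarity.

```python
import math
from collections import Counter

def class_proportions(labels):
    """p(j|t): proportion of class j learning samples among the N(t) cases in node t."""
    n = len(labels)  # N(t)
    return {j: count / n for j, count in Counter(labels).items()}

def entropy_impurity(labels):
    """i(t) using the entropy impurity function:
    zero when the node is pure, maximal when all classes are equally mixed."""
    return -sum(p * math.log2(p) for p in class_proportions(labels).values())

node = ["a", "a", "b", "b"]      # N(t) = 4
print(class_proportions(node))   # p(a|t) = p(b|t) = 0.5
print(entropy_impurity(node))    # 1.0 bit: the maximum for J = 2 classes
```

Note that `entropy_impurity` satisfies the ϕ properties above: it is symmetric in the class proportions, zero for a pure node, and maximal at (1/J, …, 1/J).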
An information gain ratio tree splits categorical variables on each individual value, and continuous variables at one point in an ordered list of the actual values (that is, it introduces a binary split at a particular value).
The tree uses this procedure for splitting:
- Define info(t) at node t as the entropy: info(t) = −∑ p(j|t) log₂ p(j|t), where the sum runs over the J classes.
- Define the following, where predictor variable X splits node t into subnodes ti:
  - infoX(t) = ∑ (N(ti)/N(t)) info(ti)
  - gain(X) = info(t) − infoX(t)
  - split info(X) = −∑ (N(ti)/N(t)) log₂ (N(ti)/N(t))
  - gain ratio(X) = gain(X) / split info(X)
- Use the attribute with the highest gain ratio to split the data.
- Repeat this procedure on each subset until all observations in a node belong to one class or a stopping criterion is true (for example, "each node must contain at least two observations").
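The splitting procedure above can be sketched for a continuous predictor as follows. This is a minimal illustration, assuming the standard gain ratio definitions (gain divided by split info) and binary splits at midpoints between consecutive distinct values; the function names are invented for this sketch.

```python
import math
from collections import Counter

def info(labels):
    """info(t): entropy of the class labels at a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(labels, subsets):
    """Gain ratio for splitting a node's labels into the given subsets."""
    n = len(labels)
    info_x = sum(len(s) / n * info(s) for s in subsets)          # info_X(t)
    gain = info(labels) - info_x                                 # gain(X)
    split_info = -sum(len(s) / n * math.log2(len(s) / n)
                      for s in subsets if s)                     # split info(X)
    return gain / split_info if split_info else 0.0

def best_binary_split(xs, labels):
    """Try a binary split at each midpoint between consecutive distinct values
    of a continuous predictor X; return (threshold, gain_ratio) of the best."""
    best = (None, -1.0)
    vals = sorted(set(xs))
    for lo, hi in zip(vals, vals[1:]):
        thr = (lo + hi) / 2
        left = [y for x, y in zip(xs, labels) if x <= thr]
        right = [y for x, y in zip(xs, labels) if x > thr]
        gr = gain_ratio(labels, [left, right])
        if gr > best[1]:
            best = (thr, gr)
    return best

# A split at 5.0 separates the two classes perfectly:
print(best_binary_split([1, 2, 8, 9], ["a", "a", "b", "b"]))  # (5.0, 1.0)
```

For a categorical predictor, the same `gain_ratio` function would be applied to the subsets formed by grouping cases on each individual value of X.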
For a detailed description of an information gain ratio tree, see [Quinlan].