The formulas in this topic use these terms:
| Term | Description |
|---|---|
| t | Node |
| j | Learning class |
| J | Number of classes |
| s | Split |
| N(t) | Number of cases within node t |
| p(j|t) | Proportion of class j learning samples in node t |
| ϕ | Impurity function: a symmetric function that reaches its maximum at ϕ(1/J, 1/J, …, 1/J) and satisfies ϕ(1, 0, …, 0) = ϕ(0, 1, …, 0) = … = ϕ(0, 0, …, 1) = 0 |
| ti | Subnode i of node t |
| i(t) | Impurity measure of node t |
| tL | Left-split subnode of node t |
| tR | Right-split subnode of node t |
| X | Predictor variable |
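As a concrete reading of this notation, the following minimal sketch computes N(t) and p(j|t) from the class labels of the cases in a node (the helper name `proportions` is an illustrative assumption, not from the source):

```python
from collections import Counter

def proportions(node_labels):
    """Return p(j|t) for each class j, given the class labels of the
    N(t) cases in node t."""
    n_t = len(node_labels)                           # N(t)
    counts = Counter(node_labels)                    # class frequencies in t
    return {j: c / n_t for j, c in counts.items()}   # p(j|t)
```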
An information gain ratio tree splits categorical variables on each individual value, and continuous variables at one point in the ordered list of actual values (that is, a binary split at a particular value is introduced).
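To make this split-generation rule concrete, here is a minimal sketch; the helper name `candidate_splits` and the tuple representation of a split are illustrative assumptions:

```python
def candidate_splits(values, is_categorical):
    """Enumerate candidate splits for one predictor variable X.

    Categorical X: one multiway split with a branch per distinct value.
    Continuous X: one binary split ("<= v" vs. "> v") per value in the
    ordered list of actual values (the largest value is skipped, since
    it would leave the right branch empty).
    """
    if is_categorical:
        return [("multiway", sorted(set(values)))]
    points = sorted(set(values))
    return [("binary", v) for v in points[:-1]]
```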
The tree uses this procedure for splitting:
- Define info(t) at node t as the entropy: info(t) = −Σ_j p(j|t) log₂ p(j|t), where the sum runs over the J classes.
- Define the following, where node t is split into subnodes t_i by predictor variable X:
  - info_X(t) = Σ_i (N(t_i)/N(t)) info(t_i)
  - gain(X) = info(t) − info_X(t)
  - split info(X) = −Σ_i (N(t_i)/N(t)) log₂ (N(t_i)/N(t))
  - gain ratio(X) = gain(X) / split info(X)
- Use the attribute with the highest gain ratio to split the data, as shown in the sketch after this list.
- Repeat this procedure on each subset until the observations in a node are all of one class or a stopping criterion is true (for example, "each node must contain at least two observations").
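The sketch below illustrates the procedure for a single node: it computes info(t), evaluates the gain ratio of each proposed split, and selects the best one. It is a minimal illustration under assumed names (`info`, `gain_ratio`) and an assumed representation of a split as the list of label lists of its subnodes; it is not the source's implementation.

```python
import math

def info(labels):
    """Entropy of a node t: info(t) = -sum_j p(j|t) * log2 p(j|t)."""
    n = len(labels)
    return -sum(labels.count(j) / n * math.log2(labels.count(j) / n)
                for j in set(labels))

def gain_ratio(parent, subnodes):
    """gain(X) / split info(X) for a split of `parent` into subnodes t_i."""
    n = len(parent)
    subnodes = [t_i for t_i in subnodes if t_i]  # drop empty branches
    # info_X(t): subnode entropies weighted by N(t_i)/N(t).
    info_x = sum(len(t_i) / n * info(t_i) for t_i in subnodes)
    gain = info(parent) - info_x
    # split info(X): entropy of the proportions N(t_i)/N(t) themselves.
    split_info = -sum(len(t_i) / n * math.log2(len(t_i) / n)
                      for t_i in subnodes)
    return gain / split_info if split_info > 0 else 0.0

# Choose the split with the highest gain ratio on toy data:
parent = ["a", "a", "b", "b", "b", "c"]
split_1 = [["a", "a"], ["b", "b", "b", "c"]]
split_2 = [["a", "a", "b"], ["b", "b", "c"]]
best = max([split_1, split_2], key=lambda s: gain_ratio(parent, s))
```

Here `split_1`, which yields a pure left subnode, attains the higher gain ratio and is selected.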
For a detailed description of an information gain ratio tree, see [Quinlan].