TD_DecisionForest Usage Notes - Analytics Database

Database Analytic Functions

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, VMware
Product: Analytics Database
Release Number: 17.20
Published: June 2022
Product Category: Teradata Vantage™

A decision forest is a machine learning model that is composed of multiple decision trees. A decision tree is a hierarchical structure that makes decisions by recursively splitting the data into subsets based on the values of input variables.

The way a tree splits a node depends on the type of target variable:

Continuous target variable: reduction in variance.

  1. For each split, individually calculate the variance of each child node
  2. Calculate the variance of each split as the weighted average variance of child nodes
  3. Select the split with the lowest variance
  4. Repeat steps 1-3 until you achieve homogeneous nodes

The smaller the variance, the better the split.
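
As an illustration (a Python sketch, not TD_DecisionForest syntax; the feature values, target values, and candidate thresholds are made up), the following code scores each candidate split by the weighted average variance of its child nodes and keeps the split with the lowest value:

  import numpy as np

  def split_variance(y, mask):
      # Weighted average variance of the two child nodes produced by a boolean split mask.
      y_left, y_right = y[mask], y[~mask]
      if len(y_left) == 0 or len(y_right) == 0:
          return np.inf  # a split that leaves a child empty does not partition the data
      n = len(y)
      return (len(y_left) / n) * y_left.var() + (len(y_right) / n) * y_right.var()

  # Made-up data: one continuous input variable x and a continuous target y.
  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
  y = np.array([1.1, 0.9, 1.0, 3.2, 2.9, 3.1])

  # Candidate thresholds halfway between consecutive x values; keep the lowest-variance split.
  thresholds = (x[:-1] + x[1:]) / 2
  best = min(thresholds, key=lambda t: split_variance(y, x < t))
  print(best)  # 3.5, the threshold that separates the low and high target values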

Categorical target variable: information gain

  1. For each split, individually calculate the entropy of each child node
  2. Calculate the entropy of each split as the weighted average entropy of child nodes
  3. Select the split with the lowest entropy or highest information gain
  4. Repeat steps 1-3 until you achieve homogeneous nodes

The higher the information gain, the better the split.
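
Similarly, the following sketch (illustrative data and names only) computes the entropy of each child node, the weighted average child entropy, and the resulting information gain of a candidate split for a categorical target:

  import numpy as np

  def entropy(labels):
      # Shannon entropy of a label array, in bits.
      _, counts = np.unique(labels, return_counts=True)
      p = counts / counts.sum()
      return -np.sum(p * np.log2(p))

  def information_gain(y, mask):
      # Parent entropy minus the weighted average entropy of the two child nodes.
      y_left, y_right = y[mask], y[~mask]
      n = len(y)
      child_entropy = (len(y_left) / n) * entropy(y_left) + (len(y_right) / n) * entropy(y_right)
      return entropy(y) - child_entropy

  # Made-up data: one input variable x and a binary class label y.
  x = np.array([1, 2, 3, 4, 5, 6])
  y = np.array(['a', 'a', 'a', 'b', 'b', 'b'])

  # This split separates the classes perfectly, so the gain equals the parent entropy (1 bit).
  print(information_gain(y, x < 3.5))  # 1.0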

The decision forest algorithm builds an ensemble of decision trees, where the algorithm trains each tree on a randomly sampled subset of the training data and a randomly selected subset of the variables. This randomness helps to prevent overfitting and improve the generalization of the model.
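
The sketch below illustrates this idea in Python. It uses scikit-learn's DecisionTreeRegressor as a stand-in per-tree learner; the synthetic data, hyperparameter values, and sampling code are assumptions for illustration, not the TD_DecisionForest implementation:

  import numpy as np
  from sklearn.tree import DecisionTreeRegressor  # stand-in per-tree learner for this sketch

  rng = np.random.default_rng(0)
  X = rng.normal(size=(100, 5))                   # 100 rows, 5 input variables (synthetic)
  y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=100)

  n_trees, n_vars = 10, 3                         # illustrative hyperparameter choices
  forest = []
  for _ in range(n_trees):
      rows = rng.integers(0, len(X), size=len(X))                 # bootstrap sample of the rows
      cols = rng.choice(X.shape[1], size=n_vars, replace=False)   # random subset of the variables
      tree = DecisionTreeRegressor(max_depth=4).fit(X[rows][:, cols], y[rows])
      forest.append((cols, tree))                 # each tree sees different rows and variables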

The decision forest algorithm makes the final prediction by aggregating the predictions of all the trees in the forest. You can do this in different ways, such as taking the average prediction for regression problems or taking the majority vote for classification problems.
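
For example, given hypothetical per-tree predictions for one input row, the aggregation looks like this:

  from collections import Counter
  import numpy as np

  regression_preds = [2.0, 2.4, 2.2]                 # predictions of 3 regression trees
  classification_preds = ['a', 'b', 'a', 'a', 'b']   # predictions of 5 classification trees

  # Regression: the forest prediction is the average of the tree predictions.
  print(np.mean(regression_preds))                   # approximately 2.2

  # Classification: the forest prediction is the majority vote.
  print(Counter(classification_preds).most_common(1)[0][0])  # 'a'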

You can represent the decision function of a decision forest as follows:

  f(x) = ( f1(x) + f2(x) + ... + fT(x) ) / T

where:

  • T is the number of trees in the forest
  • fi(x) is the prediction of the ith tree for the input data point x
  • f(x) is the final prediction of the forest for the input data point x

You can represent each tree in the forest by a set of IF-THEN rules that recursively split the input space into subsets based on the values of the input variables.

You can represent the decision rule for a node j in the ith tree as follows:

  IF xi,k < tj,k THEN route the input to the left child of node j, ELSE route it to the right child of node j

where:

  • xi,k is the value of the kth variable of the input data point
  • tj,k is the threshold value for the kth variable at node j
  • The left and right children of node j correspond to the subsets of the input space where xi,k < tj,k and xi,k >= tj,k, respectively

You can represent the prediction function for a leaf node j in the ith tree as follows:

  fi(x) = cj, for any input data point x that reaches leaf node j

where cj is the class label or target value associated with leaf node j.
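
Putting the node rule and the leaf prediction together, one tree can be sketched as nested IF-THEN rules; the dictionary layout, thresholds, and labels below are illustrative assumptions:

  def predict_tree(node, x):
      # Route the input x down the tree using the rule x[k] < t at each internal node.
      if 'value' in node:                    # leaf node: return its class label or target value cj
          return node['value']
      if x[node['k']] < node['t']:           # xk < tj,k  -> left child
          return predict_tree(node['left'], x)
      return predict_tree(node['right'], x)  # xk >= tj,k -> right child

  # A small illustrative tree:
  # IF x[0] < 3.5 THEN predict 'low'
  # ELSE IF x[1] < 1.0 THEN predict 'mid' ELSE predict 'high'
  tree = {
      'k': 0, 't': 3.5,
      'left':  {'value': 'low'},
      'right': {'k': 1, 't': 1.0,
                'left':  {'value': 'mid'},
                'right': {'value': 'high'}},
  }

  print(predict_tree(tree, [2.0, 5.0]))  # 'low'
  print(predict_tree(tree, [4.0, 0.5]))  # 'mid'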

Some of the advantages of decision forests include:
  • Robustness and Generalization: Decision forests can be less prone to overfitting and can generalize well to new data. This is because they combine multiple decision trees, which can help to reduce variance and improve the stability of the predictions.
  • Handles high-dimensional data: Decision forests can handle high-dimensional data, which is common in real-world applications such as image recognition, natural language processing, and bioinformatics.
  • Scalability: Decision forests are relatively fast to train and can handle large datasets, making them suitable for big data applications.
  • Nonlinear relationships: Decision forests can model nonlinear relationships between the input features and the target variable, which can be useful when there are complex interactions and dependencies in the data.
  • Parallelism: Decision forests can be trained in parallel, which can significantly reduce the training time and improve efficiency.
Some of the disadvantages of decision forests include:
  • Complexity and Interpretability: Decision forests can be complex and difficult to interpret and explain due to the large number of trees and the complexity of their interactions. This can be a disadvantage in situations where interpretability is important, such as in medical diagnosis or legal decision making.
  • Hyperparameter Tuning: Decision forests have several hyperparameters that you need to tune, such as the number of trees, the depth of the trees, and the number of variables to consider at each split. Tuning these hyperparameters can be time-consuming and requires domain expertise (an illustrative tuning sketch follows this list).
  • Bias-Variance Trade-off: Decision forests can still be prone to overfitting if the individual trees are too deep, or to underfitting if they are too shallow. Finding the right balance between bias and variance can be challenging and requires careful tuning of the hyperparameters.
  • Memory Usage: Decision forests can require significant memory to store and maintain the multiple trees and their associated data structures. This can be a limitation in memory constrained environments such as mobile devices or embedded systems.
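
As an illustration of the tuning effort mentioned above, the following sketch uses scikit-learn's RandomForestClassifier and GridSearchCV as an analogue; the parameter names are scikit-learn's, not TD_DecisionForest arguments, and the data is synthetic:

  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import GridSearchCV

  X, y = make_classification(n_samples=300, n_features=10, random_state=0)

  # Search over the number of trees, the tree depth, and the variables considered at each split.
  grid = GridSearchCV(
      RandomForestClassifier(random_state=0),
      param_grid={'n_estimators': [50, 100], 'max_depth': [4, 8], 'max_features': ['sqrt', 0.5]},
      cv=3,
  )
  grid.fit(X, y)
  print(grid.best_params_)  # the best combination found by cross-validation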

In summary, a decision forest is a machine learning model that combines multiple decision trees to make predictions. The decision function of the forest is the average of the predictions of all the trees (or a majority vote for classification), and each tree is represented by a set of IF-THEN rules that recursively split the input space into subsets based on the values of the input variables.