In the original Random Forest algorithm developed by Leo Breiman, each tree grows as follows:
- If the number of cases in the training set is N, sample N cases at random, with replacement, from the original data. This sample becomes the training set for growing the tree.
- If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random from the M and the best split on those m variables is used to split the node. The value of m is held constant while the forest is grown.
- Each tree is grown to the largest extent possible. There is no pruning.
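The first two steps above can be sketched in a few lines of Python. This is only an illustration, not Breiman's reference code or any Aster function: the helper names (`bootstrap_sample`, `gini`, `best_split`) are hypothetical, and the Gini impurity split criterion and the "less than or equal to threshold" split form are assumptions chosen for the example.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) cases uniformly at random, with replacement."""
    n = len(rows)
    return [rows[rng.randrange(n)] for _ in range(n)]


def gini(labels):
    """Gini impurity of a list of class labels (assumed split criterion)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels, m, rng):
    """Pick m of the M variables at random and return the best split on them.

    Returns (weighted_impurity, variable_index, threshold), or None if no
    candidate variable can separate the rows into two non-empty groups.
    """
    num_variables = len(rows[0])
    candidates = rng.sample(range(num_variables), m)  # m distinct variables
    best = None
    for var in candidates:
        for threshold in sorted({row[var] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[var] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[var] > threshold]
            if not left or not right:
                continue  # not a real split
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, var, threshold)
    return best
```

Growing a full tree would apply `best_split` recursively to each child node, drawing a fresh set of m variables every time, until no further split is possible.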
Teradata Aster’s implementation of the Random Forest algorithm differs from Leo Breiman’s algorithm in the following ways:
- The Forest_Drive function lets you specify m using the optional argument Mtry. If you do not specify Mtry, the function considers all M variables at every split when training each decision tree, which is equivalent to bootstrap aggregating (bagging).
- The Forest_Drive function randomly assigns rows to individual vworkers. Each vworker creates trees with a bootstrapping technique, using only its local data rather than the full training set.
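The second difference can be pictured with the following Python sketch. It does not reproduce Forest_Drive internals: the function names, the uniform row assignment, and the `trees_per_worker` parameter are all hypothetical and serve only to show each worker bootstrapping from its own partition.

```python
import random


def assign_rows_to_workers(rows, num_workers, rng):
    """Randomly assign every row to exactly one worker partition."""
    partitions = [[] for _ in range(num_workers)]
    for row in rows:
        partitions[rng.randrange(num_workers)].append(row)
    return partitions


def local_bootstrap_samples(partition, trees_per_worker, rng):
    """Draw one bootstrap sample per tree, using only this worker's rows."""
    n = len(partition)
    return [
        [partition[rng.randrange(n)] for _ in range(n)]
        for _ in range(trees_per_worker)
    ]


rng = random.Random(42)
rows = list(range(100))  # stand-in for training rows
partitions = assign_rows_to_workers(rows, num_workers=4, rng=rng)

# Each worker builds its trees from samples of its own partition only.
samples_by_worker = [
    local_bootstrap_samples(part, trees_per_worker=3, rng=rng)
    for part in partitions
]
```

In this sketch, each bootstrap sample has the size of the worker's partition, not of the whole data set, so a single tree sees at most the rows held by one worker.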