Bagging
One way to address the high variance of a single decision tree is to use bootstrapping techniques to build multiple trees from the same data set. By sampling with replacement, you can create multiple data sets from the original training data set. You can then build a decision tree from each data set and combine their results to make a prediction for any observation. This technique is called bagging (bootstrap aggregating). (For more information about bagging, see James, Witten, Hastie, and Tibshirani, An Introduction to Statistical Learning with Applications in R, available at http://www-bcf.usc.edu/~gareth/ISL/.)
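The bagging procedure can be sketched in a few lines of Python. This is an illustrative sketch only, not the Aster implementation; the function names are invented for this example, and a simple nearest-neighbor averager stands in for a decision tree so that the bootstrap-and-aggregate logic stays visible:

```python
import random
import statistics

def bootstrap_sample(data, rng):
    """Draw len(data) observations with replacement."""
    return [rng.choice(data) for _ in data]

def bagged_predict(train, x, n_models, fit_predict, rng):
    """Train one model per bootstrap sample and average their predictions."""
    preds = [fit_predict(bootstrap_sample(train, rng), x)
             for _ in range(n_models)]
    return statistics.mean(preds)

# Stand-in for a decision tree: predict the mean response of the
# k nearest training points. The aggregation logic is what matters here.
def toy_fit_predict(sample, x, k=3):
    nearest = sorted(sample, key=lambda p: abs(p[0] - x))[:k]
    return statistics.mean(y for _, y in nearest)

rng = random.Random(42)
train = [(i, 2.0 * i) for i in range(20)]  # toy data: y = 2x
print(bagged_predict(train, 5.0, 25, toy_fit_predict, rng))
```

Because each model sees a slightly different bootstrap sample, averaging their predictions reduces the variance that any single model would exhibit.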
Random Forest Algorithm
If a data set has many highly correlated variables, or a few important variables, bagging creates many highly correlated trees, and the model may still have a large variance. The random forest algorithm addresses this problem by trying to create a diverse, less correlated set of trees from the training data set. Instead of considering all predictors at each split and selecting the best one, the algorithm randomly selects a different subset of predictors to consider at each split and selects the best predictor from only this subset. Because the algorithm is forced to use a more diverse set of predictors, it creates a less correlated set of trees.
- The higher the correlation between trees, the higher the error rate.
- The stronger the individual trees, the lower the overall error rate.
Aster Analytics Implementation of Random Forest Algorithm
The SQL-MapReduce random forest functions implement an algorithm for decision tree training and prediction based on Classification and Regression Trees, by Breiman, Friedman, Olshen and Stone (1984).
In the original random forest algorithm developed by Leo Breiman, each tree grows as follows:
- If the number of cases in the training set is N, sample N cases at random, with replacement, from the original data. This sample becomes the training set for growing the tree.
- If there are M input variables, a number m << M is specified such that, at each node, m variables are selected at random from the M and the best split on those m variables is used to split the node. The value of m is held constant while the forest is grown.
- Each tree is grown to the largest extent possible. There is no pruning.
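The three steps above can be combined into one compact sketch. The following is an illustrative Python implementation for a continuous (regression) response under stated assumptions, not the SQL-MapReduce code: the helper names are invented, and sum of squared errors is used as the split score:

```python
import random
import statistics

def sse(rows):
    """Sum of squared errors of the responses around their mean."""
    m = statistics.mean(y for _, y in rows)
    return sum((y - m) ** 2 for _, y in rows)

def grow_tree(rows, predictors, mtry, rng):
    """Grow one regression tree to full depth, with no pruning (step 3).
    rows is a list of (features_dict, response) pairs; at each node,
    mtry predictors are sampled at random (step 2)."""
    if len(rows) < 2 or len({y for _, y in rows}) == 1:
        return statistics.mean(y for _, y in rows)      # leaf node
    best = None
    for p in rng.sample(predictors, mtry):              # random subset
        for v in sorted({f[p] for f, _ in rows})[:-1]:  # candidate cut points
            left = [r for r in rows if r[0][p] <= v]
            right = [r for r in rows if r[0][p] > v]
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, p, v, left, right)
    if best is None:                                    # no usable split
        return statistics.mean(y for _, y in rows)
    _, p, v, left, right = best
    return (p, v,
            grow_tree(left, predictors, mtry, rng),
            grow_tree(right, predictors, mtry, rng))

def predict(tree, features):
    """Walk the tree until a leaf (a float) is reached."""
    while isinstance(tree, tuple):
        p, v, left, right = tree
        tree = left if features[p] <= v else right
    return tree

def grow_forest(rows, predictors, mtry, n_trees, rng):
    """Step 1: each tree is trained on a bootstrap sample of the data."""
    return [grow_tree([rng.choice(rows) for _ in rows], predictors, mtry, rng)
            for _ in range(n_trees)]

rng = random.Random(0)
rows = [({"x": float(i)}, 1.0 if i >= 5 else 0.0) for i in range(10)]
forest = grow_forest(rows, ["x"], 1, 7, rng)
print(statistics.mean(predict(t, {"x": 8.0}) for t in forest))
```

Note that a classification response would use a different split score (for example, Gini impurity) and a majority vote at the leaves, but the bootstrap-then-random-subset structure is the same.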
The Aster Analytics implementation of the random forest algorithm differs from Leo Breiman's algorithm in the following ways:
- The Forest_Drive function lets you specify m using the optional argument Mtry. If you do not specify Mtry, the function uses all variables to train the decision tree (equivalent to bootstrap aggregating or bagging).
- The Forest_Drive function randomly assigns rows to individual vworkers. Each vworker creates trees with a bootstrapping technique, using only its local data.
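The distributed behavior described in the second bullet can be sketched as follows. This is a conceptual Python illustration of the row-to-vworker assignment and per-worker local bootstrapping; the helper names are hypothetical, since Forest_Drive performs this distribution internally:

```python
import random

def assign_to_vworkers(rows, n_vworkers, rng):
    """Randomly assign each row to one vworker (hypothetical helper)."""
    partitions = [[] for _ in range(n_vworkers)]
    for row in rows:
        partitions[rng.randrange(n_vworkers)].append(row)
    return partitions

def train_local_forests(rows, n_vworkers, trees_per_worker, fit, rng):
    """Each vworker bootstraps only from its own local partition."""
    forest = []
    for local in assign_to_vworkers(rows, n_vworkers, rng):
        for _ in range(trees_per_worker):
            sample = [rng.choice(local) for _ in local]  # local bootstrap
            forest.append(fit(sample))
    return forest
```

Because each vworker samples only from its local rows, every tree sees data drawn from one slice of the training set, which is how the implementation differs from single-node bagging over the full data set.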
The SQL-MapReduce random forest functions create a decision model that predicts an outcome based on a set of input variables. When a tree is constructed, the splitting of branches stops when any stopping criterion is met.
The SQL-MapReduce random forest functions support these predictive models:
|Model|Description|
|---|---|
|Regression problems (continuous response variable)|This model is used when the predicted outcome from the data is a real number. For example, the dollar amount of insurance claims for a year or the expected GPA of a college student.|
|Multiple-class classification (classification tree analysis)|This model is used to classify data by predicting to which of the provided classes the data belongs. For example, whether the input data is political news, economic news, or sports news.|
|Binary classification (binary response variable)|This model is used to make predictions when the outcome can be represented as a binary value (true/false, yes/no, 0/1). For example, whether the input insurance claim description data represents an accident.|
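The model types above differ in how per-tree predictions are combined into a final answer. A minimal sketch (the function name and task labels are assumptions for illustration): regression averages the trees' outputs, while binary and multiple-class classification take a majority vote.

```python
import statistics
from collections import Counter

def aggregate(predictions, task):
    """Combine per-tree predictions: average for regression,
    majority vote for (binary or multiclass) classification."""
    if task == "regression":
        return statistics.mean(predictions)
    return Counter(predictions).most_common(1)[0][0]

print(aggregate([1.0, 2.0, 3.0], "regression"))              # → 2.0
print(aggregate(["spam", "ham", "spam"], "classification"))  # → spam
```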
For more detailed information about the Aster Analytics implementation of the random forest algorithm, including detailed examples, see the Teradata Aster Orange Book "Bagging and Random Forest in Teradata Aster Analytics," available from Teradata.