Training a decision tree model is a relatively automatic procedure, but for best performance, be aware of the following:
- The DecisionForest function computes default values for parameters that are important for the performance of the model. If necessary, you can set these parameters yourself to improve function performance:
  - NumTrees: By default, the DecisionForest function builds as many trees as are needed for the total number of sampled points to equal the size of the original input data set. For example, if your input data set contains one billion rows, and the function determines that each tree must be trained on a sample of one million rows, the function trains 1,000 trees. Depending on your data set, you might want more or fewer trees. Generally, a model of 300 decision trees works well for most prediction tasks. If your data set is small, specify a value for NumTrees that is a multiple of the number of vworkers in your cluster.
  - TreeSize: Each decision tree is built on a sample of the original data set. By default, the function computes the sample size such that the decision tree algorithm does not run out of memory. With the TreeSize parameter, you can specify how many rows each decision tree is to contain. Setting this parameter too high can result in Out of Memory errors.
- If a categorical variable has more than 20 possible values, consolidate some of the categories to improve runtime performance.
- If the trees in the output model table are too large, or the model has too many trees, the DecisionForestPredict_MLE function can fail and output NaNs as predictions (NaN means "not a number").
- Each vworker trains decision trees using a subsample of the data on its partition. Significant data skew across partitions can produce unexpected results.
- For better efficiency when running the DecisionForest function, distribute the training data in the input table randomly.
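
The tree-count arithmetic and the category consolidation described above can be sketched in a few lines of Python. This is an illustration only, not part of the DecisionForest function: the helper names (`default_num_trees`, `round_up_to_multiple`, `consolidate_categories`) are hypothetical, rounding up is an assumption about how a fractional tree count would be handled, and merging the least frequent categories into a single "OTHER" label is just one possible consolidation strategy.

```python
import math
from collections import Counter


def default_num_trees(total_rows, rows_per_tree):
    """Number of trees whose samples together cover the input size.

    Assumption: a fractional result is rounded up.
    """
    return math.ceil(total_rows / rows_per_tree)


def round_up_to_multiple(num_trees, num_vworkers):
    """Smallest multiple of num_vworkers that is >= num_trees."""
    return math.ceil(num_trees / num_vworkers) * num_vworkers


def consolidate_categories(values, max_categories=20, other_label="OTHER"):
    """Limit a categorical column to at most max_categories distinct values.

    Keeps the (max_categories - 1) most frequent values and maps every
    remaining value to other_label (a hypothetical placeholder name).
    """
    counts = Counter(values)
    if len(counts) <= max_categories:
        return list(values)
    keep = {value for value, _ in counts.most_common(max_categories - 1)}
    return [value if value in keep else other_label for value in values]


# The example from the text: one billion rows, one million rows per tree.
print(default_num_trees(1_000_000_000, 1_000_000))

# A target of 300 trees on a hypothetical cluster with 16 vworkers.
print(round_up_to_multiple(300, 16))
```

In practice, the consolidation would be applied to the training table before calling DecisionForest, and the rounded tree count would be passed as the NumTrees value.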