Training a decision tree model is a relatively automatic procedure, but for best performance, be aware of the following:
- The Forest_Drive function computes default values for parameters that are important to the performance of the model. If necessary, you can set these parameters yourself to improve performance:
NumTrees: By default, the Forest_Drive function builds enough trees that the total number of sampled rows equals the size of the original input data set. For example, if your input data set contains one billion rows, and the function determines that each tree must be trained on a sample of one million rows, the function trains 1,000 trees. Depending on your data set, you might want more or fewer trees; generally, a model of 300 decision trees works well for most prediction tasks. If your data set is small, specify a NumTrees value that is a multiple of the number of vworkers in your cluster.
TreeSize: Each decision tree is built on a sample of the original data set. By default, the function computes the sample size such that the decision tree algorithm does not run out of memory. With the TreeSize parameter, you can specify how many rows each decision tree is trained on. Setting this parameter too high can cause out-of-memory errors.
- You can check the progress of the Forest_Drive and Forest_Predict functions in the AMC. Log in to the AMC and click the Processes tab. If a function is still running, you see a process with its name. Click the process name, and then click the View Logs link. The logs show the stdout of the process, which helps you check progress and diagnose potential problems.
- If a categorical variable has more than 20 possible values, consolidate some of the categories to improve runtime performance.
- If the trees in the output model table are too large, or the model has too many trees, the Forest_Predict function can fail, outputting NaN ("not a number") values as predictions. If the Forest_Predict logs in the AMC show NaNs, try one of the following:
- Train fewer decision trees.
- Decrease the MaxDepth parameter in the Forest_Drive function.
- Reduce the cardinality of your categorical input variables.
- Each vworker trains its decision trees on a subsample of the data on its partition, so significant data skew across partitions can produce unexpected results.
- For better efficiency when running the Forest_Drive function, distribute the training data randomly across the rows of the input table.
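The default tree-count behavior and the small-data-set NumTrees recommendation described above can be sketched as follows. This is an illustrative Python sketch of the sizing arithmetic only, not part of the Forest_Drive API; the function names here are hypothetical.

```python
import math

def default_num_trees(total_rows, rows_per_tree):
    """Default behavior described above: build enough trees that the
    total number of sampled rows equals the size of the input data set."""
    return math.ceil(total_rows / rows_per_tree)

def num_trees_for_small_data(desired_trees, num_vworkers):
    """For a small data set, choose a NumTrees value that is a multiple
    of the number of vworkers, so trees divide evenly across them."""
    return math.ceil(desired_trees / num_vworkers) * num_vworkers

# Example from the text: one billion rows, one million rows per tree.
print(default_num_trees(1_000_000_000, 1_000_000))  # 1000

# A small data set on a hypothetical 16-vworker cluster:
# round the general-purpose choice of 300 trees up to 304 (19 per vworker).
print(num_trees_for_small_data(300, 16))  # 304
```

The rounding step matters because each vworker builds its share of the trees locally; a NumTrees value that divides evenly keeps the work balanced across vworkers.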