Decision forest functions create predictive models based on decision-tree training and prediction algorithms.
The TD_DecisionForest function is an ensemble algorithm used for classification and regression predictive modeling problems. It is an extension of bootstrap aggregation (bagging) of decision trees. The function supports regression, binary, and multi-class classification.
Constructing a decision tree typically involves evaluating every input feature in the data to select a split point. The function reduces the features to a random subset that can be considered at each split point. The algorithm can force each decision tree in the forest to be very different to improve prediction accuracy.
Each node in the tree represents a decision based on the value of a single feature, and the tree is grown by iteratively splitting the data into smaller and smaller subsets based on these decisions. At each level of the tree, the algorithm selects the best variable to split the data, and it repeats this process until the stopping criterion is met.
- All input features are numeric. Convert the categorical columns to numerical columns as a preprocessing step.
- For classification, class labels (ResponseColumn values) can only be integers. A maximum of 500 classes is supported for classification.
- Any observation with a missing value in an input column is skipped and not used for training. You can use the TD_SimpleImpute function to assign missing values.
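The preprocessing requirements above can be sketched in plain Python. This is an illustration only, not the Teradata implementation: the column names are made up, the encoding is a simple stand-in for whatever conversion you perform in SQL, and the mean imputation only mimics the general idea behind a function like TD_SimpleImpute.

```python
# Hypothetical preprocessing sketch: TD_DecisionForest expects numeric
# features, and rows with missing values in input columns are skipped
# during training unless you impute them first.

def label_encode(values):
    """Map each distinct category to an integer code (a simple
    stand-in for the categorical-to-numeric conversion step)."""
    codes = {}
    out = []
    for v in values:
        if v not in codes:
            codes[v] = len(codes)
        out.append(codes[v])
    return out

def mean_impute(values):
    """Replace None with the column mean, mimicking the kind of
    fill-in a function like TD_SimpleImpute can perform."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

color = ["red", "blue", "red", "green"]   # categorical -> encode
height = [1.5, None, 2.0, 2.5]            # missing -> impute

print(label_encode(color))   # [0, 1, 0, 2]
print(mean_impute(height))   # [1.5, 2.0, 2.0, 2.5]
```

Without the imputation step, the second row would be dropped from training because of its missing height value.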
- When you specify the NumTrees value, the number of trees built by the function is adjusted as:
Number_of_trees = Num_AMPs_with_data * (NumTrees/Num_AMPs_with_data)
- To find the Num_AMPs_with_data value, use the SQL command SELECT HASHAMP()+1;.
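The adjustment formula above can be sketched as follows. The sketch assumes the division is integer division, which the formula implies (with exact division the expression would reduce to NumTrees and be a no-op); the numbers are illustrative.

```python
# Sketch of the NumTrees adjustment:
# Number_of_trees = Num_AMPs_with_data * (NumTrees / Num_AMPs_with_data)
# Assumption: the division is integer division, so the total is rounded
# down to a multiple of the number of AMPs that hold data.

def adjusted_tree_count(num_trees, num_amps_with_data):
    # Each AMP builds the same whole number of trees.
    return num_amps_with_data * (num_trees // num_amps_with_data)

print(adjusted_tree_count(100, 8))  # 96: each of 8 AMPs builds 12 trees
```

In this illustration, requesting NumTrees=100 on a system where 8 AMPs hold data yields 96 trees.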
- When you do not specify the NumTrees value, the number of trees built by an AMP is calculated as:
Number_of_AMP_trees = CoverageFactor * Num_Rows_AMP / TreeSize
The number of trees built by the function is the sum of Number_of_AMP_trees.
When a data set is small, best practice is to distribute the data to one AMP. To do this, create an identifier column as the primary index, and use the same value for each row.
- The TreeSize value determines the sample size used to build a tree in the forest and depends on the memory available to the AMP. By default, this value is computed internally by the function. The function reserves approximately 40% of its available memory to store the input sample, while the rest is used to build the tree.
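The default tree-count calculation can be sketched as below. The CoverageFactor, per-AMP row counts, and TreeSize values are all illustrative, and rounding the per-AMP result down to a whole number of trees is an assumption, not documented behavior.

```python
# Sketch of the default tree count when NumTrees is not specified:
# Number_of_AMP_trees = CoverageFactor * Num_Rows_AMP / TreeSize

def amp_tree_count(coverage_factor, num_rows_amp, tree_size):
    # Assumption: the result is truncated to a whole number of trees.
    return int(coverage_factor * num_rows_amp / tree_size)

# Total trees built by the function = sum of Number_of_AMP_trees
# across all AMPs with data (hypothetical row distribution below).
rows_per_amp = [50_000, 48_000, 52_000]
total = sum(amp_tree_count(1.0, n, 10_000) for n in rows_per_amp)
print(total)  # 5 + 4 + 5 = 14
```

This also shows why skewed data distribution matters: AMPs with fewer rows contribute fewer trees to the forest.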
The function uses a training dataset to create a predictive model. The TD_DecisionForestPredict function uses the model created by the TD_DecisionForest function for making predictions. See TD_DecisionForestPredict.
- Convert the categorical columns to numerical columns.
- Determine the parameters to use with the function, such as tree depth, model type, and number of trees.
- Use TD_DecisionForest on a training dataset to create a predictive model.
- Use TD_DecisionForestPredict function on the model created by the TD_DecisionForest function to make predictions.