Decision forest functions create predictive models based on the algorithm for decision tree training and prediction.
The TD_DecisionForest function is an ensemble algorithm used for classification and regression predictive modeling problems. It is an extension of bootstrap aggregation (bagging) of decision trees. The function supports regression, binary classification, and multiclass classification.
Constructing a decision tree involves evaluating the values of each input variable in the data to select a split point. The function restricts the variables that can be considered at each split point to a random subset. In this way, the algorithm can force each decision tree in the forest to be different, which improves prediction accuracy.
Each node in the tree represents a decision based on the value of a single variable, and the tree is grown by iteratively splitting the data into smaller and smaller subsets based on these decisions. At each level of the tree, the algorithm searches for the best variable on which to split the data, and it repeats this search level by level until the stopping criterion is met.
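The split selection described above can be illustrated with a minimal Python sketch. It is not the TD_DecisionForest implementation: the Gini impurity criterion, the exhaustive threshold search, and the names gini, best_split, and num_candidate_features are all assumptions made for illustration, since the source does not state which split criterion the function uses.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (assumed criterion)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(rows, labels, num_candidate_features, rng):
    """Pick the best (feature, threshold) among a random subset of features.

    rows: list of numeric feature lists; labels: list of integer class labels.
    Returns (weighted_impurity, feature_index, threshold), or None when no
    candidate feature separates the data into two non-empty parts.
    """
    num_features = len(rows[0])
    # Only a random subset of the variables is considered at this split point.
    candidates = rng.sample(range(num_features), num_candidate_features)
    best = None
    for f in candidates:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # not a real split
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best


rows = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0], [4.0, 5.0]]
labels = [0, 0, 1, 1]
print(best_split(rows, labels, 2, random.Random(0)))
```

In this toy data set the second variable is constant, so only the first variable can separate the classes. Drawing a different random subset for each split is what makes the trees in a forest differ from one another.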
- All input variables must be numeric. Convert categorical columns to numerical columns as a preprocessing step.
- For classification, class labels (ResponseColumn values) can only be integers. The function supports a maximum of 500 classes.
- The function skips any observation with a missing value in an input column, so such observations are not used for training. Use the TD_SimpleImpute function to fill in missing values.
- When you specify the NumTrees value, TD_DecisionForest adjusts the number of trees built as:
Number_of_trees = Num_AMPs_with_data * (NumTrees/Num_AMPs_with_data)
- For the Num_AMPs_with_data value, use the SQL command SELECT HASHAMP()+1;.
- When you do not specify the NumTrees value, TD_DecisionForest calculates the number of trees built by an AMP as:
Number_of_AMP_trees = CoverageFactor * Num_Rows_AMP / TreeSize
The number of trees built by the function is the sum of the Number_of_AMP_trees values across all AMPs.
When a data set is small, best practice is to distribute the data to one AMP. To do this, create an identifier column as a primary index, and use the same value for each row.
- The TreeSize value determines the sample size used to build a tree in the forest and depends on the memory available to the AMP. By default, TD_DecisionForest computes this value internally. TD_DecisionForest reserves approximately 40% of its available memory to store the input sample and uses the rest to build the tree.
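The two tree-count formulas above can be worked through with a short Python sketch. It rests on assumptions the source does not state: the division in the Number_of_trees formula is treated as truncating integer division (with real division the formula would simply return NumTrees), and the per-AMP result is left unrounded. The function names are hypothetical.

```python
def adjusted_num_trees(num_trees, num_amps_with_data):
    """Number_of_trees = Num_AMPs_with_data * (NumTrees / Num_AMPs_with_data).

    Assumption: the inner division truncates to an integer.
    """
    if num_amps_with_data <= 0:
        raise ValueError("num_amps_with_data must be positive")
    return num_amps_with_data * (num_trees // num_amps_with_data)


def trees_for_amp(coverage_factor, num_rows_amp, tree_size):
    """Number_of_AMP_trees = CoverageFactor * Num_Rows_AMP / TreeSize.

    Assumption: no rounding is applied here; the source does not say
    how fractional results are handled.
    """
    return coverage_factor * num_rows_amp / tree_size


def total_trees(coverage_factor, rows_per_amp, tree_size):
    """Sum of Number_of_AMP_trees over all AMPs that hold data."""
    return sum(trees_for_amp(coverage_factor, rows, tree_size) for rows in rows_per_amp)


# NumTrees specified: 10 requested trees on 4 AMPs with data.
print(adjusted_num_trees(10, 4))

# NumTrees not specified: two AMPs holding 1000 and 3000 rows.
print(total_trees(1.0, [1000, 3000], 500))
```

Under the integer-division assumption, the first formula yields a multiple of the number of AMPs with data, so a request for 10 trees on 4 AMPs gives 8 trees.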
TD_DecisionForest uses a training dataset to create a predictive model. The TD_DecisionForestPredict function uses the model created by the TD_DecisionForest function to make predictions. See TD_DecisionForestPredict.
- Convert the categorical columns to numerical columns.
- Determine the parameters to use with the function, such as tree depth, model type, and number of trees.
- Use TD_DecisionForest on a training dataset to create a predictive model.
- Use the TD_DecisionForestPredict function with the model created by the TD_DecisionForest function to make predictions.