TD_DecisionForest

Teradata® VantageCloud Lake

Deployment: VantageCloud
Edition: Lake
Product: Teradata Vantage
Published: January 2023
Language: English (United States)
Last Update: 2024-04-03

Decision forest functions create predictive models based on the algorithm for decision-tree training and prediction.

The TD_DecisionForest function is an ensemble algorithm used for classification and regression predictive modeling problems. It is an extension of bootstrap aggregation (bagging) of decision trees. The function supports regression, binary classification, and multi-class classification.

Constructing a decision tree typically involves evaluating the values of each input feature in the data to select a split point. The function restricts the features considered at each split point to a random subset. This forces the decision trees in the forest to differ from one another, which can improve prediction accuracy.

Each node in the tree represents a decision based on the value of a single feature, and the tree is grown by iteratively splitting the data into smaller and smaller subsets based on these decisions. At a given level of the tree, the algorithm searches for the best variable on which to split the data, and it repeats this search at each level until the stopping criterion is met.
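
The split search described above can be illustrated with a minimal Python sketch. It is not the function's internal implementation: the Gini impurity criterion, the helper names (gini, best_split), and the toy data are all assumptions chosen for illustration. The sketch only shows the general idea of picking the best split point among a random subset of features.

```python
import random

def gini(labels):
    """Gini impurity of a list of integer class labels (assumed criterion)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(rows, labels, n_candidates, rng):
    """Return (weighted impurity, feature index, threshold) for the best
    split found among a random subset of n_candidates features."""
    n_features = len(rows[0])
    candidates = rng.sample(range(n_features), n_candidates)
    best = None
    for f in candidates:
        # Try every distinct value of the feature as a threshold.
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must send rows to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best

# Hypothetical toy data: three numeric features, two integer classes.
rows = [[1.0, 7.0, 0.2], [2.0, 3.0, 0.4], [8.0, 5.0, 0.1], [9.0, 1.0, 0.3]]
labels = [0, 0, 1, 1]
split = best_split(rows, labels, n_candidates=3, rng=random.Random(0))
print(split)
```

With n_candidates smaller than the number of features, different trees see different candidate features at each split, which is what makes the trees in a forest differ from one another.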

Consider the following points:
  • All input features must be numeric. Convert categorical columns to numerical columns as a preprocessing step.
  • For classification, class labels (ResponseColumn values) can only be integers. A maximum of 500 classes is supported for classification.
  • Any observation with a missing value in an input column is skipped and not used for training. You can use the TD_SimpleImpute function to fill in missing values.
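
The three constraints above can be made concrete with a small Python sketch. It runs outside the database and does not use any Teradata function; the helper names (encode_categories, usable_rows, check_labels) and the sample values are hypothetical. It only illustrates what "numeric features", "integer labels, at most 500 classes", and "rows with missing values are skipped" mean for a dataset.

```python
def encode_categories(values):
    """Map each distinct category to an integer code, in order of first
    appearance. Missing values (None) stay missing."""
    codes = {}
    encoded = []
    for v in values:
        if v is None:
            encoded.append(None)
        else:
            encoded.append(codes.setdefault(v, len(codes)))
    return encoded, codes

def usable_rows(rows):
    """Keep only rows with no missing input value, mirroring the rule that
    such observations are skipped during training."""
    return [row for row in rows if all(v is not None for v in row)]

def check_labels(labels, max_classes=500):
    """Raise ValueError if labels are not integers or exceed max_classes
    distinct values; otherwise return the number of classes."""
    if not all(isinstance(y, int) for y in labels):
        raise ValueError("class labels must be integers")
    n_classes = len(set(labels))
    if n_classes > max_classes:
        raise ValueError("too many classes")
    return n_classes

# Hypothetical input: one categorical column and one numeric column.
colors = ["red", "blue", None, "red"]
sizes = [1.5, 2.0, 3.1, None]
encoded, codes = encode_categories(colors)
rows = list(zip(encoded, sizes))
print(usable_rows(rows))  # only rows without a missing value remain
```

Integer coding is only one way to make a categorical column numeric; whether it or another encoding suits a given column is a modeling decision the sketch does not address.
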
TD_DecisionForest has several parameters that can be tuned to optimize performance, including the number of trees, the maximum depth of each tree, and the minimum number of samples required to split a node. The trees are constructed in parallel by all AMPs that have a non-empty partition of data.
  • When you specify the NumTrees value, the number of trees built by the function is adjusted as:
    Number_of_trees = Num_AMPs_with_data * (NumTrees/Num_AMPs_with_data)
  • To obtain the Num_AMPs_with_data value, use the SQL command SELECT HASHAMP()+1;.
  • When you do not specify the NumTrees value, the number of trees built by an AMP is calculated as:
    Number_of_AMP_trees = CoverageFactor * Num_Rows_AMP / TreeSize

    The number of trees built by the function is the sum of the Number_of_AMP_trees values across all AMPs.

    When a data set is small, best practice is to distribute the data to one AMP. To do this, create an identifier column as a primary index, and use the same value for each row.
  • The TreeSize value determines the sample size used to build a tree in the forest and depends on the memory available to the AMP. By default, this value is computed internally by the function. The function reserves approximately 40% of its available memory to store the input sample, while the rest is used to build the tree.
Processing time is controlled by the number of trees and the complexity of the trees. For example, changing CoverageFactor from 1.0 to 2.0 doubles the number of trees and increases the processing time of the query.
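
The two tree-count formulas above can be worked through in a short Python sketch. The formulas are taken from the text, but the rounding rule is an assumption: the text does not say how the division is rounded, so the sketch defaults to rounding down and lets you pass a different rule. The function names, AMP counts, row counts, and TreeSize value are hypothetical.

```python
import math

def trees_with_numtrees(num_trees, num_amps_with_data, rounding=math.floor):
    """Number_of_trees = Num_AMPs_with_data * (NumTrees / Num_AMPs_with_data).
    ASSUMPTION: the inner division is rounded to a whole number per AMP;
    floor is a guess, not a documented rule."""
    return num_amps_with_data * rounding(num_trees / num_amps_with_data)

def trees_without_numtrees(coverage_factor, rows_per_amp, tree_size,
                           rounding=math.floor):
    """Sum over AMPs of CoverageFactor * Num_Rows_AMP / TreeSize.
    ASSUMPTION: each AMP's count is rounded to a whole number."""
    return sum(rounding(coverage_factor * n / tree_size) for n in rows_per_amp)

# NumTrees specified: 10 requested trees on 4 AMPs that hold data.
print(trees_with_numtrees(10, 4))             # under the floor assumption
print(trees_with_numtrees(10, 4, math.ceil))  # under a round-up assumption

# NumTrees not specified: three AMPs with hypothetical row counts.
rows_per_amp = [1000, 1000, 500]
print(trees_without_numtrees(1.0, rows_per_amp, tree_size=250))
print(trees_without_numtrees(2.0, rows_per_amp, tree_size=250))
```

Under either rounding assumption, the adjusted tree count is a multiple of the number of AMPs with data, so it can differ from the NumTrees value you request. The last two calls show CoverageFactor 2.0 producing twice as many trees as 1.0 for the same data.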

The function uses a training dataset to create a predictive model. The TD_DecisionForestPredict function uses the model created by the TD_DecisionForest function for making predictions. See TD_DecisionForestPredict.

The following is an example of how to use TD_DecisionForest:
  1. Convert the categorical columns to numerical columns.
  2. Determine the parameters to use with the function, such as tree depth, model type, and number of trees.
  3. Use TD_DecisionForest on a training dataset to create a predictive model.
  4. Use the TD_DecisionForestPredict function with the model created by the TD_DecisionForest function to make predictions.