Teradata® Package for R Function Reference

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, VMware
Product: Teradata Package for R
Release Number: 17.20
Published: March 2024
Language: English (United States)
Last Update: 2024-05-03
dita:id: TeradataR_FxRef_Enterprise_1720
Product Category: Teradata Vantage

DecisionForest

Description

The decision forest model function is an ensemble algorithm used for classification and regression predictive modeling problems. It is an extension of bootstrap aggregation (bagging) of decision trees. Typically, constructing a decision tree involves evaluating the value for each input feature in the data to select a split point.

The function restricts the features considered at each split point to a random subset. This forces each decision tree in the forest to be different from the others, which can improve prediction accuracy.

The function uses a training dataset to create a predictive model. The td_decision_forest_predict_sqle() function uses the model created by the td_decision_forest_sqle() function for making predictions.

The function supports regression, binary classification, and multi-class classification.

Notes:

  • All input features must be numeric. Convert categorical columns to numeric columns as a preprocessing step.

  • For classification, class labels ("response.column" values) can only be integers.

  • Any observation with a missing value in an input column is skipped and not used for training. To assign values to missing entries, use either td_simple_impute_sqle() or td_fill_na_sqle() and valib. td_transform_sqle().
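
The preprocessing requirements in the notes above can be illustrated with a small local sketch. It uses a plain R data.frame as a stand-in for a tbl_teradata and base R only; the column names and values are made up for illustration, and the Vantage functions named in the notes are not called here.

```r
# Hypothetical local data; in practice the data lives in Vantage.
df <- data.frame(
  rm    = c(6.5, NA, 5.9, 7.1),
  chas  = c("yes", "no", "no", "yes"),
  label = c("low", "high", "low", "high"),
  stringsAsFactors = FALSE
)

# Categorical feature -> numeric feature (0/1 encoding).
df$chas_num <- as.integer(df$chas == "yes")

# Class labels -> integers, as classification requires integer labels.
df$label_int <- as.integer(factor(df$label, levels = c("low", "high"))) - 1L

# Rows with a missing input value would be skipped during training;
# this shows which rows would remain.
usable <- df[complete.cases(df[, c("rm", "chas_num")]), ]
```

Here `usable` keeps only the rows without a missing value in the input columns, which mirrors how incomplete observations are left out of training.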

The number of trees built by the function depends on the "num.trees", "tree.size", and "coverage.factor" values, and on the data distribution in the cluster. The trees are constructed in parallel by all AMPs that have a non-empty partition of data.

  • When you specify the "num.trees" value, the number of trees built by the function is adjusted as: "Number_of_trees = Num_AMPs_with_data * (num.trees/Num_AMPs_with_data)"

  • To find the number of AMPs with data, use td_hashamp_sqle().

  • When you do not specify the "num.trees" value, the number of trees built by an AMP is calculated as: "Number_of_AMP_trees = coverage.factor * Num_Rows_AMP / tree.size". The number of trees built by the function is the sum of the Number_of_AMP_trees values over all AMPs with data.

  • The "tree.size" value determines the sample size used to build a tree in the forest and depends on the memory available to the AMP. By default, this value is computed internally by the function. The function reserves approximately 40% of its available memory to store the input sample, while the rest is used to build the tree.
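
The two tree-count formulas above can be worked through in plain R. This is a minimal sketch, not the function's internal code: the helper names are hypothetical, and it assumes the division in the first formula is integer (truncating) division, which the text does not state explicitly. No rounding is applied in the second formula because none is documented.

```r
# Hypothetical helper: total trees when "num.trees" is specified.
# Assumption: num.trees / Num_AMPs_with_data is integer division.
trees_when_specified <- function(num.trees, amps.with.data) {
  stopifnot(num.trees >= amps.with.data)
  amps.with.data * (num.trees %/% amps.with.data)
}

# Hypothetical helper: trees built by one AMP when "num.trees" is not given.
trees_per_amp <- function(coverage.factor, rows.on.amp, tree.size) {
  coverage.factor * rows.on.amp / tree.size
}

total_specified <- trees_when_specified(num.trees = 10, amps.with.data = 4)
amp_default     <- trees_per_amp(coverage.factor = 1.0,
                                 rows.on.amp = 1000, tree.size = 250)
```

Under the integer-division assumption, asking for 10 trees on 4 AMPs with data gives 4 * 2 = 8 trees, so the requested count is effectively rounded down to a multiple of the number of AMPs with data.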

Usage

  td_decision_forest_sqle (
      formula = NULL,
      data = NULL,
      input.columns = NULL,
      response.column = NULL,
      max.depth = 5,
      num.trees = -1,
      min.node.size = 1,
      mtry = -1,
      mtry.seed = 1,
      seed = 1,
      tree.type = "REGRESSION",
      tree.size = -1,
      coverage.factor = 1.0,
      min.impurity = 0.0,
      ...
  )

Arguments

formula

Required Argument when "input.columns" and "response.column" are not provided, optional otherwise.
Specifies a string consisting of "formula". Specifies the model to be fitted.
Only basic formulas of the form "col1 ~ col2 + col3 +..." are supported, and all variables must be from the same tbl_teradata object. The response should be a column of type float, integer or logical.
Notes:

  • The function only accepts numeric features. Convert categorical features to numeric values before passing them to the formula.

  • If categorical features are passed to the formula, they are ignored and only numeric features are considered.

  • Provide either "formula" argument or "input.columns" and "response.column" arguments.

Types: character

data

Required Argument.
Specifies the input tbl_teradata.
Types: tbl_teradata

input.columns

Required Argument when "formula" is not provided, optional otherwise.
Specifies the names of the input tbl_teradata columns to be used for
training the model (predictors, features or independent variables).
Note:

  • Provide either "formula" argument or "input.columns" and "response.column" arguments.

Types: character OR vector of Strings (character)

response.column

Required Argument when "formula" is not provided, optional otherwise.
Specifies the name of the column containing the class label for
classification or target value (dependent variable) for regression.
Note:

  • Provide either "formula" argument or "input.columns" and "response.column" arguments.

Types: character

max.depth

Optional Argument.
Specifies a decision tree stopping criterion. If the tree reaches a depth past this value, the algorithm stops looking for splits. Decision trees can grow to (2^(max.depth+1)-1) nodes. This stopping criterion has the greatest effect on the performance of the function.
Note:

  • Must be a non-negative integer value.

Default Value: 5
Types: integer
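
As a quick check of the node bound quoted above, the following plain R sketch evaluates it for the default depth and for the depth used in the examples. The helper name is hypothetical.

```r
# Upper bound on the number of nodes in a binary tree of a given depth.
max_tree_nodes <- function(max.depth) {
  stopifnot(max.depth >= 0)
  2^(max.depth + 1) - 1
}

nodes_default <- max_tree_nodes(5)   # default "max.depth"
nodes_deep    <- max_tree_nodes(12)  # value used in the examples
```

The bound grows from 63 nodes at depth 5 to 8191 nodes at depth 12, which is why this argument has such a strong effect on run time.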

num.trees

Optional Argument.
Specifies the number of trees to grow in the forest model. When specified, the number of trees must be greater than or equal to the number of AMPs with data. By default, the function builds the minimum number of trees that provides the input dataset with coverage based on "coverage.factor".
Default Value: -1
Types: integer

min.node.size

Optional Argument.
Specifies the minimum number of observations in a tree node.
The algorithm stops splitting a node if the number of observations in the node is equal to or smaller than this value. You must specify a non-negative integer value.
Default Value: 1
Types: integer

mtry

Optional Argument.
Specifies the number of features from input columns for evaluating the best split of a node. A higher value improves the splitting and performance of a tree. A smaller value improves the robustness of the forest and prevents it from overfitting. When the value is -1, all variables are used for each split.
Default Value: -1
Types: integer
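
The idea of evaluating only a subset of features per split can be sketched locally in base R. This is an illustration under stated assumptions, not the SQL Engine implementation: the helper name is hypothetical, uniform sampling without replacement is assumed, and `set.seed()` merely plays the role that "mtry.seed" plays for the real function.

```r
# Hypothetical helper: pick the candidate features for one split.
pick_split_features <- function(features, mtry = -1) {
  if (mtry == -1 || mtry >= length(features)) {
    return(features)             # -1 means every feature is considered
  }
  sample(features, size = mtry)  # random subset, no repeats
}

features <- c("crim", "zn", "indus", "chas", "nox", "rm", "age")

set.seed(1)  # stands in for "mtry.seed" in this local sketch
subset_for_split <- pick_split_features(features, mtry = 3)
all_for_split    <- pick_split_features(features)
```

With `mtry = 3`, each split sees three of the seven features; with the default of -1, the full feature list is returned unchanged.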

mtry.seed

Optional Argument.
Specifies the random seed that the algorithm uses for the "mtry" argument.
Default Value: 1
Types: integer

seed

Optional Argument.
Specifies the random seed that the algorithm uses for repeatable results.
Default Value: 1
Types: integer

tree.type

Optional Argument.
Specifies whether the analysis is a regression (continuous response
variable) or a multiple-class classification (predicting result from the number of classes).
Default Value: "REGRESSION"
Permitted Values: 'REGRESSION', 'CLASSIFICATION'
Types: character

tree.size

Optional Argument.
Specifies the number of rows that each tree uses as its input dataset.
The function builds a tree using either the number of rows on an AMP, the number of rows that fit into the AMP's memory (whichever is less), or the number of rows given by the "tree.size" argument. By default, this value is the minimum of the number of rows on an AMP and the number of rows that fit into the AMP's memory.
Default Value: -1
Types: integer

coverage.factor

Optional Argument.
Specifies the level of coverage for the dataset while building trees, in percentage. For example, 1.25 = 125%.
Notes:

  • "coverage.factor" can only be used when "num.trees" is not specified.

  • When "num.trees" is specified, coverage depends on the value of the "num.trees".

  • When "num.trees" is not specified, the number of trees is chosen to achieve the level of coverage specified by this argument.

  • A higher coverage level will ensure a higher probability of each row in input data to be selected during the tree building process (at the cost of building more trees).

  • Because of internal sampling in bootstrapping, some rows may be chosen multiple times, and some not at all.

Default Value: 1.0
Types: float OR integer
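
The effect of bootstrapping on coverage can be simulated locally. The sketch below assumes rows are drawn uniformly with replacement and that the total number of draws is coverage.factor times the number of rows; both are simplifying assumptions for illustration, and the helper name is hypothetical.

```r
# Hypothetical helper: fraction of distinct rows hit by bootstrap draws.
covered_fraction <- function(coverage.factor, n.rows) {
  n.draws <- round(coverage.factor * n.rows)
  draws <- sample.int(n.rows, size = n.draws, replace = TRUE)
  length(unique(draws)) / n.rows
}

set.seed(1)
n.rows <- 1000
frac_100 <- covered_fraction(1.0, n.rows)  # 100% coverage factor
frac_200 <- covered_fraction(2.0, n.rows)  # 200% coverage factor
```

Under uniform sampling with replacement, the expected fraction of distinct rows is roughly 1 - exp(-coverage.factor), so even at 100% coverage a noticeable share of rows is never drawn, and raising the factor shrinks that share without eliminating it.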

min.impurity

Optional Argument.
Specifies the minimum impurity at which the tree stops splitting
further down. For regression, the squared error criterion is used, whereas for classification, Gini impurity is used.
Default Value: 0.0
Types: float OR integer
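
The two impurity measures named above can be computed by hand in base R. This is a sketch under assumptions: the helper names are hypothetical, squared error is taken to be the mean squared deviation from the node mean, and the stopping test uses "less than or equal to", none of which is spelled out in the text.

```r
# Gini impurity of the class labels in a node: 1 - sum(p_k^2).
gini_impurity <- function(labels) {
  p <- as.numeric(table(labels)) / length(labels)
  1 - sum(p^2)
}

# Squared error of a node (assumed: mean squared deviation from the mean).
squared_error <- function(y) {
  mean((y - mean(y))^2)
}

# Assumed stopping rule: stop once impurity is at or below the threshold.
should_stop <- function(impurity, min.impurity = 0.0) {
  impurity <= min.impurity
}

gini_mixed <- gini_impurity(c(0, 0, 1, 1))
gini_pure  <- gini_impurity(c(1, 1, 1))
sq_err     <- squared_error(c(1, 3))
```

A pure node has impurity 0 and therefore stops splitting even with the default threshold of 0.0, while an evenly mixed binary node has Gini impurity 0.5 and keeps splitting unless "min.impurity" is raised.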

...

Specifies the generic keyword arguments SQLE functions accept. Below
are the generic keyword arguments:
persist:
Optional Argument.
Specifies whether to persist the results of the
function in a table or not. When set to TRUE, results are persisted in a table; otherwise, results are garbage collected at the end of the session.
Default Value: FALSE
Types: logical

volatile:
Optional Argument.
Specifies whether to put the results of the
function in a volatile table or not. When set to TRUE, results are stored in a volatile table, otherwise not.
Default Value: FALSE
Types: logical

The function allows the user to partition, hash, order or local order the input data. These generic arguments are available for each argument that accepts a tbl_teradata as input and can be accessed as:

  • "<input.data.arg.name>.partition.column" accepts character or vector of character (Strings)

  • "<input.data.arg.name>.hash.column" accepts character or vector of character (Strings)

  • "<input.data.arg.name>.order.column" accepts character or vector of character (Strings)

  • "local.order.<input.data.arg.name>" accepts logical

Note:
These generic arguments are supported by tdplyr if the underlying SQL Engine function supports them; otherwise, an exception is raised.

Value

The function returns an object of class "td_decision_forest_sqle", which is a named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator using the name: result

Examples

  
    
    # Get the current context/connection.
    con <- td_get_context()$connection
    
    # Load the example data.
    loadExampleData("pmmlpredict_example", "boston")
    
    # Create tbl_teradata object.
    boston_sample <- tbl(con, "boston")
    
    # Check the list of available analytic functions.
    display_analytic_functions()
    
    # Example 1 : Generate decision forest regression model using
    #             input tbl_teradata, input.columns and response.column
    #             instead of formula.
    decisionforest_out <- td_decision_forest_sqle(
                            data = boston_sample,
                            input.columns = c('crim', 'zn', 'indus', 'chas',
                                               'nox', 'rm','age', 'dis', 'rad',
                                               'tax', 'ptratio',
                                               'black', 'lstat'),
                            response.column = 'medv',
                            max.depth = 12,
                            num.trees = 4,
                            min.node.size = 1,
                            mtry = 3,
                            mtry.seed = 1,
                            seed = 1,
                            tree.type = 'REGRESSION')
    # Print the result.
    print(decisionforest_out$result)
    
    # Example 2 : Generate decision forest regression model using
    #             input tbl_teradata and provided formula.
    decisionforest_out <- td_decision_forest_sqle(
                            data = boston_sample,
                            formula = medv ~ crim + zn + indus + chas + nox
                                      + rm + age + dis + rad + tax + ptratio
                                      + black + lstat,
                            max.depth = 12,
                            num.trees = 4,
                            min.node.size = 1,
                            mtry = 3,
                            mtry.seed = 1,
                            seed = 1,
                            tree.type = 'REGRESSION')
    
    # Print the result.
    print(decisionforest_out$result)