Teradata R Package Function Reference - DecisionForest

Teradata® R Package Function Reference

Product: Teradata R Package
Release Number: 16.20
Published: February 2020
Language: English (United States)
Last Update: 2020-02-28
dita:id: B700-4007
lifecycle: previous
Product Category: Teradata Vantage

Description

The DecisionForest function uses a training data set to generate a predictive model. You can pass the model to the DecisionForestPredict function, which uses it to make predictions.

Usage

  td_decision_forest_mle (
      formula = NULL,
      data = NULL,
      maxnum.categorical = 20,
      tree.type = NULL,
      ntree = NULL,
      tree.size = NULL,
      nodesize = 1,
      variance = 0,
      max.depth = 12,
      mtry = NULL,
      mtry.seed = NULL,
      seed = NULL,
      outofbag = FALSE,
      display.num.processed.rows = FALSE,
      categorical.encoding = "graycode",
      data.sequence.column = NULL
  )

Arguments

formula

Required Argument.
An object of class "formula". Specifies the model to be fitted. Only basic formulas of the form (col1 ~ col2 + col3 + ...) are supported, and all variables must come from the same tbl_teradata object. The response should be a column of type real, numeric, integer or boolean.
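As a minimal sketch of the supported formula shape, the base R snippet below builds a formula and splits it into response and predictors. The column names are borrowed from the housing_train example on this page; the variable names f, response and predictors are illustrative only, and nothing here contacts Vantage.

```r
# Build a basic formula of the supported form: response ~ predictor + predictor + ...
f <- homestyle ~ bedrooms + lotsize + price

# Confirm that the object has class "formula".
stopifnot(inherits(f, "formula"))

# all.vars() lists the variables in order of appearance:
# the first is the response, the rest are the predictors.
vars <- all.vars(f)
response <- vars[1]
predictors <- vars[-1]

print(response)
print(predictors)
```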

data

Required Argument.
Specifies the tbl_teradata containing the input data set.

maxnum.categorical

Optional Argument.
Specifies the maximum number of distinct values for a single categorical variable. The value of maxnum.categorical must be a positive numeric. A value greater than 20 is not recommended.
Default Value: 20
Types: numeric

tree.type

Optional Argument.
Specifies whether the analysis is a regression (continuous response variable) or a multiclass classification (the response is one of a fixed number of classes). The default value is "regression" if the response variable is numeric and "classification" if the response variable is nonnumeric.
Types: character

ntree

Optional Argument.
Specifies the number of trees to grow in the forest model. When specified, ntree must be greater than or equal to the number of vworkers. When not specified, the function builds the minimum number of trees that provides full coverage of the input data set.
Types: numeric

tree.size

Optional Argument.
Specifies the number of rows that each tree uses as its input data set. If not specified, the function builds a tree using either the number of rows on a vworker or the number of rows that fit into the memory of the vworker, whichever is less.
Types: numeric

nodesize

Optional Argument.
Specifies a decision tree stopping criterion, i.e., the minimum size of any node within each decision tree.
Default Value: 1
Types: numeric

variance

Optional Argument.
Specifies a decision tree stopping criterion. If the variance within any node dips below this value, the algorithm stops looking for splits in the branch.
Default Value: 0
Types: numeric

max.depth

Optional Argument.
Specifies a decision tree stopping criterion. If the tree reaches a depth past this value, the algorithm stops looking for splits.
Decision trees can grow to (2^(max.depth + 1) - 1) nodes. This stopping criterion has the greatest effect on the performance of the function.
Default Value: 12
Types: numeric
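To get a feel for how quickly the node bound grows with depth, the sketch below evaluates 2^(max.depth + 1) - 1, the size of a full binary tree of that depth. The helper name max_nodes is hypothetical and is not part of the package.

```r
# Hypothetical helper: upper bound on the number of nodes in a binary
# tree whose depth is limited to max.depth.
max_nodes <- function(max.depth) {
  2^(max.depth + 1) - 1
}

# Compare the default depth (12) with the shallower depth used in Example 3 (6).
depths <- c(6, 12)
bounds <- max_nodes(depths)

print(data.frame(max.depth = depths, max.nodes = bounds))
```

Each extra level of depth roughly doubles the bound, which is consistent with the note that this criterion has the greatest effect on performance.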

mtry

Optional Argument.
Specifies the number of input variables to randomly sample at each split.
For example, if mtry is 3, the function randomly samples 3 variables from the input at each split. The value of mtry must be numeric.
Types: numeric

mtry.seed

Optional Argument.
Specifies a numeric value to use in determining the random seed for mtry.
Types: numeric
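The idea behind mtry and mtry.seed can be illustrated in base R. This is only a conceptual sketch using sample() and set.seed(); it is not the function's internal implementation, and the predictor names are taken from the housing_train example on this page.

```r
# Candidate predictors (a subset of the housing_train columns used in the examples).
predictors <- c("bedrooms", "lotsize", "gashw", "driveway", "stories", "recroom")
mtry <- 3

# Fixing the seed plays a role analogous to mtry.seed:
# the same seed yields the same random subset of variables.
set.seed(100)
first_draw <- sample(predictors, size = mtry)

set.seed(100)
second_draw <- sample(predictors, size = mtry)

print(first_draw)
print(identical(first_draw, second_draw))
```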

seed

Optional Argument.
Specifies a numeric value to use in determining the seed for the random number generator. If you specify this value, you can specify the same value in future calls to this function and the function will build the same tree.
Types: numeric

outofbag

Optional Argument.
Specifies whether to output the out-of-bag estimate of error rate.
Default Value: FALSE
Types: logical

display.num.processed.rows

Optional Argument.
Specifies whether to display the number of processed rows of the input table.
Default Value: FALSE
Types: logical

categorical.encoding

Optional Argument.
Specifies which encoding method is used for categorical variables.
Note: The "categorical.encoding" argument is supported only when tdplyr is connected to Vantage 1.1 or later.
Default Value: "graycode"
Permitted Values: graycode, hashing
Types: character

data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

Value

Function returns an object of class "td_decision_forest_mle", which is a named list containing Teradata tbl objects. Named list members can be referenced directly with the "$" operator using the following names:
1. predictive.model
2. monitor.table
3. output
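The access pattern can be sketched without a Vantage connection by using a stand-in object. In the snippet below, the empty local data frames are placeholders (an assumption for illustration); a real call returns Teradata tbl objects in these slots.

```r
# Stand-in for a returned object. The member names come from this page;
# the empty data frames are hypothetical placeholders for Teradata tbl objects.
result <- structure(
  list(
    predictive.model = data.frame(),
    monitor.table    = data.frame(),
    output           = data.frame()
  ),
  class = "td_decision_forest_mle"
)

# List the member names, then reference one member with the "$" operator.
member_names <- names(result)
print(member_names)
model <- result$predictive.model
```

With a real result, such as td_decision_forest_out1 from the examples on this page, the same "$" access applies.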

Examples

    # Get the current context/connection
    con <- td_get_context()$connection
    
    # Load example data.
    loadExampleData("decisionforest_example", "housing_train", "boston")
    
    # Create remote tibble objects.
    housing_train <- tbl(con, "housing_train")
    boston <- tbl(con, "boston")
    
    # Example 1 - Classification forest with fixed seeds (mtry.seed and seed).
    td_decision_forest_out1 <- td_decision_forest_mle(formula = (homestyle ~ bedrooms + lotsize + gashw + driveway +
                                          stories + recroom + price + garagepl + bathrms + fullbase + airco +
                                          prefarea),
                               data = housing_train,
                               tree.type = "classification",
                               ntree = 50,
                               nodesize = 1,
                               variance = 0.0,
                               max.depth = 12,
                               mtry = 3,
                               mtry.seed = 100,
                               seed = 100
                               )
    
    # Example 2 - Classification forest that also outputs the out-of-bag error estimate.
    td_decision_forest_out2 <- td_decision_forest_mle(formula = (homestyle ~ bedrooms + lotsize + gashw + driveway +
                                          stories + recroom + price + garagepl + bathrms + fullbase + airco +
                                          prefarea),
                               data = housing_train,
                               tree.type = "classification",
                               ntree = 50,
                               nodesize = 2,
                               max.depth = 12,
                               mtry = 3,
                               outofbag = TRUE
                               )
    
    # Example 3 - Regression forest on the boston data set with the out-of-bag error estimate.
    td_decision_forest_out3 <- td_decision_forest_mle(formula = (medv ~ indus + ptratio + lstat + black + tax + dis + zn +
                                          rad + nox + chas + rm + crim + age),
                               data = boston,
                               tree.type = "regression",
                               ntree = 50,
                               nodesize = 2,
                               max.depth = 6,
                               outofbag = TRUE
                               )