Description
The Forest Drive function uses a training data set to generate a
predictive model. You can input the model to the Forest Predict
function, which uses it to make predictions.
Usage
td_decision_forest_mle (
    formula = NULL,
    data = NULL,
    maxnum.categorical = 20,
    tree.type = NULL,
    ntree = NULL,
    tree.size = NULL,
    nodesize = 1,
    variance = 0,
    max.depth = 12,
    mtry = NULL,
    mtry.seed = NULL,
    seed = NULL,
    outofbag = FALSE,
    display.num.processed.rows = FALSE,
    categorical.encoding = "graycode",
    data.sequence.column = NULL
)
Arguments
formula |
Required Argument.
An object of class "formula". Specifies the model to be fitted. Only
basic formulas of the form (col1 ~ col2 + col3 + ...) are supported, and
all variables must be from the same tbl_teradata object. The
response should be a column of type real, numeric, integer or boolean.
|
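For models with many predictors, a formula of the supported form can be assembled programmatically. The sketch below uses base R's reformulate() to build one from a character vector. The column names are borrowed from the housing_train example, and passing the result as the "formula" argument is an assumption based only on the requirement that the argument be an object of class "formula".

```r
# Predictor and response names (taken from the housing_train example).
predictors <- c("bedrooms", "lotsize", "gashw", "driveway")
response <- "homestyle"

# reformulate() (stats package, attached by default) joins the term labels
# with "+" and places the response on the left-hand side of "~".
model_formula <- reformulate(predictors, response = response)

print(model_formula)
class(model_formula)     # "formula"
all.vars(model_formula)  # response first, then the predictors
```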
data |
Required Argument.
Specifies the tbl_teradata containing the input data set.
|
maxnum.categorical |
Optional Argument.
Specifies the maximum number of distinct values for a single
categorical variable. The maxnum.categorical value must be a positive
numeric. A value greater than 20 is not recommended.
Default Value: 20
Types: numeric
|
tree.type |
Optional Argument.
Specifies whether the analysis is a regression (continuous response
variable) or a multiclass classification (predicting a result from a
number of classes). The default value is "regression" if the response
variable is numeric and "classification" if the response variable is
nonnumeric.
Types: character
|
ntree |
Optional Argument.
Specifies the number of trees to grow in the forest model. When
specified, ntree must be greater than or equal to the number of
vworkers. When not specified, the function builds the minimum number
of trees that provides the input data set with full coverage.
Types: numeric
|
tree.size |
Optional Argument.
Specifies the number of rows that each tree uses as its input data
set. If not specified, the function builds a tree using either the
number of rows on a vworker or the number of rows that fit into the
memory of vworker, whichever is less.
Types: numeric
|
nodesize |
Optional Argument.
Specifies a decision tree stopping criterion, i.e., the minimum
size of any node within each decision tree.
Default Value: 1
Types: numeric
|
variance |
Optional Argument.
Specifies a decision tree stopping criterion. If the variance within
any node dips below this value, the algorithm stops looking for splits
in the branch.
Default Value: 0
Types: numeric
|
max.depth |
Optional Argument.
Specifies a decision tree stopping criterion. If the tree reaches a
depth past this value, the algorithm stops looking for splits.
Decision trees can grow to (2^(max.depth + 1) - 1) nodes. This stopping
criterion has the greatest effect on the performance of the function.
Default Value: 12
Types: numeric
|
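To get a feel for how quickly the node bound grows with max.depth, the sketch below evaluates 2^(max.depth + 1) - 1 for a few depths. The helper name max_nodes is hypothetical and not part of tdplyr; it only restates the arithmetic given above.

```r
# Upper bound on the number of nodes in a binary tree of the given depth.
# max_nodes is an illustrative helper, not a tdplyr function.
max_nodes <- function(max.depth) {
  2^(max.depth + 1) - 1
}

max_nodes(6)   # 127
max_nodes(12)  # 8191 (the default max.depth)

# Each extra level roughly doubles the bound.
data.frame(max.depth = c(4, 6, 8, 12), max.nodes = max_nodes(c(4, 6, 8, 12)))
```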
mtry |
Optional Argument.
Specifies the number of variables to randomly sample from each
input value. For example, if mtry is 3, then the function randomly
samples 3 variables from each input at each split. The mtry value must
be numeric.
Types: numeric
|
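The idea behind mtry can be illustrated locally with base R. The sketch below draws 3 candidate variables out of the 12 predictors used in the housing_train examples, as would happen at a single split. It uses R's own random number generator, so it is only an analogy for what the function does inside Vantage; the actual sampling procedure and the way mtry.seed is applied are not described in this page.

```r
# The 12 predictors from the housing_train examples.
predictors <- c("bedrooms", "lotsize", "gashw", "driveway", "stories",
                "recroom", "price", "garagepl", "bathrms", "fullbase",
                "airco", "prefarea")
mtry <- 3

# Draw mtry candidate variables without replacement, as for one split.
set.seed(100)
candidates_a <- sample(predictors, mtry)

# Re-seeding with the same value reproduces the same draw, which is the
# local analogue of fixing a seed for repeatable sampling.
set.seed(100)
candidates_b <- sample(predictors, mtry)

length(candidates_a)                   # 3
identical(candidates_a, candidates_b)  # TRUE
```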
mtry.seed |
Optional Argument.
Specifies a numeric value to use in determining the random seed for
mtry.
Types: numeric
|
seed |
Optional Argument.
Specifies a numeric value to use in determining the seed for the
random number generator. If you specify this value, you can specify
the same value in future calls to this function and the function will
build the same tree.
Types: numeric
|
outofbag |
Optional Argument.
Specifies whether to output the out-of-bag estimate of error rate.
Default Value: FALSE
Types: logical
|
display.num.processed.rows |
Optional Argument.
Specifies whether to display the number of processed rows of the input
table.
Default Value: FALSE
Types: logical
|
categorical.encoding |
Optional Argument.
Specifies which encoding method is used for categorical variables.
Note: "categorical.encoding" argument support is only available when tdplyr is
connected to Vantage 1.1 or later versions.
Default Value: "graycode"
Permitted Values: graycode, hashing
Types: character
|
data.sequence.column |
Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row
of the input argument "data". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: character OR vector of Strings (character)
|
Value
The function returns an object of class "td_decision_forest_mle", which
is a named list containing Teradata tbl objects.
Named list members can be referenced directly with the "$" operator
using the following names:
1. predictive.model
2. monitor.table
3. output
Examples
# Get the current context/connection
con <- td_get_context()$connection
# Load example data.
loadExampleData("decisionforest_example", "housing_train", "boston")
# Create remote tibble objects.
housing_train <- tbl(con, "housing_train")
boston <- tbl(con, "boston")
# Example 1 - Classification forest with explicit stopping criteria and fixed seeds.
td_decision_forest_out1 <- td_decision_forest_mle(
    formula = (homestyle ~ bedrooms + lotsize + gashw + driveway +
               stories + recroom + price + garagepl + bathrms + fullbase +
               airco + prefarea),
    data = housing_train,
    tree.type = "classification",
    ntree = 50,
    nodesize = 1,
    variance = 0.0,
    max.depth = 12,
    mtry = 3,
    mtry.seed = 100,
    seed = 100
)
# Example 2 - Classification forest that also outputs the out-of-bag error estimate.
td_decision_forest_out2 <- td_decision_forest_mle(
    formula = (homestyle ~ bedrooms + lotsize + gashw + driveway +
               stories + recroom + price + garagepl + bathrms + fullbase +
               airco + prefarea),
    data = housing_train,
    tree.type = "classification",
    ntree = 50,
    nodesize = 2,
    max.depth = 12,
    mtry = 3,
    outofbag = TRUE
)
# Example 3 - Regression forest on the boston data set with the out-of-bag error estimate.
td_decision_forest_out3 <- td_decision_forest_mle(
    formula = (medv ~ indus + ptratio + lstat + black + tax + dis + zn +
               rad + nox + chas + rm + crim + age),
    data = boston,
    tree.type = "regression",
    ntree = 50,
    nodesize = 2,
    max.depth = 6,
    outofbag = TRUE
)