DecisionForest
Description
The decision forest model function is an ensemble algorithm used for
classification and regression predictive modeling problems.
It is an extension of bootstrap aggregation (bagging) of decision trees.
Typically, constructing a decision tree involves evaluating the value
for each input feature in the data to select a split point.
The function instead restricts each split point to a random subset
of the features.
This forces the decision trees in the forest to differ from one another,
which can improve prediction accuracy.
The function uses a training dataset to create a predictive model.
The td_decision_forest_predict_sqle() function uses the model
created by the td_decision_forest_sqle() function for making
predictions.
The function supports regression, binary, and multi-class classification.
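The random feature subsetting described above can be illustrated with a few lines of base R. This is only a local sketch of the idea, not the function's internal implementation: the variable names are hypothetical, the feature names are borrowed from the "boston" example data, and the subset size of 3 mirrors the "mtry" value used in the examples.

```r
# Illustration only: draw a random subset of features to consider at one split.
# This runs locally in base R and does not call any Teradata function.
set.seed(1)

features <- c("crim", "zn", "indus", "chas", "nox", "rm", "age",
              "dis", "rad", "tax", "ptratio", "black", "lstat")

# Hypothetical subset size, playing the role of "mtry".
subset_size <- 3

# Each split point would see a different random draw like this one.
candidate_features <- sample(features, size = subset_size)
print(candidate_features)
```

Because every split draws its own subset, two trees grown on similar samples can still choose different split features, which is what makes the trees in the forest differ.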
Notes:
All input features must be numeric. Convert categorical columns to numerical columns as a preprocessing step.
For classification, class labels ("response.column" values) can only be integers.
Any observation with a missing value in an input column is skipped and not used for training. One can use either td_simple_impute_sqle() or td_fill_na_sqle() and valib.td_transform_sqle() functions to assign values in place of missing ones.
The number of trees built by the function depends on the "num.trees", "tree.size", and "coverage.factor" values, and on the data distribution in the cluster. The trees are constructed in parallel by all AMPs that have a non-empty partition of data.
When you specify the "num.trees" value, the number of trees built by the function is adjusted as: "Number_of_trees = Num_AMPs_with_data * (num.trees/Num_AMPs_with_data)"
To find the number of AMPs with data, use td_hashamp_sqle().
When you do not specify the "num.trees" value, the number of trees built by an AMP is calculated as: "Number_of_AMP_trees = coverage.factor * Num_Rows_AMP / tree.size". The number of trees built by the function is the sum of Number_of_AMP_trees over all AMPs with data.
The "tree.size" value determines the sample size used to build a tree in the forest and depends on the memory available to the AMP. By default, this value is computed internally by the function. The function reserves approximately 40% of its available memory to store the input sample, while the rest is used to build the tree.
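The per-AMP tree count from the notes above (the case where "num.trees" is not specified) can be worked through in base R. This is a minimal sketch under stated assumptions: the per-AMP row counts and the "tree.size" value are hypothetical, and the results are left unrounded because the notes do not say how the function rounds fractional counts.

```r
# Hypothetical number of rows held by each AMP; real values depend on
# how the data is distributed in the cluster.
rows_per_amp <- c(amp1 = 1200, amp2 = 800, amp3 = 0, amp4 = 1000)

coverage_factor <- 1.0  # default "coverage.factor"
tree_size <- 100        # hypothetical; by default computed internally

# Only AMPs with a non-empty partition of data build trees.
rows_with_data <- rows_per_amp[rows_per_amp > 0]

# Number_of_AMP_trees = coverage.factor * Num_Rows_AMP / tree.size
trees_per_amp <- coverage_factor * rows_with_data / tree_size

# The forest size is the sum of the per-AMP counts.
total_trees <- sum(trees_per_amp)

print(trees_per_amp)
print(total_trees)
```

With these made-up inputs, three AMPs hold data and contribute 12, 8, and 10 trees, so the forest has 30 trees. Doubling "coverage.factor" doubles every per-AMP count, and doubling "tree.size" halves it.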
Usage
td_decision_forest_sqle (
    formula = NULL,
    data = NULL,
    input.columns = NULL,
    response.column = NULL,
    max.depth = 5,
    num.trees = -1,
    min.node.size = 1,
    mtry.seed = 1,
    seed = 1,
    tree.type = "REGRESSION",
    tree.size = -1,
    coverage.factor = 1.0,
    min.impurity = 0.0,
    ...
)
Arguments
formula |
Required Argument when "input.columns" and "response.column"
are not provided, optional otherwise.
Types: character |
data |
Required Argument. |
input.columns |
Required Argument when "formula" is not provided, optional otherwise.
Types: character OR vector of Strings (character) |
response.column |
Required Argument when "formula" is not provided, optional otherwise.
Types: character |
max.depth |
Optional Argument. |
num.trees |
Optional Argument. |
min.node.size |
Optional Argument. |
mtry |
Optional Argument. |
mtry.seed |
Optional Argument. |
seed |
Optional Argument. |
tree.type |
Optional Argument. |
tree.size |
Optional Argument. |
coverage.factor |
Optional Argument.
Default Value: 1.0 |
min.impurity |
Optional Argument. |
... |
Specifies the generic keyword arguments SQLE functions accept. Below volatile: Function allows the user to partition, hash, order or local order the input data. These generic arguments are available for each argument that accepts tbl_teradata as input and can be accessed as:
Note: |
Value
Function returns an object of class "td_decision_forest_sqle",
which is a named list containing an object of class "tbl_teradata".
Named list member(s) can be referenced directly with the "$" operator
using the name(s): result
Examples
# Get the current context/connection.
con <- td_get_context()$connection
# Load the example data.
loadExampleData("pmmlpredict_example", "boston")
# Create tbl_teradata object.
boston_sample <- tbl(con, "boston")
# Check the list of available analytic functions.
display_analytic_functions()
# Example 1 : Generate decision forest regression model using
# input tbl_teradata, input.columns and response.column
# instead of formula.
decisionforest_out <- td_decision_forest_sqle(
    data = boston_sample,
    input.columns = c('crim', 'zn', 'indus', 'chas',
                      'nox', 'rm', 'age', 'dis', 'rad',
                      'tax', 'ptratio',
                      'black', 'lstat'),
    response.column = 'medv',
    max.depth = 12,
    num.trees = 4,
    min.node.size = 1,
    mtry = 3,
    mtry.seed = 1,
    seed = 1,
    tree.type = 'REGRESSION')
# Print the result.
print(decisionforest_out$result)
# Example 2 : Generate decision forest regression model using
# input tbl_teradata and provided formula.
decisionforest_out <- td_decision_forest_sqle(
    data = boston_sample,
    formula = medv ~ crim + zn + indus + chas + nox
              + rm + age + dis + rad + tax + ptratio + black
              + lstat,
    max.depth = 12,
    num.trees = 4,
    min.node.size = 1,
    mtry = 3,
    mtry.seed = 1,
    seed = 1,
    tree.type = 'REGRESSION')
# Print the result.
print(decisionforest_out$result)