Description
The Decision Tree (td_decision_tree_mle) function creates a single
decision tree in a distributed fashion, either weighted or unweighted.
The model table that this function outputs can be input to the function
Decision Tree Predict (td_decision_tree_predict_sqle).
Usage
td_decision_tree_mle (
data = NULL,
attribute.name.columns = NULL,
attribute.value.column = NULL,
id.columns = NULL,
attribute.table = NULL,
response.table = NULL,
response.column = NULL,
categorical.attribute.table = NULL,
splits.table = NULL,
split.value = NULL,
num.splits = 10,
approx.splits = TRUE,
nodesize = 100,
max.depth = 30,
weighted = FALSE,
weight.column = NULL,
split.measure = "gini",
output.response.probdist = FALSE,
response.probdist.type = "Laplace",
categorical.encoding = "graycode",
attribute.table.sequence.column = NULL,
data.sequence.column = NULL,
categorical.attribute.table.sequence.column = NULL,
response.table.sequence.column = NULL,
splits.table.sequence.column = NULL
)
Arguments
data |
Optional Argument. Required if you omit "attribute.table" and
"response.table" arguments.
Specifies the name of the tbl_teradata that contains the input data set.
|
attribute.name.columns |
Required Argument.
Specifies the names of the attribute tbl_teradata columns that define the
attribute.
Types: character OR vector of Strings (character)
|
attribute.value.column |
Required Argument.
Specifies the name of the attribute tbl_teradata column that defines the
value.
Types: character
|
id.columns |
Required Argument.
Specifies the names of the columns in the response and attribute
tbl_teradata objects that specify the ID of the instance.
Types: character OR vector of Strings (character)
|
attribute.table |
Optional Argument. Required if you omit "data" argument.
Specifies the name of the tbl_teradata that contains the attribute names and
the values.
|
response.table |
Optional Argument. Required if you omit "data" argument.
Specifies the name of the tbl_teradata that contains the response values.
|
response.column |
Required Argument.
Specifies the name of the response tbl_teradata column that contains the
response variable.
Types: character
|
categorical.attribute.table |
Optional Argument.
Specifies the name of input tbl_teradata that contains categorical attributes.
|
splits.table |
Optional Argument.
Specifies the name of the input tbl_teradata that contains the
user-specified splits. By default, the function creates new splits.
|
split.value |
Optional Argument.
If you specify the argument "splits.table", this argument specifies the
name of the column that contains the split value. If "approx.splits" is TRUE,
the default value is splits_valcol; otherwise, the default is the value of
the "attribute.value.column" argument, node_column.
Types: character
|
num.splits |
Optional Argument.
Specifies the number of splits to consider for each variable. The
function does not consider all possible splits for all attributes.
Default Value: 10
Types: numeric
|
approx.splits |
Optional Argument.
Specifies whether to use approximate percentiles (TRUE) or exact
percentiles (FALSE).
Default Value: TRUE
Types: logical
|
nodesize |
Optional Argument.
Specifies the decision tree stopping criterion and the minimum size
of any particular node within each decision tree.
Default Value: 100
Types: numeric
|
max.depth |
Optional Argument.
Specifies a decision tree stopping criterion. If the tree reaches a
depth past this value, the algorithm stops looking for splits.
Decision trees can grow up to (2^(max_depth + 1) - 1) nodes. This
stopping criterion has the greatest effect on function performance.
The maximum value is 60.
Default Value: 30
Types: numeric
|
weighted |
Optional Argument.
Specifies whether to build a weighted decision tree. If you specify
"TRUE", then you must also specify the "weight.column" argument.
Default Value: FALSE
Types: logical
|
weight.column |
Optional Argument.
Specifies the name of the response table column that contains the
weights of the attribute values.
Types: character
|
split.measure |
Optional Argument.
Specifies the impurity measurement to use while constructing the
decision tree.
Default Value: "gini"
Permitted Values: GINI, ENTROPY, CHISQUARE
Types: character
|
output.response.probdist |
Optional Argument.
Specifies whether to output the probability distribution for output
labels.
Default Value: FALSE
Types: logical
Note: This argument accepts the value TRUE only when tdplyr is
connected to Vantage 1.0 Maintenance Update 2 or later.
|
response.probdist.type |
Optional Argument.
Specifies the type of algorithm to use to generate output probability
distribution for output labels.
Default Value: "Laplace"
Permitted Values: Laplace, Frequency, RawCount
Types: character
Note: This argument can only be used when "output.response.probdist" is
set to TRUE.
|
categorical.encoding |
Optional Argument.
Specifies which encoding method is used for categorical variables.
Default Value: "graycode"
Permitted Values: graycode, hashing
Types: character
Note: This argument is supported when tdplyr is connected to Vantage 1.1
or later versions.
|
attribute.table.sequence.column |
Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row
of the input argument "attribute.table". The argument is used to
ensure deterministic results for functions which produce results that
vary from run to run.
Types: character OR vector of Strings (character)
|
data.sequence.column |
Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row
of the input argument "data". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: character OR vector of Strings (character)
|
categorical.attribute.table.sequence.column |
Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row
of the input argument "categorical.attribute.table". The argument is
used to ensure deterministic results for functions which produce
results that vary from run to run.
Types: character OR vector of Strings (character)
|
response.table.sequence.column |
Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row
of the input argument "response.table". The argument is used to
ensure deterministic results for functions which produce results that
vary from run to run.
Types: character OR vector of Strings (character)
|
splits.table.sequence.column |
Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row
of the input argument "splits.table". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: character OR vector of Strings (character)
|
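Two of the argument descriptions above involve small calculations that are
easy to check locally. The following is a minimal sketch in plain R: it
evaluates the node bound quoted for "max.depth" and computes textbook Gini
and entropy impurity for a vector of class labels. The helper names
(max_tree_nodes, class_proportions, gini_impurity, entropy_impurity) are
hypothetical and not part of tdplyr, and the impurity formulas are the
standard definitions, which may differ in detail from what the function
computes internally.

```r
# Upper bound on the number of nodes for a given max.depth, using the
# formula from the "max.depth" description: 2^(max_depth + 1) - 1.
max_tree_nodes <- function(max.depth) {
  stopifnot(is.numeric(max.depth), length(max.depth) == 1,
            max.depth >= 0,    # lower bound is an assumption of this sketch
            max.depth <= 60)   # documented maximum
  2^(max.depth + 1) - 1
}

# Proportion of each class that actually occurs in a vector of labels.
class_proportions <- function(labels) {
  counts <- table(labels)
  counts <- counts[counts > 0]   # drop unused factor levels
  as.numeric(counts) / sum(counts)
}

# Textbook Gini impurity: 1 - sum(p_i^2).
gini_impurity <- function(labels) {
  1 - sum(class_proportions(labels)^2)
}

# Textbook entropy in bits: -sum(p_i * log2(p_i)).
entropy_impurity <- function(labels) {
  p <- class_proportions(labels)
  -sum(p * log2(p))
}

max_tree_nodes(10)   # 2047
max_tree_nodes(30)   # 2147483647, the bound at the default max.depth

labels <- c("setosa", "setosa", "virginica", "virginica")
gini_impurity(labels)      # 0.5
entropy_impurity(labels)   # 1
```

The bound grows exponentially with depth, which is consistent with the
statement that "max.depth" has the greatest effect on function performance.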
Value
The function returns an object of class "td_decision_tree_mle", which is a
named list containing Teradata tbl objects.
Named list members can be referenced directly with the "$" operator
using the following names:
model.table
intermediate.splits.table
final.response.tableto
output
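As an illustration of referencing the list members, here is a minimal sketch
of a small accessor. The helper name get_tree_output is hypothetical and not
part of tdplyr; it relies only on the class name and member names listed
above. Writing tree_out$model.table directly is equivalent for a fixed name.

```r
# Hypothetical helper (not part of tdplyr): returns one named member of a
# "td_decision_tree_mle" result after checking the class and member name.
get_tree_output <- function(tree_out, member = "model.table") {
  stopifnot(inherits(tree_out, "td_decision_tree_mle"))
  members <- c("model.table", "intermediate.splits.table",
               "final.response.tableto", "output")
  member <- match.arg(member, members)   # errors on an unknown name
  tree_out[[member]]
}

# Usage, assuming `fit` holds the value returned by td_decision_tree_mle():
#   get_tree_output(fit)             # same as fit$model.table
#   get_tree_output(fit, "output")   # same as fit$output
```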
Examples
# Get the current context/connection
con <- td_get_context()$connection
# Load example data.
loadExampleData("decision_tree_example", "iris_attribute_train", "iris_response_train", "iris_altinput")
# Create remote tibble objects.
iris_attribute_train <- tbl(con, "iris_attribute_train")
iris_response_train <- tbl(con, "iris_response_train")
iris_altinput <- tbl(con, "iris_altinput")
# Example 1 - Create decision tree by specifying attribute and response tables.
td_decision_tree_out1 <- td_decision_tree_mle(attribute.name.columns = c("attribute"),
                                              attribute.value.column = "attrvalue",
                                              id.columns = c("pid"),
                                              attribute.table = iris_attribute_train,
                                              response.table = iris_response_train,
                                              response.column = "response",
                                              num.splits = 3,
                                              approx.splits = FALSE,
                                              nodesize = 10,
                                              max.depth = 10,
                                              split.measure = "gini"
                                              )
# Example 2 - Create decision tree by specifying only the "data" argument.
td_decision_tree_out2 <- td_decision_tree_mle(data = iris_altinput,
                                              attribute.name.columns = c("attribute"),
                                              attribute.value.column = "attrvalue",
                                              id.columns = c("pid"),
                                              response.column = "response",
                                              num.splits = 3,
                                              approx.splits = FALSE,
                                              nodesize = 10,
                                              max.depth = 10,
                                              split.measure = "gini"
                                              )
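The examples above do not exercise the "weighted" and "weight.column"
arguments. The following is a minimal sketch of a wrapper that builds a
weighted tree from attribute and response tables. The helper name
fit_weighted_tree and its weight_col parameter are hypothetical, and the
example response table is not known to contain a weight column, so the
caller must supply the name of a column that actually exists in the
response table. The column names "attribute", "attrvalue", "pid" and
"response" are taken from Example 1.

```r
# Hypothetical helper (not part of tdplyr): builds a weighted decision tree.
# attribute_tbl and response_tbl are tbl_teradata objects; weight_col names
# a response-table column that holds the weights.
fit_weighted_tree <- function(attribute_tbl, response_tbl, weight_col) {
  # The documentation requires "weight.column" whenever weighted = TRUE,
  # so reject a missing or empty column name before calling the function.
  stopifnot(is.character(weight_col), length(weight_col) == 1,
            nzchar(weight_col))
  tdplyr::td_decision_tree_mle(
    attribute.name.columns = c("attribute"),
    attribute.value.column = "attrvalue",
    id.columns = c("pid"),
    attribute.table = attribute_tbl,
    response.table = response_tbl,
    response.column = "response",
    weighted = TRUE,
    weight.column = weight_col,
    nodesize = 10,
    max.depth = 10
  )
}

# Usage, assuming a response table that has a weight column named "wt"
# (a hypothetical name):
#   weighted_out <- fit_weighted_tree(iris_attribute_train,
#                                     my_weighted_response, "wt")
#   weighted_out$model.table
```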