Teradata R Package Function Reference | 17.00 - DecisionTree - Teradata R Package

Teradata® R Package Function Reference

prodname: Teradata R Package
vrm_release: 17.00
created_date: September 2020
category: Programming Reference
featnum: B700-4007-090K

Description

The DecisionTree function creates a single decision tree in a distributed fashion, either weighted or unweighted. The model tbl_teradata that this function outputs can be input to the function DecisionTreePredict (td_decision_tree_predict_sqle) or DecisionTreePredict (td_decision_tree_predict_mle).

Usage

  td_decision_tree_mle (
      data = NULL,
      attribute.name.columns = NULL,
      attribute.value.column = NULL,
      id.columns = NULL,
      attribute.table = NULL,
      response.table = NULL,
      response.column = NULL,
      categorical.attribute.table = NULL,
      splits.table = NULL,
      split.value = NULL,
      num.splits = 10,
      approx.splits = TRUE,
      nodesize = 100,
      max.depth = 30,
      weighted = FALSE,
      weight.column = NULL,
      split.measure = "gini",
      output.response.probdist = FALSE,
      response.probdist.type = "Laplace",
      categorical.encoding = "graycode",
      attribute.table.sequence.column = NULL,
      data.sequence.column = NULL,
      categorical.attribute.table.sequence.column = NULL,
      response.table.sequence.column = NULL,
      splits.table.sequence.column = NULL
  )

Arguments

data

Optional Argument. Required if you omit "attribute.table" and "response.table" arguments.
Specifies the name of the tbl_teradata that contains the input data set.

attribute.name.columns

Required Argument.
Specifies the names of the attribute tbl_teradata columns that define the attribute.
Types: character OR vector of Strings (character)

attribute.value.column

Required Argument.
Specifies the name of the attribute tbl_teradata column that defines the value.
Types: character

id.columns

Required Argument.
Specifies the names of the columns in the response and attribute objects of class "tbl_teradata" that specify the ID of the instance.
Types: character OR vector of Strings (character)
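
The three arguments above describe a "long" layout in which each row holds one attribute of one instance. The following is a minimal sketch in plain R of how a wide data frame could be reshaped into that layout before it is copied to Vantage. The column names "pid", "attribute" and "attrvalue" follow the Examples section; the measurement values and the wide column names are made up for illustration, and the assumption that the attribute tbl_teradata uses exactly this shape is inferred from the argument descriptions.

```r
# Illustrative wide data: one row per instance, one column per attribute.
# The values are made up for this sketch.
wide <- data.frame(
  pid = 1:2,
  sepal_length = c(5.1, 4.9),
  petal_length = c(1.4, 1.3)
)

# Every column except the ID column is treated as an attribute.
value_cols <- setdiff(names(wide), "pid")

# Build one (pid, attribute, attrvalue) block per attribute and stack them.
long <- do.call(rbind, lapply(value_cols, function(col) {
  data.frame(
    pid = wide$pid,
    attribute = col,
    attrvalue = wide[[col]],
    stringsAsFactors = FALSE
  )
}))

# Sort by instance and attribute name for readability.
long <- long[order(long$pid, long$attribute), ]
rownames(long) <- NULL
print(long)
```

With this layout, "attribute" would be passed as "attribute.name.columns", "attrvalue" as "attribute.value.column", and "pid" as "id.columns".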

attribute.table

Optional Argument. Required if you omit "data" argument.
Specifies the name of the tbl_teradata that contains the attribute names and the values.

response.table

Optional Argument. Required if you omit "data" argument.
Specifies the name of the tbl_teradata that contains the response values.

response.column

Required Argument.
Specifies the name of the response tbl_teradata column that contains the response variable.
Types: character

categorical.attribute.table

Optional Argument.
Specifies the name of the input tbl_teradata containing the categorical attributes.

splits.table

Optional Argument.
Specifies the name of the input tbl_teradata that contains the user-specified splits. By default, the function creates new splits.

split.value

Optional Argument.
If you specify the argument "splits.table", this argument specifies the name of the column that contains the split value. If "approx.splits" is TRUE, then the default value is splits_valcol; if not, then the default value is the "attribute.value.column" argument, node_column.
Types: character

num.splits

Optional Argument.
Specifies the number of splits to consider for each variable. The function does not consider all possible splits for all attributes.
Default Value: 10
Types: integer

approx.splits

Optional Argument.
Specifies whether to use approximate percentiles (TRUE) or exact percentiles (FALSE). Internally, the function uses percentile values as split values.
Default Value: TRUE
Types: logical
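
As an illustration of percentile-based split values, the sketch below computes candidate thresholds for one numeric attribute in plain R. It assumes evenly spaced percentiles, which this page does not confirm. The helper name candidate_splits and the sample values are hypothetical. R's quantile() computes exact sample quantiles, so the sketch corresponds to the exact case only and does not reproduce the approximate method.

```r
# Hypothetical helper (not part of tdplyr).
# Assumption: candidate thresholds are evenly spaced percentiles of the
# observed values; the function's actual percentile placement may differ.
candidate_splits <- function(values, num.splits = 10) {
  probs <- seq_len(num.splits) / (num.splits + 1)
  # unique() drops repeated thresholds that occur when values are tied.
  unique(quantile(values, probs = probs, names = FALSE))
}

# Made-up attribute values for illustration.
petal_length <- c(1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9)

# With num.splits = 3, at most three thresholds are considered.
splits <- candidate_splits(petal_length, num.splits = 3)
print(splits)
```

A smaller "num.splits" means fewer thresholds to evaluate per attribute, at the cost of a coarser search.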

nodesize

Optional Argument.
Specifies a decision tree stopping criterion: the minimum size of any node within the decision tree.
Default Value: 100
Types: integer

max.depth

Optional Argument.
Specifies a decision tree stopping criterion. If the tree reaches a depth past this value, the algorithm stops looking for splits. Decision trees can grow up to 2^(max.depth + 1) - 1 nodes. This stopping criterion has the greatest effect on function performance. The maximum value is 60.
Default Value: 30
Types: integer
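
To see why this argument dominates performance, the node bound can be evaluated directly, reading it as 2 raised to the power (max.depth + 1), minus 1. The following is a small sketch in plain R; the helper name max_tree_nodes is hypothetical and not part of the package.

```r
# Hypothetical helper (not part of tdplyr): upper bound on the number of
# nodes in a binary tree of the given depth, 2^(max.depth + 1) - 1.
max_tree_nodes <- function(max.depth) {
  stopifnot(is.numeric(max.depth), max.depth >= 0)
  2^(max.depth + 1) - 1
}

# Bound for a few depths, including the default (30) and the maximum (60).
depths <- c(2, 10, 30, 60)
bounds <- vapply(depths, max_tree_nodes, numeric(1))
print(data.frame(max.depth = depths, max.nodes = bounds))
```

The bound doubles with each extra level, so lowering "max.depth" is usually the first thing to try when a run takes too long.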

weighted

Optional Argument.
Specifies whether to build a weighted decision tree. If you specify TRUE, then you must also specify the "weight.column" argument.
Default Value: FALSE
Types: logical

weight.column

Optional Argument.
Specifies the name of the response tbl_teradata column that contains the weights of the attribute values.
Types: character

split.measure

Optional Argument.
Specifies the impurity measurement to use while constructing the decision tree.
Default Value: "gini"
Permitted Values: GINI, ENTROPY, CHISQUARE
Types: character
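
For readers unfamiliar with impurity measures, the sketch below computes the textbook Gini and entropy impurity of a set of class labels in plain R. It assumes the usual definitions; this page does not document how the function computes them internally, and the chi-square measure is not shown. The helper names and the labels are illustrative only.

```r
# Textbook impurity measures for a vector of class labels.
# Assumption: these are the usual definitions, not necessarily the exact
# computation performed by the function.
gini_impurity <- function(labels) {
  p <- as.numeric(table(labels)) / length(labels)
  1 - sum(p^2)
}

entropy_impurity <- function(labels) {
  p <- as.numeric(table(labels)) / length(labels)
  p <- p[p > 0]  # drop empty classes (possible with factor input)
  -sum(p * log2(p))
}

# A node with two classes in equal proportion is maximally impure.
mixed <- c("setosa", "setosa", "versicolor", "versicolor")
# A node with a single class is pure.
pure <- c("setosa", "setosa", "setosa")

print(c(gini = gini_impurity(mixed), entropy = entropy_impurity(mixed)))
print(c(gini = gini_impurity(pure), entropy = entropy_impurity(pure)))
```

Both measures are zero for a pure node and grow as the classes become more evenly mixed; a split is attractive when it lowers the impurity of the resulting child nodes.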

output.response.probdist

Optional Argument.
Specifies a flag to enable or disable output of probability distribution for output labels.
Default Value: FALSE
Types: logical
Note: This argument can accept input value TRUE only when tdplyr is connected to Vantage 1.0 Maintenance Update 2 version or later.

response.probdist.type

Optional Argument.
Specifies the type of algorithm to use to generate output probability distribution for output labels. Uses one of Laplace, Frequency or RawCount to generate Probability Estimation Trees (PET) based distributions.
Default Value: "Laplace"
Permitted Values: Laplace, Frequency, RawCount
Types: character
Note: This argument can only be used when "output.response.probdist" is set to TRUE.
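
The difference between the three permitted values can be sketched in plain R for a single leaf. The formulas below are the standard estimates from the Probability Estimation Trees literature (add-one smoothing for Laplace, relative frequency for Frequency, unnormalized counts for RawCount). Whether the function uses exactly these formulas is an assumption, and the helper name leaf_distribution and the class counts are made up for illustration.

```r
# Hypothetical helper (not part of tdplyr).
# `counts` holds the number of training rows of each class in one leaf.
# Assumption: standard PET leaf estimates.
leaf_distribution <- function(counts,
                              type = c("Laplace", "Frequency", "RawCount")) {
  type <- match.arg(type)
  n <- sum(counts)     # rows in the leaf
  k <- length(counts)  # number of classes
  switch(type,
    Laplace   = (counts + 1) / (n + k),  # add-one smoothing
    Frequency = counts / n,              # relative frequency
    RawCount  = counts                   # unnormalized counts
  )
}

# Made-up class counts for one leaf.
counts <- c(setosa = 3, versicolor = 1, virginica = 0)

print(leaf_distribution(counts, "Laplace"))
print(leaf_distribution(counts, "Frequency"))
print(leaf_distribution(counts, "RawCount"))
```

Under these definitions, Laplace never assigns a probability of exactly zero to a class, which matters when a leaf contains only a few training rows.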

categorical.encoding

Optional Argument.
Specifies which encoding method is used for categorical variables.
Default Value: "graycode"
Permitted Values: graycode, hashing
Types: character
Note: This argument is supported when tdplyr is connected to Vantage 1.1 or later versions.

attribute.table.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "attribute.table". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

categorical.attribute.table.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "categorical.attribute.table". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

response.table.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "response.table". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

splits.table.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "splits.table". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

Value

The function returns an object of class "td_decision_tree_mle", which is a named list containing Teradata tbl objects. Named list members can be referenced directly with the "$" operator using the following names:

  1. model.table

  2. intermediate.splits.table

  3. final.response.tableto

  4. output

Examples

    # Get the current context/connection
    con <- td_get_context()$connection

    # Load example data.
    loadExampleData("decision_tree_example", "iris_attribute_train", "iris_response_train", "iris_altinput")
    
    # Create object(s) of class "tbl_teradata".
    iris_attribute_train <- tbl(con, "iris_attribute_train")
    iris_response_train <- tbl(con, "iris_response_train")
    iris_altinput <- tbl(con, "iris_altinput")
    
    # Example 1 - Create decision tree by specifying attribute and response tables.
    td_decision_tree_out1 <- td_decision_tree_mle(attribute.name.columns = c("attribute"),
                                                  attribute.value.column = "attrvalue",
                                                  id.columns = c("pid"),
                                                  attribute.table = iris_attribute_train,
                                                  response.table = iris_response_train,
                                                  response.column = "response",
                                                  num.splits = 3,
                                                  approx.splits = FALSE,
                                                  nodesize = 10,
                                                  max.depth = 10,
                                                  split.measure = "gini"
                                                  )
    
    # Example 2 - Create decision tree by specifying only the "data" argument.
    td_decision_tree_out2 <- td_decision_tree_mle(data = iris_altinput,
                                                  attribute.name.columns = c("attribute"),
                                                  attribute.value.column = "attrvalue",
                                                  id.columns = c("pid"),
                                                  response.column = "response",
                                                  num.splits = 3,
                                                  approx.splits = FALSE,
                                                  nodesize = 10,
                                                  max.depth = 10,
                                                  split.measure = "gini"
                                                  )