The DecisionTree function creates a single decision tree in a distributed fashion, either weighted or unweighted. The model tbl_teradata that this function outputs can be input to the function DecisionTreePredict (td_decision_tree_predict_sqle) or DecisionTreePredict (td_decision_tree_predict_mle).


  td_decision_tree_mle (
      data = NULL,
      attribute.name.columns = NULL,
      attribute.value.column = NULL,
      id.columns = NULL,
      attribute.table = NULL,
      response.table = NULL,
      response.column = NULL,
      categorical.attribute.table = NULL,
      splits.table = NULL,
      split.value = NULL,
      num.splits = 10,
      approx.splits = TRUE,
      nodesize = 100,
      max.depth = 30,
      weighted = FALSE,
      weight.column = NULL,
      split.measure = "gini",
      output.response.probdist = FALSE,
      response.probdist.type = "Laplace",
      categorical.encoding = "graycode",
      attribute.table.sequence.column = NULL,
      data.sequence.column = NULL,
      categorical.attribute.table.sequence.column = NULL,
      response.table.sequence.column = NULL,
      splits.table.sequence.column = NULL
  )



Optional Argument. Required if you omit the "attribute.table" and "response.table" arguments.
Specifies the name of the tbl_teradata that contains the input data set.


Required Argument.
Specifies the names of the attribute tbl_teradata columns that define the attribute.
Types: character OR vector of Strings (character)


Required Argument.
Specifies the name of the attribute tbl_teradata column that defines the value.
Types: character


Required Argument.
Specifies the names of the columns in the response and attribute objects of class "tbl_teradata" that specify the ID of the instance.
Types: character OR vector of Strings (character)


Optional Argument. Required if you omit the "data" argument.
Specifies the name of the tbl_teradata that contains the attribute names and the values.


Optional Argument. Required if you omit the "data" argument.
Specifies the name of the tbl_teradata that contains the response values.


Required Argument.
Specifies the name of the response tbl_teradata column that contains the response variable.
Types: character
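
The attribute and response inputs use a long layout: one row per instance and attribute in the attribute table, and one row per instance in the response table. The sketch below builds that shape locally from R's built-in iris data frame, purely as an illustration. The column names (pid, attribute, attrvalue, response) follow the examples on this page; the local data frames are hypothetical stand-ins and are not the Vantage example tables.

```r
# Hypothetical local illustration of the long input layout.
# The real example tables live in Vantage; this uses R's built-in iris data.
wide <- data.frame(pid = seq_len(nrow(iris)), iris)

measure_cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# Attribute table: one row per (pid, attribute) pair.
#   id.columns             -> "pid"
#   attribute.name.columns -> "attribute"
#   attribute.value.column -> "attrvalue"
attribute_tbl <- do.call(rbind, lapply(measure_cols, function(col) {
  data.frame(pid = wide$pid,
             attribute = col,
             attrvalue = wide[[col]],
             stringsAsFactors = FALSE)
}))

# Response table: one row per pid.
#   response.column -> "response"
response_tbl <- data.frame(pid = wide$pid,
                           response = as.character(wide$Species),
                           stringsAsFactors = FALSE)

# Show the four attribute rows that belong to the first instance.
print(attribute_tbl[attribute_tbl$pid == 1, ])
```

Each of the 150 instances contributes four attribute rows, so the attribute table has 600 rows while the response table has 150.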


Optional Argument.
Specifies the name of the input tbl_teradata containing the categorical attributes.


Optional Argument.
Specifies the name of the input tbl_teradata that contains the user-specified splits. By default, the function creates new splits.


Optional Argument.
If you specify the argument "splits.table", this argument specifies the name of the column that contains the split value. If "approx.splits" is TRUE, then the default value is splits_valcol; if not, then the default value is the "attribute.value.column" argument, node_column.
Types: character


Optional Argument.
Specifies the number of splits to consider for each variable. The function does not consider all possible splits for all attributes.
Default Value: 10
Types: integer


Optional Argument.
Specifies whether to use approximate percentiles (TRUE) or exact percentiles (FALSE). Internally, the function uses percentile values as split values.
Default Value: TRUE
Types: logical
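
As a rough local illustration of percentile-based split candidates, the sketch below picks evenly spaced percentiles of one numeric attribute. The helper candidate_splits is hypothetical, and the choice of evenly spaced exact percentiles is an assumption; this page does not state which percentiles the function uses internally.

```r
# Hypothetical helper: candidate split values taken as evenly spaced
# percentiles of a numeric attribute. This is an assumption about the
# general idea, not the function's actual internal algorithm.
candidate_splits <- function(x, num.splits = 10) {
  probs <- seq_len(num.splits) / (num.splits + 1)
  unique(unname(quantile(x, probs = probs)))
}

petal_length <- iris$Petal.Length

# With num.splits = 3 the candidates are the 25th, 50th and 75th percentiles.
splits <- candidate_splits(petal_length, num.splits = 3)
print(splits)
```

Considering only a handful of percentile values per attribute keeps the number of evaluated splits small, which is why the function does not consider all possible splits.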


Optional Argument.
Specifies the decision tree stopping criterion and the minimum size of any particular node within each decision tree.
Default Value: 100
Types: integer


Optional Argument.
Specifies a decision tree stopping criterion. If the tree reaches a depth past this value, the algorithm stops looking for splits. Decision trees can grow up to (2^(max_depth+1) - 1) nodes. This stopping criterion has the greatest effect on function performance. The maximum value is 60.
Default Value: 30
Types: integer
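
To see how quickly the node bound grows with depth, the following sketch evaluates 2^(max_depth+1) - 1 for a few depths. The helper name max_nodes is hypothetical and exists only for this illustration.

```r
# Upper bound on the number of nodes in a binary tree of a given depth.
# max_nodes is a hypothetical helper, not part of tdplyr.
max_nodes <- function(max.depth) 2^(max.depth + 1) - 1

# Depth 3 allows at most 15 nodes, depth 10 at most 2047.
bounds <- max_nodes(c(3, 10, 30))
print(format(bounds, big.mark = ","))
```

At the default depth of 30 the bound already exceeds two billion nodes, which is why lowering this argument tends to have a large effect on run time.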


Optional Argument.
Specifies whether to build a weighted decision tree. If you specify "TRUE", then you must also specify the "weight.column" argument.
Default Value: FALSE
Types: logical


Optional Argument.
Specifies the name of the response tbl_teradata column that contains the weights of the attribute values.
Types: character


Optional Argument.
Specifies the impurity measurement to use while constructing the decision tree.
Default Value: "gini"
Types: character
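
For readers unfamiliar with the default measure, the sketch below computes the textbook Gini impurity of a set of labels and the size-weighted impurity after a threshold split, using R's built-in iris data. The helpers gini and split_gini are hypothetical, and whether the function evaluates splits in exactly this way is an assumption.

```r
# Textbook Gini impurity: 1 - sum(p_k^2) over the class proportions p_k.
# Hypothetical helper for illustration; not part of tdplyr.
gini <- function(labels) {
  if (length(labels) == 0) return(0)
  p <- as.numeric(table(labels)) / length(labels)
  1 - sum(p^2)
}

# Size-weighted impurity of the two children produced by a threshold split.
split_gini <- function(x, labels, threshold) {
  left  <- labels[x <= threshold]
  right <- labels[x > threshold]
  (length(left) * gini(left) + length(right) * gini(right)) / length(labels)
}

# Three balanced classes give an impurity of 2/3 before any split.
before <- gini(iris$Species)

# Splitting on Petal.Length at 2.45 isolates one class completely.
after <- split_gini(iris$Petal.Length, iris$Species, 2.45)

print(c(before = before, after = after))
```

A split is attractive when the weighted impurity of the children is clearly lower than the impurity of the parent node.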


Optional Argument.
Specifies a flag to enable or disable output of probability distribution for output labels.
Default Value: FALSE
Types: logical
Note: This argument can accept input value TRUE only when tdplyr is connected to Vantage 1.0 Maintenance Update 2 version or later.


Optional Argument.
Specifies the type of algorithm to use to generate output probability distribution for output labels. Uses one of Laplace, Frequency or RawCount to generate Probability Estimation Trees (PET) based distributions.
Default Value: "Laplace"
Permitted Values: Laplace, Frequency, RawCount
Types: character
Note: This argument can only be used when "output.response.probdist" is set to TRUE.
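
The three permitted values can be illustrated with the usual textbook definitions for Probability Estimation Trees, applied to the class counts in a single leaf. The leaf counts below are made up for the illustration, and the formulas are the common definitions; whether the function uses exactly these formulas is an assumption.

```r
# Made-up class counts for one leaf node (illustration only).
leaf_counts <- c(setosa = 0, versicolor = 8, virginica = 2)

n <- sum(leaf_counts)     # instances in the leaf
k <- length(leaf_counts)  # number of classes

# RawCount: the counts themselves.
raw_count <- leaf_counts

# Frequency: counts divided by the leaf size.
frequency <- leaf_counts / n

# Laplace: add-one smoothing, (count + 1) / (n + k), so no class gets 0.
laplace <- (leaf_counts + 1) / (n + k)

print(rbind(raw_count, frequency, laplace))
```

Laplace smoothing pulls estimates from small leaves toward a uniform distribution, which avoids probabilities of exactly 0 or 1 for classes based on only a few instances.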


Optional Argument.
Specifies which encoding method is used for categorical variables.
Default Value: "graycode"
Permitted Values: graycode, hashing
Types: character
Note: This argument is supported when tdplyr is connected to Vantage 1.1 or later versions.


Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "attribute.table". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)


Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)


Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "categorical.attribute.table". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)


Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "response.table". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)


Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "splits.table". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)


Function returns an object of class "td_decision_tree_mle", which is a named list containing Teradata tbl objects. Named list members can be referenced directly with the "$" operator using the following names:

  1. model.table

  2. intermediate.splits.table

  3. final.response.tableto

  4. output


    # Get the current context/connection
    con <- td_get_context()$connection

    # Load example data.
    loadExampleData("decision_tree_example", "iris_attribute_train", "iris_response_train",
                    "iris_altinput")

    # Create object(s) of class "tbl_teradata".
    iris_attribute_train <- tbl(con, "iris_attribute_train")
    iris_response_train <- tbl(con, "iris_response_train")
    iris_altinput <- tbl(con, "iris_altinput")

    # Example 1 - Create decision tree by specifying attribute and response tables.
    td_decision_tree_out1 <- td_decision_tree_mle(attribute.name.columns = c("attribute"),
                                                  attribute.value.column = "attrvalue",
                                                  id.columns = c("pid"),
                                                  attribute.table = iris_attribute_train,
                                                  response.table = iris_response_train,
                                                  response.column = "response",
                                                  num.splits = 3,
                                                  approx.splits = FALSE,
                                                  nodesize = 10,
                                                  max.depth = 10,
                                                  split.measure = "gini")

    # Example 2 - Create decision tree by specifying only the "data" argument.
    td_decision_tree_out2 <- td_decision_tree_mle(data = iris_altinput,
                                                  attribute.name.columns = c("attribute"),
                                                  attribute.value.column = "attrvalue",
                                                  id.columns = c("pid"),
                                                  response.column = "response",
                                                  num.splits = 3,
                                                  approx.splits = FALSE,
                                                  nodesize = 10,
                                                  max.depth = 10,
                                                  split.measure = "gini")