Teradata Package for R Function Reference | 17.00 - DecisionTree

DecisionTree

Description

The DecisionTree function creates a single decision tree in a distributed fashion, either weighted or unweighted. The model tbl_teradata that this function outputs can be input to the function DecisionTreePredict (td_decision_tree_predict_sqle) or DecisionTreePredict (td_decision_tree_predict_mle).

Usage

  td_decision_tree_mle (
      data = NULL,
      attribute.name.columns = NULL,
      attribute.value.column = NULL,
      id.columns = NULL,
      attribute.table = NULL,
      response.table = NULL,
      response.column = NULL,
      categorical.attribute.table = NULL,
      splits.table = NULL,
      split.value = NULL,
      num.splits = 10,
      approx.splits = TRUE,
      nodesize = 100,
      max.depth = 30,
      weighted = FALSE,
      weight.column = NULL,
      split.measure = "gini",
      output.response.probdist = FALSE,
      response.probdist.type = "Laplace",
      categorical.encoding = "graycode",
      attribute.table.sequence.column = NULL,
      data.sequence.column = NULL,
      categorical.attribute.table.sequence.column = NULL,
      response.table.sequence.column = NULL,
      splits.table.sequence.column = NULL
  )

Arguments

`data`	Optional Argument. Required if you omit "attribute.table" and "response.table" arguments. Specifies the name of the tbl_teradata that contains the input data set.
`attribute.name.columns`	Required Argument. Specifies the names of the attribute tbl_teradata columns that define the attribute. Types: character OR vector of Strings (character)
`attribute.value.column`	Required Argument. Specifies the name of the attribute tbl_teradata column that defines the value. Types: character
`id.columns`	Required Argument. Specifies the names of the columns in the response and attribute objects of class "tbl_teradata" that specify the ID of the instance. Types: character OR vector of Strings (character)
`attribute.table`	Optional Argument. Required if you omit "data" argument. Specifies the name of the tbl_teradata that contains the attribute names and the values.
`response.table`	Optional Argument. Required if you omit "data" argument. Specifies the name of the tbl_teradata that contains the response values.
`response.column`	Required Argument. Specifies the name of the response tbl_teradata column that contains the response variable. Types: character
`categorical.attribute.table`	Optional Argument. Specifies the name of the input tbl_teradata containing the categorical attributes.
`splits.table`	Optional Argument. Specifies the name of the input tbl_teradata that contains the user-specified splits. By default, the function creates new splits.
`split.value`	Optional Argument. If you specify the argument "splits.table", this argument specifies the name of the column that contains the split value. If "approx.splits" is TRUE, then the default value is splits_valcol; if not, then the default value is the "attribute.value.column" argument, node_column. Types: character
`num.splits`	Optional Argument. Specifies the number of splits to consider for each variable. The function does not consider all possible splits for all attributes. Default Value: 10 Types: integer
`approx.splits`	Optional Argument. Specifies whether to use approximate percentiles (TRUE) or exact percentiles (FALSE). Internally, the function uses percentile values as split values. Default Value: TRUE Types: logical
`nodesize`	Optional Argument. Specifies the decision tree stopping criterion and the minimum size of any particular node within each decision tree. Default Value: 100 Types: integer
`max.depth`	Optional Argument. Specifies a decision tree stopping criterion. If the tree reaches a depth past this value, the algorithm stops looking for splits. Decision trees can grow up to (2^(max_depth+1) - 1) nodes. This stopping criterion has the greatest effect on function performance. The maximum value is 60. Default Value: 30 Types: integer
`weighted`	Optional Argument. Specifies whether to build a weighted decision tree. If you specify "TRUE", then you must also specify the "weight.column" argument. Default Value: FALSE Types: logical
`weight.column`	Optional Argument. Specifies the name of the response tbl_teradata column that contains the weights of the attribute values. Types: character
`split.measure`	Optional Argument. Specifies the impurity measurement to use while constructing the decision tree. Default Value: "gini" Permitted Values: GINI, ENTROPY, CHISQUARE Types: character
`output.response.probdist`	Optional Argument. Specifies a flag to enable or disable output of probability distribution for output labels. Default Value: FALSE Types: logical. Note: This argument can accept input value TRUE only when tdplyr is connected to Vantage 1.0 Maintenance Update 2 version or later.
`response.probdist.type`	Optional Argument. Specifies the type of algorithm to use to generate output probability distribution for output labels. Uses one of Laplace, Frequency or RawCount to generate Probability Estimation Trees (PET) based distributions. Default Value: "Laplace" Permitted Values: Laplace, Frequency, RawCount Types: character. Note: This argument can only be used when "output.response.probdist" is set to TRUE.
`categorical.encoding`	Optional Argument. Specifies which encoding method is used for categorical variables. Default Value: "graycode" Permitted Values: graycode, hashing Types: character Note: This argument is supported when tdplyr is connected to Vantage 1.1 or later versions.
`attribute.table.sequence.column`	Optional Argument. Specifies the vector of column(s) that uniquely identifies each row of the input argument "attribute.table". The argument is used to ensure deterministic results for functions which produce results that vary from run to run. Types: character OR vector of Strings (character)
`data.sequence.column`	Optional Argument. Specifies the vector of column(s) that uniquely identifies each row of the input argument "data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run. Types: character OR vector of Strings (character)
`categorical.attribute.table.sequence.column`	Optional Argument. Specifies the vector of column(s) that uniquely identifies each row of the input argument "categorical.attribute.table". The argument is used to ensure deterministic results for functions which produce results that vary from run to run. Types: character OR vector of Strings (character)
`response.table.sequence.column`	Optional Argument. Specifies the vector of column(s) that uniquely identifies each row of the input argument "response.table". The argument is used to ensure deterministic results for functions which produce results that vary from run to run. Types: character OR vector of Strings (character)
`splits.table.sequence.column`	Optional Argument. Specifies the vector of column(s) that uniquely identifies each row of the input argument "splits.table". The argument is used to ensure deterministic results for functions which produce results that vary from run to run. Types: character OR vector of Strings (character)
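
The arithmetic behind "max.depth" and "split.measure" can be illustrated in plain R, without a Vantage connection. The sketch below is not part of tdplyr: the helper names (max_tree_nodes, gini_impurity, entropy_impurity) are hypothetical, the node bound uses the 2^(max_depth+1) - 1 formula quoted for "max.depth", and the impurity functions follow the standard textbook definitions of Gini impurity and entropy, which are assumed here rather than taken from the function's internals.

```r
# Upper bound on the number of nodes in a binary tree of the given depth,
# using the 2^(max_depth + 1) - 1 formula from the max.depth description.
max_tree_nodes <- function(max_depth) {
  2^(max_depth + 1) - 1
}

# Gini impurity of a node, given a vector of per-class row counts.
# Textbook definition (assumption): 1 - sum of squared class proportions.
gini_impurity <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Shannon entropy (in bits) of a node, given per-class row counts.
# Classes with a zero count are dropped so that 0 * log2(0) is never evaluated.
entropy_impurity <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log2(p))
}

max_tree_nodes(10)            # bound for max.depth = 10
max_tree_nodes(30)            # bound for the default max.depth = 30
gini_impurity(c(50, 50))      # evenly mixed two-class node
entropy_impurity(c(50, 50))   # same node, measured in bits
gini_impurity(c(100, 0))      # pure node
```

The node bound doubles with every extra level (2047 nodes at depth 10, over two billion at the default depth of 30), which is consistent with "max.depth" being described as the stopping criterion with the greatest effect on performance.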

Value

The function returns an object of class "td_decision_tree_mle", which is a named list containing Teradata tbl objects. Named list members can be referenced directly with the "$" operator using the following names:

model.table
intermediate.splits.table
final.response.tableto
output

Examples

  
    # Get the current context/connection
    con <- td_get_context()$connection

    # Load example data.
    loadExampleData("decision_tree_example", "iris_attribute_train", "iris_response_train",
                    "iris_altinput")

    # Create object(s) of class "tbl_teradata".
    iris_attribute_train <- tbl(con, "iris_attribute_train")
    iris_response_train <- tbl(con, "iris_response_train")
    iris_altinput <- tbl(con, "iris_altinput")

    # Example 1 - Create decision tree by specifying attribute and response tables.
    td_decision_tree_out1 <- td_decision_tree_mle(attribute.name.columns = c("attribute"),
                                                  attribute.value.column = "attrvalue",
                                                  id.columns = c("pid"),
                                                  attribute.table = iris_attribute_train,
                                                  response.table = iris_response_train,
                                                  response.column = "response",
                                                  num.splits = 3,
                                                  approx.splits = FALSE,
                                                  nodesize = 10,
                                                  max.depth = 10,
                                                  split.measure = "gini"
                                                  )

    # Example 2 - Create decision tree by specifying only the "data" argument.
    td_decision_tree_out2 <- td_decision_tree_mle(data = iris_altinput,
                                                  attribute.name.columns = c("attribute"),
                                                  attribute.value.column = "attrvalue",
                                                  id.columns = c("pid"),
                                                  response.column = "response",
                                                  num.splits = 3,
                                                  approx.splits = FALSE,
                                                  nodesize = 10,
                                                  max.depth = 10,
                                                  split.measure = "gini"
                                                  )
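
The Value section states that the result is a named list whose members can be reached with the "$" operator. As a minimal sketch of inspecting such a result, the helper below walks over the member names and prints each one. The function show_members is hypothetical (it is not part of tdplyr) and assumes only that the result behaves like an ordinary named R list; the commented lines at the end assume the object created in Example 1 and a live Vantage connection.

```r
# Hypothetical helper (not part of tdplyr): print every member of a
# named-list result, such as the object returned by td_decision_tree_mle.
show_members <- function(result) {
  for (member in names(result)) {
    cat("----", member, "----\n")
    print(result[[member]])
  }
  # Return the member names invisibly so callers can reuse them.
  invisible(names(result))
}

# With the object from Example 1 (requires a live Vantage connection):
# show_members(td_decision_tree_out1)
# td_decision_tree_out1$model.table
```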