Teradata R Package Function Reference - DecisionTree - Teradata R Package - Look here for syntax, methods and examples for the functions included in the Teradata R Package.

Teradata® R Package Function Reference

Product
Teradata R Package
Release Number
16.20
Published
February 2020
Language
English (United States)
Last Update
2020-02-28
dita:id
B700-4007
lifecycle
previous
Product Category
Teradata Vantage

Description

The Decision Tree (td_decision_tree_mle) function builds a single decision tree, either weighted or unweighted, in a distributed fashion. The model table that this function outputs can be passed as input to the Decision Tree Predict (td_decision_tree_predict_sqle) function.

Usage

  td_decision_tree_mle (
      data = NULL,
      attribute.name.columns = NULL,
      attribute.value.column = NULL,
      id.columns = NULL,
      attribute.table = NULL,
      response.table = NULL,
      response.column = NULL,
      categorical.attribute.table = NULL,
      splits.table = NULL,
      split.value = NULL,
      num.splits = 10,
      approx.splits = TRUE,
      nodesize = 100,
      max.depth = 30,
      weighted = FALSE,
      weight.column = NULL,
      split.measure = "gini",
      output.response.probdist = FALSE,
      response.probdist.type = "Laplace",
      categorical.encoding = "graycode",
      attribute.table.sequence.column = NULL,
      data.sequence.column = NULL,
      categorical.attribute.table.sequence.column = NULL,
      response.table.sequence.column = NULL,
      splits.table.sequence.column = NULL
  )

Arguments

data

Optional Argument. Required if you omit the "attribute.table" and "response.table" arguments.
Specifies the name of the tbl_teradata that contains the input data set.

attribute.name.columns

Required Argument.
Specifies the names of the attribute tbl_teradata columns that define the attribute.
Types: character OR vector of Strings (character)

attribute.value.column

Required Argument.
Specifies the name of the attribute tbl_teradata column that defines the value.
Types: character

id.columns

Required Argument.
Specifies the names of the columns in the response and attribute tbl_teradata objects that specify the ID of the instance.
Types: character OR vector of Strings (character)

attribute.table

Optional Argument. Required if you omit the "data" argument.
Specifies the name of the tbl_teradata that contains the attribute names and the values.

response.table

Optional Argument. Required if you omit the "data" argument.
Specifies the name of the tbl_teradata that contains the response values.

response.column

Required Argument.
Specifies the name of the response tbl_teradata column that contains the response variable.
Types: character
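
The attribute and response arguments above describe a "long" layout: one row per instance and attribute in the attribute table, and one row per instance in the response table. The following is a minimal local sketch of that layout using plain data frames rather than tbl_teradata objects. The column names (pid, attribute, attrvalue, response) are taken from the Examples section; the sample values and the exact contents of the example tables are assumptions made for illustration.

```r
# Local illustration only: plain data frames, not tbl_teradata objects.
# Sample values are made up; column names follow the Examples section.
wide <- data.frame(
  pid          = 1:3,
  sepal_length = c(5.1, 7.0, 6.3),
  petal_length = c(1.4, 4.7, 6.0),
  response     = c("setosa", "versicolor", "virginica"),
  stringsAsFactors = FALSE
)

attr_cols <- c("sepal_length", "petal_length")

# Attribute table: one row per (instance, attribute) pair.
#   id.columns             -> "pid"
#   attribute.name.columns -> "attribute"
#   attribute.value.column -> "attrvalue"
attribute_table <- do.call(rbind, lapply(attr_cols, function(a) {
  data.frame(pid = wide$pid, attribute = a, attrvalue = wide[[a]],
             stringsAsFactors = FALSE)
}))

# Response table: one row per instance.
#   id.columns      -> "pid"
#   response.column -> "response"
response_table <- wide[, c("pid", "response")]

print(attribute_table)
print(response_table)
```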

categorical.attribute.table

Optional Argument.
Specifies the name of the input tbl_teradata that contains categorical attributes.

splits.table

Optional Argument.
Specifies the name of the input tbl_teradata that contains the user-specified splits. By default, the function creates new splits.

split.value

Optional Argument.
If you specify the "splits.table" argument, this argument specifies the name of the column that contains the split value. If "approx.splits" is TRUE, the default value is splits_valcol; otherwise, the default value is the value of the "attribute.value.column" argument (node_column).
Types: character

num.splits

Optional Argument.
Specifies the number of splits to consider for each variable. The function does not consider all possible splits for all attributes.
Default Value: 10
Types: numeric

approx.splits

Optional Argument.
Specifies whether to use approximate percentiles (TRUE) or exact percentiles (FALSE).
Default Value: TRUE
Types: logical
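
To make the idea of percentile-based candidate splits concrete, the sketch below picks "num.splits" thresholds for one numeric attribute at evenly spaced exact percentiles. This is an assumption about the general technique, not a description of how the function chooses its percentiles internally; the sample values are made up.

```r
# Illustration only: evenly spaced exact percentiles as candidate thresholds.
# How td_decision_tree_mle positions its percentiles is not documented here.
values <- c(4.3, 4.9, 5.0, 5.4, 5.8, 6.1, 6.4, 6.9, 7.7)
num_splits <- 3

# Interior probabilities: 0.25, 0.50, 0.75 for three splits.
probs <- seq_len(num_splits) / (num_splits + 1)

# Exact percentiles (base R default, type 7).
candidates <- quantile(values, probs = probs, names = FALSE)
print(candidates)
```

Only these candidate thresholds would be evaluated for the attribute, rather than every distinct value, which is why a small "num.splits" reduces the search effort.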

nodesize

Optional Argument.
Specifies the decision tree stopping criterion and the minimum size of any particular node within each decision tree.
Default Value: 100
Types: numeric

max.depth

Optional Argument.
Specifies a decision tree stopping criterion. If the tree reaches a depth past this value, the algorithm stops looking for splits. Decision trees can grow up to (2^(max_depth+1) - 1) nodes. This stopping criterion has the greatest effect on function performance. The maximum value is 60.
Default Value: 30
Types: numeric
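
The node bound for a binary tree of depth max.depth is 2^(max.depth + 1) - 1. The short sketch below evaluates that bound for a few depths to show how quickly it grows; the helper name max_nodes is hypothetical and not part of the package.

```r
# Upper bound on the number of nodes in a binary tree of the given depth.
# max_nodes is a hypothetical helper for illustration only.
max_nodes <- function(max_depth) 2^(max_depth + 1) - 1

depths <- c(3, 10, 30)
bounds <- max_nodes(depths)
print(data.frame(max.depth = depths, max.nodes = bounds))
```

With the default of 30, the bound already exceeds two billion nodes, so lowering "max.depth" is usually the first setting to try when training is slow.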

weighted

Optional Argument.
Specifies whether to build a weighted decision tree. If you specify "TRUE", then you must also specify the "weight.column" argument.
Default Value: FALSE
Types: logical

weight.column

Optional Argument.
Specifies the name of the response table column that contains the weights of the attribute values.
Types: character
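
The dependencies between arguments described so far can be summarized as a pre-flight check: supply either "data" or both "attribute.table" and "response.table", and supply "weight.column" whenever "weighted" is TRUE. The helper below is a hypothetical sketch of such a check, not a function from the package; plain strings stand in for tbl_teradata objects.

```r
# Hypothetical helper (not part of tdplyr): validates the argument
# combinations described in this section before calling the real function.
check_tree_inputs <- function(data = NULL, attribute.table = NULL,
                              response.table = NULL, weighted = FALSE,
                              weight.column = NULL) {
  has_pair <- !is.null(attribute.table) && !is.null(response.table)
  if (is.null(data) && !has_pair) {
    stop('Supply "data", or both "attribute.table" and "response.table".')
  }
  if (isTRUE(weighted) && is.null(weight.column)) {
    stop('"weighted = TRUE" requires "weight.column".')
  }
  invisible(TRUE)
}

# Passes: a single input data set.
check_tree_inputs(data = "iris_altinput")

# Passes: separate attribute and response tables.
check_tree_inputs(attribute.table = "iris_attribute_train",
                  response.table = "iris_response_train")
```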

split.measure

Optional Argument.
Specifies the impurity measurement to use while constructing the decision tree.
Default Value: "gini"
Permitted Values: GINI, ENTROPY, CHISQUARE
Types: character
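
For reference, the sketch below computes the textbook Gini and entropy impurity of a set of class labels. These are the standard definitions, shown to clarify what an impurity measure quantifies; the function's internal computation is not shown in this reference, and the chi-square measure is omitted here.

```r
# Textbook impurity measures for a vector of class labels (illustration only).
gini <- function(labels) {
  p <- as.numeric(table(labels)) / length(labels)  # class proportions
  1 - sum(p^2)
}

entropy <- function(labels) {
  p <- as.numeric(table(labels)) / length(labels)
  p <- p[p > 0]                                    # avoid log2(0)
  -sum(p * log2(p))
}

mixed <- c("a", "a", "b", "b")   # evenly mixed node
pure  <- c("a", "a", "a", "a")   # pure node

print(c(gini = gini(mixed), entropy = entropy(mixed)))
print(c(gini = gini(pure),  entropy = entropy(pure)))
```

Both measures are zero for a pure node and largest for an evenly mixed one, so a split is preferred when it lowers the impurity of the resulting child nodes.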

output.response.probdist

Optional Argument.
Specifies whether to output the probability distribution for output labels.
Default Value: FALSE
Types: logical
Note: This argument can accept the input value TRUE only when tdplyr is connected to Vantage 1.0 Maintenance Update 2 or later.

response.probdist.type

Optional Argument.
Specifies the type of algorithm to use to generate the output probability distribution for output labels.
Default Value: "Laplace"
Permitted Values: Laplace, Frequency, RawCount
Types: character
Note: This argument can only be used when "output.response.probdist" is set to TRUE.
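
This reference names the three permitted values but does not give their formulas. The sketch below shows one common interpretation for the class counts in a single leaf: raw counts, relative frequencies, and add-one (Laplace) smoothing. Treat these formulas as assumptions for illustration, not as the function's documented behavior; the counts are made up.

```r
# Assumed textbook interpretations of the three options (illustration only).
counts <- c(setosa = 8, versicolor = 2, virginica = 0)

raw_count <- counts                                       # "RawCount"
frequency <- counts / sum(counts)                         # "Frequency"
laplace   <- (counts + 1) / (sum(counts) + length(counts))  # "Laplace"

print(rbind(raw_count, frequency, laplace))
```

Under this interpretation, Laplace smoothing keeps every class probability above zero, even for a class that never appears in the leaf.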

categorical.encoding

Optional Argument.
Specifies which encoding method is used for categorical variables.
Default Value: "graycode"
Permitted Values: graycode, hashing
Types: character
Note: This argument is supported when tdplyr is connected to Vantage 1.1 or later versions.

attribute.table.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "attribute.table". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

categorical.attribute.table.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "categorical.attribute.table". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

response.table.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "response.table". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

splits.table.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "splits.table". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

Value

The function returns an object of class "td_decision_tree_mle", which is a named list containing Teradata tbl objects. Named list members can be referenced directly with the "$" operator using the following names:

  1. model.table

  2. intermediate.splits.table

  3. final.response.tableto

  4. output
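
The sketch below shows the "$" access pattern on a stand-in object. It is a local mock built from empty data frames so that it runs without a Vantage connection; a real result holds Teradata tbl objects instead. The member names are the ones listed above.

```r
# Mock result for illustration: a named list with the documented class.
# A real result contains Teradata tbl objects, not empty data frames.
result <- structure(
  list(
    model.table               = data.frame(),
    intermediate.splits.table = data.frame(),
    final.response.tableto    = data.frame(),
    output                    = data.frame()
  ),
  class = "td_decision_tree_mle"
)

# List the available members, then pick one with "$".
print(names(result))
model <- result$model.table
```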

Examples

    # Get the current context/connection
    con <- td_get_context()$connection

    # Load example data.
    loadExampleData("decision_tree_example", "iris_attribute_train", "iris_response_train", "iris_altinput")
    
    # Create remote tibble objects.
    iris_attribute_train <- tbl(con, "iris_attribute_train")
    iris_response_train <- tbl(con, "iris_response_train")
    iris_altinput <- tbl(con, "iris_altinput")
    
    # Example 1 - Create decision tree by specifying attribute and response tables.
    td_decision_tree_out1 <- td_decision_tree_mle(attribute.name.columns = c("attribute"),
                                                  attribute.value.column = "attrvalue",
                                                  id.columns = c("pid"),
                                                  attribute.table = iris_attribute_train,
                                                  response.table = iris_response_train,
                                                  response.column = "response",
                                                  num.splits = 3,
                                                  approx.splits = FALSE,
                                                  nodesize = 10,
                                                  max.depth = 10,
                                                  split.measure = "gini"
                                                  )
    
    # Example 2 - Create decision tree by specifying only the "data" argument.
    td_decision_tree_out2 <- td_decision_tree_mle(data = iris_altinput,
                                                  attribute.name.columns = c("attribute"),
                                                  attribute.value.column = "attrvalue",
                                                  id.columns = c("pid"),
                                                  response.column = "response",
                                                  num.splits = 3,
                                                  approx.splits = FALSE,
                                                  nodesize = 10,
                                                  max.depth = 10,
                                                  split.measure = "gini"
                                                  )