Teradata R Package Function Reference | 17.00 - AdaBoost - Teradata R Package

Teradata® R Package Function Reference

prodname: Teradata R Package
vrm_release: 17.00
created_date: September 2020
category: Programming Reference
featnum: B700-4007-090K

Description

The AdaBoost function takes a training data set and a single decision tree and uses adaptive boosting to produce a strong classifying model that can be input to the function AdaBoostPredict (td_adaboost_predict_mle).
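
To make the idea of adaptive boosting concrete, the following is a minimal base-R sketch of discrete AdaBoost. It is not the Vantage implementation: it assumes binary labels coded as -1/+1, a single numeric feature, and one-split "stumps" as the weak classifier. The helper names (fit_stump, predict_stump, adaboost_sketch, predict_adaboost) and the toy data are hypothetical; only the name iter.num is borrowed from this function's arguments.

```r
# Fit the one-split classifier with the lowest weighted error.
fit_stump <- function(x, y, w) {
  best <- list(err = Inf)
  for (thr in sort(unique(x))) {
    for (dir in c(1, -1)) {
      pred <- ifelse(x <= thr, -dir, dir)
      err <- sum(w[pred != y])
      if (err < best$err) best <- list(thr = thr, dir = dir, err = err)
    }
  }
  best
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$thr, -stump$dir, stump$dir)
}

# Train iter.num weak classifiers, reweighting the instances after each one.
adaboost_sketch <- function(x, y, iter.num = 20) {
  n <- length(y)
  w <- rep(1 / n, n)                      # start with uniform instance weights
  stumps <- vector("list", iter.num)
  alpha <- numeric(iter.num)
  for (m in seq_len(iter.num)) {
    s <- fit_stump(x, y, w)
    err <- min(max(s$err, 1e-10), 1 - 1e-10)   # keep the log finite
    alpha[m] <- 0.5 * log((1 - err) / err)     # vote of this weak classifier
    pred <- predict_stump(s, x)
    w <- w * exp(-alpha[m] * y * pred)         # raise weights of misclassified rows
    w <- w / sum(w)
    stumps[[m]] <- s
  }
  list(stumps = stumps, alpha = alpha)
}

# Combine the weak classifiers by weighted vote.
predict_adaboost <- function(model, x) {
  score <- rep(0, length(x))
  for (m in seq_along(model$stumps)) {
    score <- score + model$alpha[m] * predict_stump(model$stumps[[m]], x)
  }
  ifelse(score >= 0, 1, -1)
}

# Toy data (hypothetical): the classes separate at x = 3.
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(-1, -1, -1, 1, 1, 1, 1, 1)
model <- adaboost_sketch(x, y, iter.num = 5)
pred <- predict_adaboost(model, x)
```

The sketch shows why the number of iterations equals the number of weak classifiers: each pass through the loop fits exactly one classifier against the current instance weights.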

Usage

  td_adaboost_mle (
      attribute.data = NULL,
      attribute.name.columns = NULL,
      attribute.value.column = NULL,
      categorical.attribute.data = NULL,
      response.data = NULL,
      id.columns = NULL,
      response.column = NULL,
      iter.num = 20,
      num.splits = 10,
      approx.splits = TRUE,
      split.measure = "gini",
      max.depth = 3,
      min.node.size = 100,
      output.response.probdist = FALSE,
      categorical.encoding = "graycode",
      attribute.data.sequence.column = NULL,
      response.data.sequence.column = NULL,
      categorical.attribute.data.sequence.column = NULL
  )

Arguments

attribute.data

Required Argument.
Specifies the name of the tbl_teradata that contains the attributes and values of the data.

attribute.name.columns

Required Argument.
Specifies the names of the columns in the "attribute.data" tbl_teradata that contain the data attributes.
Types: character OR vector of Strings (character)

attribute.value.column

Required Argument.
Specifies the name of the column in the "attribute.data" tbl_teradata that contains the data values.
Types: character
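
The attribute name and value columns imply a sparse (long) layout: one row per instance and attribute, rather than one column per attribute. The base-R sketch below illustrates that layout on a small local data frame. The data values are hypothetical, and storing the values as character so that numeric and categorical attributes can share one column is an assumption of this sketch. The column names "attribute" and "value_col" match those used in Example 1.

```r
# Hypothetical dense data: one row per instance, one column per attribute.
dense <- data.frame(
  sn      = c(1, 2),
  price   = c(42000, 38500),
  lotsize = c(5850, 4000)
)

attrs <- c("price", "lotsize")

# Stack the attribute columns into (id, attribute, value_col) rows.
sparse <- do.call(rbind, lapply(attrs, function(a) {
  data.frame(sn = dense$sn,
             attribute = a,
             value_col = as.character(dense[[a]]),
             stringsAsFactors = FALSE)
}))

# Group the rows of each instance together.
sparse <- sparse[order(sparse$sn), ]
rownames(sparse) <- NULL
print(sparse)
```

With this layout, "sn" would be passed as id.columns, "attribute" as attribute.name.columns, and "value_col" as attribute.value.column.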

categorical.attribute.data

Optional Argument.
Specifies the name of the tbl_teradata that contains the names of the categorical attributes.

response.data

Required Argument.
Specifies the name of the tbl_teradata that contains the responses (labels) of the data.

id.columns

Required Argument.
Specifies the names of the columns in the response and attribute tbl_teradata objects that identify each instance.
Types: character OR vector of Strings (character)

response.column

Required Argument.
Specifies the name of the column in the "response.data" tbl_teradata that contains the responses (labels) of the data.
Types: character

iter.num

Optional Argument.
Specifies the number of iterations to boost the weak classifiers, which is also the number of weak classifiers in the ensemble. The value must be an integer in the range [2, 200].
Default Value: 20
Types: integer

num.splits

Optional Argument.
Specifies the number of splits to try for each attribute in the node splitting.
Default Value: 10
Types: integer

approx.splits

Optional Argument.
Specifies whether to use approximate percentiles.
Default Value: TRUE
Types: logical
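
One way to picture how "num.splits" and percentiles relate is to take candidate split thresholds at evenly spaced percentiles of an attribute. The base-R sketch below does this with exact percentiles. It is an illustration only: this page does not describe how the function places its splits or how the approximate percentiles are computed, and the helper name candidate_splits is hypothetical.

```r
# Candidate thresholds at num.splits evenly spaced (exact) percentiles.
candidate_splits <- function(x, num.splits = 10) {
  probs <- seq_len(num.splits) / (num.splits + 1)
  unique(as.numeric(quantile(x, probs = probs, names = FALSE)))
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
splits <- candidate_splits(x, num.splits = 3)
print(splits)  # 3 5 7
```

Exact percentiles require sorting each attribute, which is the kind of cost an approximate method would typically avoid on large tables.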

split.measure

Optional Argument.
Specifies the type of measure to use in node splitting.
Default Value: "gini"
Permitted Values: GINI, ENTROPY
Types: character
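
For reference, the two measures can be sketched in base R using their standard textbook definitions. The helper names are hypothetical, and the use of a base-2 logarithm for entropy is an assumption; this page does not state which base the function uses.

```r
# Gini impurity: 1 minus the sum of squared class proportions.
gini_impurity <- function(labels) {
  p <- as.numeric(table(labels)) / length(labels)
  1 - sum(p^2)
}

# Entropy: negative sum of p * log2(p) over the classes present.
entropy <- function(labels) {
  p <- as.numeric(table(labels)) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

labels <- c("a", "a", "b", "b")   # a perfectly mixed two-class node
gini_impurity(labels)  # 0.5
entropy(labels)        # 1
```

Both measures are 0 for a node that contains a single class, so a split is better the more it lowers the measure in the child nodes.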

max.depth

Optional Argument.
Specifies the maximum depth of the tree. The value must be an integer in the range [1, 10].
Default Value: 3
Types: integer
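
The documented ranges for "iter.num" ([2, 200]) and "max.depth" ([1, 10]) can be checked on the client before the function is called. The helper below is a hypothetical sketch and is not part of tdplyr; the function may perform its own validation and report errors differently.

```r
# Hypothetical pre-flight check of the documented argument ranges.
check_adaboost_ranges <- function(iter.num = 20, max.depth = 3) {
  is_whole <- function(v) {
    is.numeric(v) && length(v) == 1 && !is.na(v) && v == round(v)
  }
  if (!is_whole(iter.num) || iter.num < 2 || iter.num > 200) {
    stop("iter.num must be an integer in the range [2, 200]")
  }
  if (!is_whole(max.depth) || max.depth < 1 || max.depth > 10) {
    stop("max.depth must be an integer in the range [1, 10]")
  }
  invisible(TRUE)
}

check_adaboost_ranges(iter.num = 20, max.depth = 3)    # passes silently
# check_adaboost_ranges(iter.num = 1)                  # would stop with an error
```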

min.node.size

Optional Argument.
Specifies the minimum size of any particular node within each decision tree.
Default Value: 100
Types: integer

output.response.probdist

Optional Argument.
Specifies whether to output the probability distribution for the output labels.
Default Value: FALSE
Types: logical

categorical.encoding

Optional Argument.
Specifies the encoding method to be used for categorical variables.
Note: "categorical.encoding" is only available when tdplyr is connected to Vantage 1.1 or later versions.
Default Value: "graycode"
Permitted Values: graycode, hashing
Types: character

attribute.data.sequence.column

Optional Argument.
Specifies the column(s) that uniquely identify each row of the input argument "attribute.data". The argument is used to ensure deterministic results for functions whose results can otherwise vary from run to run.
Types: character OR vector of Strings (character)

response.data.sequence.column

Optional Argument.
Specifies the column(s) that uniquely identify each row of the input argument "response.data". The argument is used to ensure deterministic results for functions whose results can otherwise vary from run to run.
Types: character OR vector of Strings (character)

categorical.attribute.data.sequence.column

Optional Argument.
Specifies the column(s) that uniquely identify each row of the input argument "categorical.attribute.data". The argument is used to ensure deterministic results for functions whose results can otherwise vary from run to run.
Types: character OR vector of Strings (character)

Value

The function returns an object of class "td_adaboost_mle", which is a named list containing objects of class "tbl_teradata".
Named list members can be referenced directly with the "$" operator using the following names:

  1. model.table

  2. output

Examples

    # Get the current context/connection
    con <- td_get_context()$connection
    # Load the data to run the example
    loadExampleData("adaboost_example", "housing_train", "housing_cat", "housing_train_response", "iris_attribute_train", "iris_response_train")
    
    # Create object(s) of class "tbl_teradata".
    housing_train <- tbl(con, "housing_train")
    housing_cat <- tbl(con, "housing_cat")
    housing_train_response <- tbl(con, "housing_train_response")
    iris_attribute_train <- tbl(con, "iris_attribute_train")
    iris_response_train <- tbl(con, "iris_response_train")
    
    # Example 1 - This example uses home sales data to create a model that predicts home 
    # style when input to td_adaboost_predict_mle() function.
    #
    # Input description:
    # housing_train               (attribute.data) : tbl_teradata containing real estate
    #                                                sales data. There are six numerical
    #                                                predictors and six categorical
    #                                                predictors. The response variable
    #                                                is 'homestyle'.
    # housing_cat     (categorical.attribute.data) : tbl_teradata that lists all the
    #                                                categorical predictors.
    # housing_train_response       (response.data) : tbl_teradata that lists the responses
    #                                                for each instance in 'attribute.data' as
    #                                                specified by 'id.columns'.
    td_unpivot_out <- td_unpivot_mle(data = housing_train,
                                     unpivot = c("price", "lotsize", "bedrooms", "bathrms",
                                                 "stories","driveway", "recroom", "fullbase",
                                                 "gashw", "airco", "garagepl", "prefarea"),
                                     accumulate = "sn")
                                      
    td_adaboost_out1 <- td_adaboost_mle(attribute.data = td_unpivot_out$result,
                                        attribute.name.columns = "attribute",
                                        attribute.value.column = "value_col",
                                        categorical.attribute.data = housing_cat,
                                        response.data = housing_train_response,
                                        id.columns = "sn",
                                        response.column = "response",
                                        iter.num = 20,
                                        num.splits = 10,
                                        max.depth = 3,
                                        min.node.size = 100)

    
    # Example 2 - This example uses the iris flower dataset to create a model that predicts
    # the species when input to td_adaboost_predict_mle().
    #
    # Input description:
    # iris_attribute_train  (attribute.data) : tbl_teradata containing the iris flower
    #                                          dataset in the sparse format.
    # iris_response_train    (response.data) : tbl_teradata specifying the response variable
    #                                          for each instance.
    td_adaboost_out2 <- td_adaboost_mle(attribute.data = iris_attribute_train,
                                        attribute.name.columns = "attribute",
                                        attribute.value.column = "attrvalue",
                                        response.data = iris_response_train,
                                        id.columns = "pid",
                                        response.column = "response",
                                        iter.num = 5,
                                        num.splits = 10,
                                        max.depth = 3,
                                        min.node.size = 5,
                                        output.response.probdist = TRUE,
                                        approx.splits = FALSE)
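    
    # As described in the Value section, each returned object is a named list whose
    # members can be referenced with the "$" operator. The two lines below assume
    # the td_adaboost_mle() call above completed successfully.
    td_adaboost_out2$model.table
    td_adaboost_out2$output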