Teradata Package for R Function Reference | 17.00 - AdaBoost - Teradata Package for R

AdaBoost

Description

The AdaBoost function takes a training data set and a single decision tree and uses adaptive boosting to produce a strong classifying model that can be input to the function AdaBoostPredict (td_adaboost_predict_mle).
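
To make the idea of adaptive boosting concrete, the following is a minimal base R sketch of the textbook discrete AdaBoost procedure for two classes, using one-split decision stumps as the weak classifiers. It is an illustration under stated assumptions, not the algorithm as implemented in Vantage: the helper names (fit_stump, predict_stump, adaboost_sketch, predict_adaboost) and the toy data are hypothetical, and only the parameter name iter.num mirrors the argument documented on this page.

```r
# Illustrative sketch only: binary AdaBoost with decision stumps in base R.
# Not the Vantage implementation; all function names here are hypothetical.

# Toy 1-D data with labels in {-1, +1}; no single threshold separates them.
x <- 0:9
y <- c(1, 1, 1, -1, -1, -1, 1, 1, 1, -1)

# Find the threshold/direction pair with the lowest weighted error.
fit_stump <- function(x, y, w) {
  xs <- sort(unique(x))
  thresholds <- (head(xs, -1) + tail(xs, -1)) / 2  # midpoints between values
  best <- list(err = Inf)
  for (thr in thresholds) {
    for (s in c(1, -1)) {
      pred <- ifelse(x < thr, s, -s)
      err <- sum(w[pred != y])
      if (err < best$err) best <- list(threshold = thr, sign = s, err = err)
    }
  }
  best
}

predict_stump <- function(stump, x) {
  ifelse(x < stump$threshold, stump$sign, -stump$sign)
}

adaboost_sketch <- function(x, y, iter.num = 3) {
  n <- length(y)
  w <- rep(1 / n, n)                      # start with uniform instance weights
  stumps <- vector("list", iter.num)
  alpha <- numeric(iter.num)
  for (m in seq_len(iter.num)) {
    stump <- fit_stump(x, y, w)
    err <- max(stump$err, 1e-10)          # avoid division by zero
    alpha[m] <- 0.5 * log((1 - err) / err)
    pred <- predict_stump(stump, x)
    w <- w * exp(-alpha[m] * y * pred)    # raise weights of misclassified rows
    w <- w / sum(w)
    stumps[[m]] <- stump
  }
  list(stumps = stumps, alpha = alpha)
}

# The ensemble prediction is the sign of the alpha-weighted vote.
predict_adaboost <- function(model, x) {
  score <- rep(0, length(x))
  for (m in seq_along(model$alpha)) {
    score <- score + model$alpha[m] * predict_stump(model$stumps[[m]], x)
  }
  sign(score)
}

model <- adaboost_sketch(x, y, iter.num = 3)
training_errors <- sum(predict_adaboost(model, x) != y)
```

On this toy data a single stump misclassifies three rows, while the three-stump ensemble classifies every training row correctly, which is the sense in which boosting turns weak classifiers into a stronger one.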

Usage

  td_adaboost_mle (
      attribute.data = NULL,
      attribute.name.columns = NULL,
      attribute.value.column = NULL,
      categorical.attribute.data = NULL,
      response.data = NULL,
      id.columns = NULL,
      response.column = NULL,
      iter.num = 20,
      num.splits = 10,
      approx.splits = TRUE,
      split.measure = "gini",
      max.depth = 3,
      min.node.size = 100,
      output.response.probdist = FALSE,
      categorical.encoding = "graycode",
      attribute.data.sequence.column = NULL,
      response.data.sequence.column = NULL,
      categorical.attribute.data.sequence.column = NULL
  )

Arguments

`attribute.data`	Required Argument. Specifies the name of the tbl_teradata that contains the attributes and values of the data.
`attribute.name.columns`	Required Argument. Specifies the columns of the attribute tbl_teradata that contain the data attributes. Types: character OR vector of Strings (character)
`attribute.value.column`	Required Argument. Specifies the column of the attribute tbl_teradata that contains the data values. Types: character
`categorical.attribute.data`	Optional Argument. Specifies the name of the tbl_teradata that contains the names of the categorical attributes.
`response.data`	Required Argument. Specifies the name of the tbl_teradata that contains the responses (labels) of the data.
`id.columns`	Required Argument. Specifies the names of the columns in the response and attribute tables that specify the identifier of the instance. Types: character OR vector of Strings (character)
`response.column`	Required Argument. Specifies the name of the response tbl_teradata column that contains the responses (labels) of the data. Types: character
`iter.num`	Optional Argument. Specifies the number of iterations to boost the weak classifiers, which is also the number of weak classifiers in the ensemble. The value must be an integer in the range [2, 200]. Default Value: 20 Types: integer
`num.splits`	Optional Argument. Specifies the number of splits to try for each attribute in the node splitting. Default Value: 10 Types: integer
`approx.splits`	Optional Argument. Specifies whether to use approximate percentiles. Default Value: TRUE Types: logical
`split.measure`	Optional Argument. Specifies the type of measure to use in node splitting. Default Value: "gini" Permitted Values: GINI, ENTROPY Types: character
`max.depth`	Optional Argument. Specifies the maximum depth of the tree. The "max.depth" must be in the range [1, 10]. Default Value: 3 Types: integer
`min.node.size`	Optional Argument. Specifies the minimum size of any particular node within each decision tree. Default Value: 100 Types: integer
`output.response.probdist`	Optional Argument. Specifies a flag to enable or disable output of probability distribution for output labels. Default Value: FALSE Types: logical
`categorical.encoding`	Optional Argument. Specifies the encoding method to be used for categorical variables. Note: "categorical.encoding" is only available when tdplyr is connected to Vantage 1.1 or later versions. Default Value: "graycode" Permitted Values: graycode, hashing Types: character
`attribute.data.sequence.column`	Optional Argument. Specifies the vector of column(s) that uniquely identifies each row of the input argument "attribute.data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run. Types: character OR vector of Strings (character)
`response.data.sequence.column`	Optional Argument. Specifies the vector of column(s) that uniquely identifies each row of the input argument "response.data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run. Types: character OR vector of Strings (character)
`categorical.attribute.data.sequence.column`	Optional Argument. Specifies the vector of column(s) that uniquely identifies each row of the input argument "categorical.attribute.data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run. Types: character OR vector of Strings (character)
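
The two permitted values of "split.measure" name standard node impurity measures. As a rough illustration, the base R sketch below computes the textbook Gini impurity and entropy of a set of labels, and the size-weighted impurity of a candidate split. The function names are hypothetical, and whether Vantage computes the measures in exactly this form is an assumption.

```r
# Illustrative sketch only: textbook impurity measures in base R.
# Function names are hypothetical and not part of tdplyr.

# Proportion of each class among the labels at a node.
class_props <- function(labels) as.numeric(table(labels)) / length(labels)

# Gini impurity: 1 - sum of squared class proportions.
gini_impurity <- function(labels) {
  p <- class_props(labels)
  1 - sum(p^2)
}

# Entropy in bits: -sum(p * log2(p)), skipping empty classes.
entropy_impurity <- function(labels) {
  p <- class_props(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Impurity of a split: child impurities weighted by child size.
split_impurity <- function(labels, goes_left, measure = gini_impurity) {
  n <- length(labels)
  left <- labels[goes_left]
  right <- labels[!goes_left]
  (length(left) / n) * measure(left) + (length(right) / n) * measure(right)
}

labels <- c("a", "a", "b", "b")
mixed_gini <- gini_impurity(labels)
mixed_entropy <- entropy_impurity(labels)
perfect_split <- split_impurity(labels, goes_left = c(TRUE, TRUE, FALSE, FALSE))
```

A node splitter that tries several candidate thresholds per attribute would typically keep the candidate with the lowest split impurity under the chosen measure.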

Value

The function returns an object of class "td_adaboost_mle", which is a named list containing objects of class "tbl_teradata".
Named list members can be referenced directly with the "$" operator using the following names:

model.table
output
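
As a small illustration of the access pattern, the sketch below builds a stand-in list with the same member names and class attribute, then reads the members with "$". The data frames are placeholders so the snippet runs without a Vantage connection; a real result holds tbl_teradata objects whose columns are not shown here.

```r
# Illustrative sketch only: a stand-in for the returned object.
# The data frame contents are placeholders, not real function output.
mock_result <- structure(
  list(
    model.table = data.frame(note = "stand-in for the model tbl_teradata"),
    output = data.frame(note = "stand-in for the output tbl_teradata")
  ),
  class = "td_adaboost_mle"
)

# List the available members, then pull each one out by name.
member_names <- names(mock_result)
model_tbl <- mock_result$model.table
output_tbl <- mock_result$output
```

With a real result such as td_adaboost_out1 from the examples, the same "$" access applies.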

Examples

  
    # Get the current context/connection
    con <- td_get_context()$connection
    # Load the data to run the example
    loadExampleData("adaboost_example", "housing_train", "housing_cat", "housing_train_response",
                    "iris_attribute_train", "iris_response_train")

    # Create object(s) of class "tbl_teradata".
    housing_train <- tbl(con, "housing_train")
    housing_cat <- tbl(con, "housing_cat")
    housing_train_response <- tbl(con, "housing_train_response")
    iris_attribute_train <- tbl(con, "iris_attribute_train")
    iris_response_train <- tbl(con, "iris_response_train")

    # Example 1 - This example uses home sales data to create a model that predicts home
    # style when input to td_adaboost_predict_mle() function.
    #
    # Input description:
    # housing_train               (attribute.data) : tbl_teradata containing real estate
    #                                                sales data. There are six numerical
    #                                                predictors and six categorical
    #                                                predictors. The response variable
    #                                                is 'homestyle'.
    # housing_cat     (categorical.attribute.data) : tbl_teradata that lists all the
    #                                                categorical predictors.
    # housing_train_response       (response.data) : tbl_teradata that lists the responses
    #                                                for each instance in 'attribute.data' as
    #                                                specified by 'id.columns'.
    td_unpivot_out <- td_unpivot_mle(data = housing_train,
                                     unpivot = c("price", "lotsize", "bedrooms", "bathrms",
                                                 "stories","driveway", "recroom", "fullbase",
                                                 "gashw", "airco", "garagepl", "prefarea"),
                                     accumulate = "sn")

    td_adaboost_out1 <- td_adaboost_mle(attribute.data = td_unpivot_out$result,
                                        attribute.name.columns = "attribute",
                                        attribute.value.column = "value_col",
                                        categorical.attribute.data = housing_cat,
                                        response.data = housing_train_response,
                                        id.columns = "sn",
                                        response.column = "response",
                                        iter.num = 20,
                                        num.splits = 10,
                                        max.depth = 3,
                                        min.node.size = 100)


    # Example 2 - This example uses the iris flower dataset to create a model that predicts
    # the species when input to td_adaboost_predict_mle().
    #
    # Input description:
    # iris_attribute_train  (attribute.data) : tbl_teradata containing the iris flower
    #                                          dataset in the sparse format.
    # iris_response_train    (response.data) : tbl_teradata specifying the response variable
    #                                          for each instance.
    td_adaboost_out2 <- td_adaboost_mle(attribute.data = iris_attribute_train,
                                        attribute.name.columns = "attribute",
                                        attribute.value.column = "attrvalue",
                                        response.data = iris_response_train,
                                        id.columns = "pid",
                                        response.column = "response",
                                        iter.num = 5,
                                        num.splits = 10,
                                        max.depth = 3,
                                        min.node.size = 5,
                                        output.response.probdist = TRUE,
                                        approx.splits = FALSE)
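
Both examples pass attribute data in a sparse layout: one row per instance and attribute, with the attribute name in one column and its value in another. Example 1 produces that layout with td_unpivot_mle(). The base R sketch below shows the same kind of reshaping on a tiny local data frame so the target layout is easy to see. The helper to_sparse and the toy values are hypothetical, and storing every value as text is an assumption made so numeric and categorical attributes can share one column.

```r
# Illustrative sketch only: wide-to-sparse reshaping in base R.
# to_sparse() is a hypothetical helper; the values below are made up.
wide <- data.frame(
  sn = c(1, 2),
  price = c(42000, 38500),
  bedrooms = c(3, 2),
  driveway = c("yes", "no"),
  stringsAsFactors = FALSE
)

to_sparse <- function(df, id.column, unpivot) {
  # Build one block of (id, attribute, value) rows per unpivoted column.
  rows <- lapply(unpivot, function(col) {
    data.frame(
      id = df[[id.column]],
      attribute = col,
      value_col = as.character(df[[col]]),  # assumption: values kept as text
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, rows)
  names(out)[1] <- id.column
  out <- out[order(out[[id.column]]), ]     # group rows by instance
  rownames(out) <- NULL
  out
}

sparse <- to_sparse(wide, id.column = "sn",
                    unpivot = c("price", "bedrooms", "driveway"))
```

Each instance in the wide frame becomes three rows in the sparse frame, and the column names "attribute" and "value_col" match the ones passed to "attribute.name.columns" and "attribute.value.column" in Example 1.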