Teradata R Package Function Reference - AdaBoost

Teradata® R Package Function Reference

Product: Teradata R Package
Release Number: 16.20
Published: February 2020
Language: English (United States)
Last Update: 2020-02-28
dita:id: B700-4007
lifecycle: previous
Product Category: Teradata Vantage

Description

The AdaBoost function takes a training data set and a single decision tree and uses adaptive boosting to produce a strong classifying model that can be input to the function AdaBoostPredict (td_adaboost_predict_mle).
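
As a rough illustration of what adaptive boosting means, the sketch below runs a single reweighting round of textbook binary AdaBoost in plain R. It is not the Vantage implementation, whose internals are not described here; the labels, predictions, and variable names are made up for the illustration.

```r
# One round of textbook binary AdaBoost reweighting (illustrative only;
# not the algorithm variant that Vantage necessarily runs).
y    <- c(1,  1, -1, -1, 1)             # hypothetical true labels
pred <- c(1, -1, -1, -1, 1)             # weak classifier output, one mistake
w    <- rep(1 / length(y), length(y))   # start from uniform instance weights

miss  <- pred != y                      # which instances were misclassified
err   <- sum(w[miss])                   # weighted error of the weak classifier
alpha <- 0.5 * log((1 - err) / err)     # vote given to this weak classifier

# Raise the weight of misclassified instances, lower the rest, renormalize.
w_new <- w * exp(ifelse(miss, alpha, -alpha))
w_new <- w_new / sum(w_new)
```

After the update the misclassified instance carries half of the total weight, so the next weak classifier is pushed to get it right. Repeating this for iter.num rounds yields the ensemble of weak classifiers.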

Usage

  td_adaboost_mle (
      attribute.data = NULL,
      attribute.name.columns = NULL,
      attribute.value.column = NULL,
      categorical.attribute.data = NULL,
      response.data = NULL,
      id.columns = NULL,
      response.column = NULL,
      iter.num = 20,
      num.splits = 10,
      approx.splits = TRUE,
      split.measure = "gini",
      max.depth = 3,
      min.node.size = 100,
      output.response.probdist = FALSE,
      categorical.encoding = "graycode",
      attribute.data.sequence.column = NULL,
      response.data.sequence.column = NULL,
      categorical.attribute.data.sequence.column = NULL
  )

Arguments

attribute.data

Required Argument.
Specifies the name of the tbl_teradata that contains the attributes and values of the data.

attribute.name.columns

Required Argument.
Specifies the names of the attribute tbl_teradata columns that contain the data attributes.
Types: character OR vector of Strings (character)

attribute.value.column

Required Argument.
Specifies the name of the attribute tbl_teradata column that contains the data values.
Types: character
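
The attribute arguments above describe a long layout: one row per instance and attribute, with the attribute name in one column and its value in another. The sketch below builds that shape from a small wide data frame in plain R, without a Vantage connection. The toy values are invented, and storing every value as character is an assumption made here so that numeric and categorical attributes can share one column.

```r
# Hypothetical wide data: one row per instance, one column per attribute.
wide <- data.frame(
  sn       = c(1, 2),
  price    = c(42000, 38500),
  driveway = c("yes", "no"),
  stringsAsFactors = FALSE
)

# Reshape to long form: one row per (sn, attribute) pair.
attrs <- setdiff(names(wide), "sn")
long <- do.call(rbind, lapply(attrs, function(a) {
  data.frame(
    sn        = wide$sn,
    attribute = a,                         # would go in attribute.name.columns
    value_col = as.character(wide[[a]]),   # would go in attribute.value.column
    stringsAsFactors = FALSE
  )
}))
```

With a table in this shape, "sn" would play the role of id.columns, "attribute" of attribute.name.columns, and "value_col" of attribute.value.column.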

categorical.attribute.data

Optional Argument.
Specifies the name of the tbl_teradata that contains the names of the categorical attributes.

response.data

Required Argument.
Specifies the name of the tbl_teradata that contains the responses (labels) of the data.

id.columns

Required Argument.
Specifies the names of the columns in the response and attribute tables that specify the identifier of the instance.
Types: character OR vector of Strings (character)

response.column

Required Argument.
Specifies the name of the response tbl_teradata column that contains the responses (labels) of the data.
Types: character

iter.num

Optional Argument.
Specifies the number of iterations to boost the weak classifiers, which is also the number of weak classifiers in the ensemble. The value must be in the range [2, 200].
Default Value: 20
Types: numeric

num.splits

Optional Argument.
Specifies the number of splits to try for each attribute in the node splitting.
Default Value: 10
Types: numeric

approx.splits

Optional Argument.
Specifies whether to use approximate percentiles.
Default Value: TRUE
Types: logical

split.measure

Optional Argument.
Specifies the type of measure to use in node splitting.
Default Value: "gini"
Permitted Values: gini, entropy
Types: character

max.depth

Optional Argument.
Specifies the maximum depth of the tree. The value for max.depth must be in the range [1, 10].
Default Value: 3
Types: numeric

min.node.size

Optional Argument.
Specifies the minimum size of any particular node within each decision tree.
Default Value: 100
Types: numeric

output.response.probdist

Optional Argument.
Specifies whether to output the probability distribution for the output labels.
Default Value: FALSE
Types: logical

categorical.encoding

Optional Argument.
Specifies the encoding method that is to be used for categorical variables.
Note: categorical.encoding is available only when tdplyr is connected to Vantage 1.1 or later.
Default Value: "graycode"
Permitted Values: graycode, hashing
Types: character

attribute.data.sequence.column

Optional Argument.
Specifies the column(s) that uniquely identify each row of the input argument "attribute.data". The argument is used to ensure deterministic results for functions whose results vary from run to run.
Types: character OR vector of Strings (character)

response.data.sequence.column

Optional Argument.
Specifies the column(s) that uniquely identify each row of the input argument "response.data". The argument is used to ensure deterministic results for functions whose results vary from run to run.
Types: character OR vector of Strings (character)

categorical.attribute.data.sequence.column

Optional Argument.
Specifies the column(s) that uniquely identify each row of the input argument "categorical.attribute.data". The argument is used to ensure deterministic results for functions whose results vary from run to run.
Types: character OR vector of Strings (character)
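
The ranges and permitted values documented above can be checked locally before a call is made. The helper below is a minimal sketch of such a check; check_adaboost_args is a hypothetical name and is not part of the Teradata R Package, and this reference does not say how td_adaboost_mle itself reports out-of-range values.

```r
# Hypothetical helper, not part of the Teradata R Package.
# Mirrors the documented limits: iter.num in [2, 200], max.depth in [1, 10],
# split.measure in {gini, entropy}, categorical.encoding in {graycode, hashing}.
check_adaboost_args <- function(iter.num = 20,
                                max.depth = 3,
                                split.measure = "gini",
                                categorical.encoding = "graycode") {
  stopifnot(
    is.numeric(iter.num), iter.num >= 2, iter.num <= 200,
    is.numeric(max.depth), max.depth >= 1, max.depth <= 10,
    split.measure %in% c("gini", "entropy"),
    categorical.encoding %in% c("graycode", "hashing")
  )
  invisible(TRUE)  # all checks passed
}

# Usage: passes silently for the settings used in Example 2.
check_adaboost_args(iter.num = 5, max.depth = 3)
```

A call such as check_adaboost_args(iter.num = 1) stops with an error, because 1 is below the documented minimum of 2.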

Value

The function returns an object of class "td_adaboost_mle", which is a named list containing Teradata tbl objects.
Named list members can be referenced directly with the "$" operator using the following names:

  1. model.table

  2. output

Examples

    # Get the current context/connection
    con <- td_get_context()$connection
    # Load the data to run the example
    loadExampleData("adaboost_example", "housing_train", "housing_cat", "housing_train_response", "iris_attribute_train", "iris_response_train")
    
    # Create remote tibble objects.
    housing_train <- tbl(con, "housing_train")
    housing_cat <- tbl(con, "housing_cat")
    housing_train_response <- tbl(con, "housing_train_response")
    iris_attribute_train <- tbl(con, "iris_attribute_train")
    iris_response_train <- tbl(con, "iris_response_train")
    
    # Example 1 - This example uses home sales data to create a model that predicts home 
    # style when input to td_adaboost_predict_mle.
    #
    # Input description:
    # housing_train               (attribute.data) : tbl_teradata containing real estate
    #                                                sales data. There are six numerical
    #                                                predictors and six categorical
    #                                                predictors. The response variable
    #                                                is 'homestyle'.
    # housing_cat     (categorical.attribute.data) : tbl_teradata that lists all the
    #                                                categorical predictors.
    # housing_train_response       (response.data) : tbl_teradata that lists the responses
    #                                                for each instance in 'attribute.data' as
    #                                                specified by 'id.columns'.
    td_unpivot_out <- td_unpivot_mle(data = housing_train,
                                     unpivot = c("price", "lotsize", "bedrooms", "bathrms",
                                                 "stories","driveway", "recroom", "fullbase",
                                                 "gashw", "airco", "garagepl", "prefarea"),
                                     accumulate = "sn")
                                      
    td_adaboost_out1 <- td_adaboost_mle(attribute.data = td_unpivot_out$result,
                                        attribute.name.columns = "attribute",
                                        attribute.value.column = "value_col",
                                        categorical.attribute.data = housing_cat,
                                        response.data = housing_train_response,
                                        id.columns = "sn",
                                        response.column = "response",
                                        iter.num = 20,
                                        num.splits = 10,
                                        max.depth = 3,
                                        min.node.size = 100)

    
    # Example 2 - This example uses the iris flower dataset to create a model that predicts
    # the species when input to td_adaboost_predict_mle.
    #
    # Input description:
    # iris_attribute_train  (attribute.data) : tbl_teradata containing the iris flower
    #                                          dataset in the sparse format.
    # iris_response_train    (response.data) : tbl_teradata specifying the response variable
    #                                          for each instance.
    td_adaboost_out2 <- td_adaboost_mle(attribute.data = iris_attribute_train,
                                        attribute.name.columns = "attribute",
                                        attribute.value.column = "attrvalue",
                                        response.data = iris_response_train,
                                        id.columns = "pid",
                                        response.column = "response",
                                        iter.num = 5,
                                        num.splits = 10,
                                        max.depth = 3,
                                        min.node.size = 5,
                                        output.response.probdist = TRUE,
                                        approx.splits = FALSE)