Teradata® Package for R Function Reference

Product: Teradata Package for R
Release Number: 17.00
Published: July 2021
Language: English (United States)
Last Update: 2023-08-08
dita:id: B700-4007
NMT: no
Product Category: Teradata Vantage
XGBoost

Description

The XGBoost function takes a training data set and uses gradient boosting to create a strong classifying model that can be input to the function XGBoostPredict (td_xgboost_predict_mle). The function supports input tables in both dense and sparse format.

Usage

  td_xgboost_mle (
      formula = NULL,
      data = NULL,
      id.column = NULL,
      loss.function = "SOFTMAX",
      prediction.type = "CLASSIFICATION",
      reg.lambda = 1,
      shrinkage.factor = 0.1,
      iter.num = 10,
      min.node.size = 1,
      max.depth = 5,
      variance = 0,
      seed = NULL,
      attribute.name.column = NULL,
      num.boosted.trees = NULL,
      attribute.table = NULL,
      attribute.value.column = NULL,
      column.subsampling = 1.0,
      response.column = NULL,
      data.sequence.column = NULL,
      attribute.table.sequence.column = NULL
  )

Arguments

formula

Required Argument when input data is in dense format.
Specifies an object of class "formula" that describes the model to be fitted. Only basic formulas of the form (col1 ~ col2 + col3 + ...) are supported, and all variables must come from the same tbl_teradata object. The response should be a column of type real, numeric, integer or boolean. This argument is not supported for sparse format; for sparse data, provide this information using the "attribute.table" argument.
Note: This argument should not be specified along with "response.column".

data

Required Argument.
Specifies the tbl_teradata object containing the input data set. If the input data set is in dense format, the td_xgboost_mle function requires only "data".

id.column

Optional Argument.
Specifies the name of the partitioning column of input tbl_teradata. This column is used as a row identifier to distribute data among different vworkers for parallel boosted trees.
Types: character

loss.function

Optional Argument.
Specifies the learning task and corresponding learning objective.
Default Value: "SOFTMAX"
Permitted Values: BINOMIAL, SOFTMAX
Types: character

prediction.type

Optional Argument.
Specifies whether the function predicts the result from the number of classes ('classification') or from a continuous response variable ('regression'). The function supports only 'classification'.
Default Value: "CLASSIFICATION"
Permitted Values: CLASSIFICATION
Types: character

reg.lambda

Optional Argument.
Specifies the L2 regularization that the loss function uses while boosting trees. The higher the lambda, the stronger the regularization effect.
Default Value: 1
Types: numeric
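
To illustrate why a higher lambda means stronger regularization, the sketch below uses the leaf-weight expression from the standard XGBoost formulation, w = -G / (H + lambda), where G and H are the sums of the first and second derivatives of the loss over the rows in a leaf. This is an assumption made for illustration: this page does not state that td_xgboost_mle computes leaf weights this way, and the helper leaf_weight and the values of G and H are hypothetical.

```r
# Hypothetical helper: leaf weight under the standard XGBoost formulation.
# G = sum of gradients in the leaf, H = sum of Hessians in the leaf.
leaf_weight <- function(G, H, lambda) {
  -G / (H + lambda)
}

# Illustrative values only; they are not taken from any real model.
G <- -4
H <- 3

lambdas <- c(0, 1, 10)
weights <- sapply(lambdas, function(l) leaf_weight(G, H, l))

# Larger lambda pulls the leaf weight toward zero.
print(data.frame(reg.lambda = lambdas, leaf.weight = weights))
```

Under this assumption, raising "reg.lambda" shrinks every leaf's contribution, which trades some fit on the training data for lower variance.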

shrinkage.factor

Optional Argument.
Specifies the learning rate (weight) of a learned tree in each boosting step. After each boosting step, the algorithm multiplies the learner by shrinkage to make the boosting process more conservative. The shrinkage is a DOUBLE PRECISION value in the range [0, 1].
The value 1 specifies no shrinkage.
Default Value: 0.1
Types: numeric
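
The effect of the shrinkage factor can be sketched in plain R with an idealized weak learner that fits the current residual exactly. The function remaining_residual is hypothetical and runs entirely on the client; it is not how td_xgboost_mle trains trees, only a toy model of the "multiply each learner by shrinkage" step described above.

```r
# Residual left after iter.num boosting steps when each learner fits the
# current residual exactly and its contribution is scaled by the shrinkage.
remaining_residual <- function(initial, shrinkage, iter.num) {
  residual <- initial
  for (i in seq_len(iter.num)) {
    learner <- residual                         # idealized weak learner
    residual <- residual - shrinkage * learner  # apply the shrunken update
  }
  residual
}

# With shrinkage 0.1, each step removes only 10% of what is left.
r_default <- remaining_residual(1, shrinkage = 0.1, iter.num = 10)

# With shrinkage 1 (no shrinkage), the first step removes everything.
r_none <- remaining_residual(1, shrinkage = 1, iter.num = 10)

print(c(shrinkage_0.1 = r_default, shrinkage_1 = r_none))
```

In this toy setting a small shrinkage factor needs more iterations to reach the same fit, which is why "shrinkage.factor" and "iter.num" are usually tuned together.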

iter.num

Optional Argument.
Specifies the number of iterations to boost the weak classifiers, which is also the number of weak classifiers in the ensemble (T). The number must be an integer in the range [1, 100000].
Default Value: 10
Types: integer

min.node.size

Optional Argument.
Specifies the minimum size of any particular node within each decision tree.
Default Value: 1
Types: integer

max.depth

Optional Argument.
Specifies the maximum depth of the tree. The "max.depth" must be in the range [1, 100000].
Default Value: 12
Types: integer

variance

Optional Argument.
Specifies a decision tree stopping criterion. If the variance within any node dips below this value, the algorithm stops looking for splits in the branch.
Default Value: 0
Types: numeric

seed

Optional Argument.
Specifies the random seed the algorithm uses for repeatable results. If you omit this argument or specify its default value 1, the function uses a faster algorithm but does not ensure repeatability. This argument must be greater than or equal to 1. To ensure repeatability, specify a value greater than 1.
Default Value: 1
Types: numeric

attribute.name.column

Optional Argument.
Required if the input data set is in sparse format. Specifies the column containing the attributes in the input data set.
Types: character

num.boosted.trees

Optional Argument.
Specifies the number of boosted trees to be trained. By default, the number of boosted trees equals the number of vworkers available for the function.
Types: integer

attribute.table

Optional Argument.
Required argument for sparse data format.
Specifies the name of the tbl_teradata containing the features in the input data.
If the input data set is in sparse format, the function requires both "data" and "attribute.table" arguments.

attribute.value.column

Required if the input data set is in sparse format.
If the data is in the sparse format, this argument indicates the column containing the attribute values in the input tbl_teradata.
Types: character
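
The relationship between the dense and sparse layouts can be sketched with ordinary R data frames. The example below is a client-side illustration only: the data values are made up, and the column names id, attribute, value_col and species are borrowed from the sparse example on this page rather than required by the function.

```r
# Hypothetical dense input: one row per observation, one column per feature.
dense <- data.frame(
  id           = 1:3,
  sepal_length = c(5.1, 4.9, 6.3),
  petal_length = c(1.4, 1.4, 4.9),
  species      = c(1L, 1L, 2L)
)

feature_cols <- c("sepal_length", "petal_length")

# Sparse layout, part 1: the response stays with the row identifier.
# This is the shape of the table passed as "data".
sparse_data <- dense[, c("id", "species")]

# Sparse layout, part 2: one row per (id, attribute, value) triple.
# This is the shape of the table passed as "attribute.table".
sparse_attribute <- do.call(rbind, lapply(feature_cols, function(col) {
  data.frame(id = dense$id, attribute = col, value_col = dense[[col]],
             stringsAsFactors = FALSE)
}))

print(sparse_data)
print(sparse_attribute)
```

With tables shaped like these, "attribute.name.column" would name the attribute column and "attribute.value.column" would name the value_col column.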

column.subsampling

Optional Argument.
Specifies the fraction of features to subsample during boosting.
Default Value: 1.0 (no subsampling)
Types: numeric

response.column

Required Argument when "formula" is not specified.
Specifies the name of the input tbl_teradata column that contains the response variable for each data point in the training data set.
Types: character

data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

attribute.table.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "attribute.table". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)
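
Several of the constraints documented above can be checked on the client before a call is sent to the database. The helper below is a minimal sketch, not part of the Teradata Package for R: the name check_xgboost_args and its return convention (a character vector of problems, empty when none are found) are hypothetical, and it covers only the rules stated on this page.

```r
# Hypothetical pre-flight check for a subset of td_xgboost_mle arguments.
# Returns a character vector describing each violated rule.
check_xgboost_args <- function(shrinkage.factor, iter.num, max.depth,
                               formula = NULL, response.column = NULL,
                               seed = NULL) {
  problems <- character(0)

  # "formula" and "response.column" are mutually exclusive, and one is needed.
  if (!is.null(formula) && !is.null(response.column)) {
    problems <- c(problems, "specify 'formula' or 'response.column', not both")
  }
  if (is.null(formula) && is.null(response.column)) {
    problems <- c(problems, "one of 'formula' or 'response.column' is required")
  }

  # Documented ranges.
  if (shrinkage.factor < 0 || shrinkage.factor > 1) {
    problems <- c(problems, "'shrinkage.factor' must be in [0, 1]")
  }
  if (iter.num < 1 || iter.num > 100000) {
    problems <- c(problems, "'iter.num' must be in [1, 100000]")
  }
  if (max.depth < 1 || max.depth > 100000) {
    problems <- c(problems, "'max.depth' must be in [1, 100000]")
  }
  if (!is.null(seed) && seed < 1) {
    problems <- c(problems, "'seed' must be greater than or equal to 1")
  }

  problems
}

# A call that satisfies every rule checked here.
ok <- check_xgboost_args(shrinkage.factor = 0.1, iter.num = 10, max.depth = 10,
                         formula = species ~ sepal_length, seed = 2)

# A call that violates three rules.
bad <- check_xgboost_args(shrinkage.factor = 1.5, iter.num = 0, max.depth = 10,
                          formula = species ~ sepal_length,
                          response.column = "species")
print(bad)
```

Returning all problems at once, rather than stopping at the first, lets a caller fix every argument in a single pass.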

Value

The function returns an object of class "td_xgboost_mle", which is a named list containing objects of class "tbl_teradata".
Named list members can be referenced directly with the "$" operator using the following names:

  1. model.table

  2. output

Examples

  
    # Get the current context/connection
    con <- td_get_context()$connection

    # Load example data.
    loadExampleData("xgboost_example", "housing_train_binary", "iris_train", "sparse_iris_train",
                    "sparse_iris_attribute")

    # Example 1: Binary Classification
    # Create object(s) of class "tbl_teradata".
    housing_train_binary <- tbl(con, "housing_train_binary")
    td_xgboost_out1 <- td_xgboost_mle(data=housing_train_binary,
              id.column='sn',
              formula = (homestyle ~ driveway + recroom + fullbase + gashw + airco + prefarea +
                                     price + lotsize + bedrooms + bathrms + stories + garagepl),
              num.boosted.trees=2,
              loss.function='binomial',
              prediction.type='classification',
              reg.lambda=1,
              shrinkage.factor=0.1,
              iter.num=10,
              min.node.size=1,
              max.depth=10
              )


    # Example 2: Multiple-Class Classification
    iris_train <- tbl(con,"iris_train")
    td_xgboost_out2 <- td_xgboost_mle(data=iris_train,
                                  id.column='id',
                                  formula = (species ~ sepal_length + sepal_width +
                                                       petal_length + petal_width),
                                  num.boosted.trees=2,
                                  loss.function='softmax',
                                  reg.lambda=1,
                                  shrinkage.factor=0.1,
                                  iter.num=10,
                                  min.node.size=1,
                                  max.depth=10)

    # Example 3: Sparse Input Format. "response.column" argument is specified instead of formula.
    sparse_iris_train <- tbl(con,"sparse_iris_train")
    sparse_iris_attribute <- tbl(con,"sparse_iris_attribute")

    td_xgboost_out3 <- td_xgboost_mle(data=sparse_iris_train,
              attribute.table=sparse_iris_attribute,
              id.column='id',
              attribute.name.column='attribute',
              attribute.value.column='value_col',
              response.column="species",
              loss.function='SOFTMAX',
              reg.lambda=1,
              num.boosted.trees=2,
              shrinkage.factor=0.1,
              column.subsampling=1.0,
              iter.num=10,
              min.node.size=1,
              max.depth=10,
              variance=0,
              seed=1
              )