Teradata R Package Function Reference - 16.20 - XGBoost - Teradata R Package

Teradata® R Package Function Reference

Product: Teradata R Package
Release: 16.20
Created: February 2020
Category: Programming Reference
Feature number: B700-4007-098K

Description

The XGBoost function takes a training data set and uses gradient boosting to create a strong classifying model that can be input to the function XGBoostPredict. The function supports input tables in both dense and sparse format.

Usage

  td_xgboost_mle (
      formula = NULL,
      data = NULL,
      id.column = NULL,
      loss.function = "SOFTMAX",
      prediction.type = "CLASSIFICATION",
      reg.lambda = 1,
      shrinkage.factor = 0.1,
      iter.num = 10,
      min.node.size = 1,
      max.depth = 12,
      variance = 0,
      seed = 1,
      attribute.name.column = NULL,
      num.boosted.trees = NULL,
      attribute.table = NULL,
      attribute.value.column = NULL,
      column.subsampling = 1.0,
      response.column = NULL,
      data.sequence.column = NULL,
      attribute.table.sequence.column = NULL
  )

Arguments

formula

Required Argument when input data is in dense format.
Specifies an object of class "formula" that describes the model to be fitted. Only basic formulas of the form (col1 ~ col2 + col3 + ...) are supported, and all variables must come from the same tbl_teradata object. The response should be a column of type real, numeric, integer or boolean. This argument is not supported for sparse format; for sparse data, provide this information using the "attribute.table" argument.
Note: This argument should not be specified along with "response.column".
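
When the list of predictor columns is long or generated at run time, a formula of the supported form can be assembled with base R instead of typed by hand. The sketch below uses stats::reformulate; the column names are assumptions based on the iris_train example table and should be replaced with the columns of your own tbl_teradata.

```r
# Assumed column names (taken from the iris_train example table).
predictors <- c("sepal_length", "sepal_width", "petal_length", "petal_width")
response   <- "species"

# The response must not also appear among the predictors.
stopifnot(!(response %in% predictors), !anyDuplicated(predictors))

# Builds: species ~ sepal_length + sepal_width + petal_length + petal_width
fml <- reformulate(predictors, response = response)
print(fml)
```

The resulting object can then be passed as the "formula" argument.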

data

Required Argument.
Specifies the tbl_teradata object containing the input data set. If the input data set is in dense format, the td_xgboost_mle function requires only "data".

id.column

Optional Argument.
Specifies the name of the partitioning column of the input table. This column is used as a row identifier to distribute data among different vworkers for parallel boosted trees.

loss.function

Optional Argument.
Specifies the learning task and corresponding learning objective.
Default Value: "SOFTMAX"
Permitted Values: BINOMIAL, SOFTMAX

prediction.type

Optional Argument.
Specifies whether the function predicts a class label ('classification') or a continuous response variable ('regression'). The function supports only 'classification'.
Default Value: "CLASSIFICATION"
Permitted Values: CLASSIFICATION

reg.lambda

Optional Argument.
Specifies the L2 regularization that the loss function uses while boosting trees. The higher the lambda, the stronger the regularization effect.
Default Value: 1

shrinkage.factor

Optional Argument.
Specifies the learning rate (weight) of a learned tree in each boosting step. After each boosting step, the algorithm multiplies the learner by shrinkage to make the boosting process more conservative. The shrinkage is a DOUBLE PRECISION value in the range [0, 1].
The value 1 specifies no shrinkage.
Default Value: 0.1
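
The effect of shrinkage can be illustrated with a toy calculation in plain R. This is not the algorithm that td_xgboost_mle runs; it assumes an idealised weak learner that predicts the current residual exactly, so that only the scaling by the shrinkage factor is visible. The helper name shrink_residual is hypothetical.

```r
# Toy illustration: each step removes only `shrinkage` times the residual.
shrink_residual <- function(initial_residual, shrinkage, iterations) {
  residual <- initial_residual
  for (i in seq_len(iterations)) {
    step <- residual                         # idealised learner output
    residual <- residual - shrinkage * step  # learner scaled by shrinkage
  }
  residual
}

# With the defaults (shrinkage.factor = 0.1, iter.num = 10) the residual
# decays as 0.9^10; with shrinkage 1 a single step removes it entirely.
shrink_residual(1, 0.1, 10)
shrink_residual(1, 1, 1)
```

Smaller shrinkage values therefore generally need more iterations to reach the same fit, which is why "shrinkage.factor" and "iter.num" are usually tuned together.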

iter.num

Optional Argument.
Specifies the number of iterations to boost the weak classifiers, which is also the number of weak classifiers in the ensemble (T). The number must be numeric and in the range [1, 100000].
Default Value: 10

min.node.size

Optional Argument.
Specifies the minimum size of any particular node within each decision tree. The min.node.size must be numeric.
Default Value: 1

max.depth

Optional Argument.
Specifies the maximum depth of the tree. The max.depth must be numeric and in the range [1, 100000].
Default Value: 12

variance

Optional Argument.
Specifies a decision-tree stopping criterion, the minimum variance for any node. If the variance within any node becomes less than variance, the algorithm stops looking for splits. This argument is a nonnegative DOUBLE PRECISION value.
Default Value: 0

seed

Optional Argument.
Specifies the random seed the algorithm uses for repeatable results. This argument must have a LONG value greater than or equal to 1. If you omit this argument or specify its default value 1, the function uses a faster algorithm but does not ensure repeatability. To ensure repeatability, specify a value greater than 1.
Default Value: 1
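
The ranges documented for "shrinkage.factor", "iter.num", "max.depth", "variance" and "seed" can be checked on the client before a call is submitted. The helper below is a hypothetical convenience function, not part of the Teradata R package; it only encodes the limits stated in this reference.

```r
# Hypothetical client-side check of the documented argument ranges.
validate_xgboost_args <- function(shrinkage.factor = 0.1, iter.num = 10,
                                  max.depth = 12, variance = 0, seed = 1) {
  stopifnot(
    is.numeric(shrinkage.factor), shrinkage.factor >= 0, shrinkage.factor <= 1,
    is.numeric(iter.num), iter.num >= 1, iter.num <= 100000,
    is.numeric(max.depth), max.depth >= 1, max.depth <= 100000,
    is.numeric(variance), variance >= 0,
    is.numeric(seed), seed >= 1
  )
  if (seed == 1) {
    # Documented behaviour: the default seed does not ensure repeatability.
    message("seed = 1: results are not guaranteed to be repeatable")
  }
  invisible(TRUE)
}

validate_xgboost_args(shrinkage.factor = 0.1, iter.num = 10, seed = 42)
```

An out-of-range value, for example shrinkage.factor = 1.5, stops with an error before any database work is started.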

attribute.name.column

Optional Argument.
Required if the input data set is in sparse format. Specifies the name of the input table column that contains the names of the attributes of the input data set.

num.boosted.trees

Optional Argument.
Specifies the number of boosted trees to be trained. By default, the number of boosted trees equals the number of vworkers available to the function.

attribute.table

Optional Argument.
Required argument for sparse data format. Specifies the name of the tbl_teradata containing the features in the input data.
If the input data set is in sparse format, the function requires both "data" and "attribute.table" arguments.

attribute.value.column

Optional Argument.
Required if the input data set is in sparse format.
Specifies the name of the input table column that contains the values of the attributes of the input data set.
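
To make the sparse layout concrete, the sketch below reshapes a small local data frame from dense form into an attribute table with one row per (id, attribute, value) triple, plus a separate response table. It uses ordinary data frames, not tbl_teradata objects, and the layout and column names (id, attribute, value_col, species) are assumptions inferred from the sparse example tables used later in this page.

```r
# Dense form: one row per observation, one column per feature.
dense <- data.frame(
  id           = 1:2,
  sepal_length = c(5.1, 4.9),
  sepal_width  = c(3.5, 3.0),
  species      = c(1, 1)
)
feature_cols <- c("sepal_length", "sepal_width")

# Sparse form: one row per (id, attribute name, attribute value).
attribute_table <- do.call(rbind, lapply(feature_cols, function(col) {
  data.frame(id = dense$id, attribute = col, value_col = dense[[col]],
             stringsAsFactors = FALSE)
}))

# The response stays in its own table, keyed by the same id.
response_table <- dense[, c("id", "species")]

print(attribute_table)
```

In this layout, "attribute" would be named by "attribute.name.column" and "value_col" by "attribute.value.column".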

column.subsampling

Optional Argument.
Specifies the fraction of features to subsample during boosting.
Default Value: 1.0 (no subsampling)

response.column

Required Argument when "formula" is not specified.
Specifies the name of the input table column that contains the response variable for each data point in the training data set.

data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.

attribute.table.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "attribute.table". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.

Value

The function returns an object of class "td_xgboost_mle", which is a named list containing Teradata tbl objects.
Named list members can be referenced directly with the "$" operator using the following names:

  1. model.table

  2. output
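
A minimal sketch of how the two members might be pulled out of a result follows. The helper extract_xgboost_result is hypothetical; it relies only on the class name and member names listed above.

```r
# Hypothetical helper: checks the class and returns both documented members.
extract_xgboost_result <- function(result) {
  stopifnot(inherits(result, "td_xgboost_mle"))
  members <- c("model.table", "output")
  absent <- setdiff(members, names(result))
  if (length(absent) > 0) {
    stop("missing member(s): ", paste(absent, collapse = ", "))
  }
  list(model = result$model.table, output = result$output)
}
```

With a fitted object such as td_xgboost_out1 from the Examples section, extract_xgboost_result(td_xgboost_out1)$model would give the model table to pass on for prediction.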

Examples

    # Get the current context/connection
    con <- td_get_context()$connection

    # Load example data.
    loadExampleData("xgboost_example", "housing_train_binary","iris_train","sparse_iris_train","sparse_iris_attribute")
    
    #Example 1: Binary Classification
    # Create remote tibble objects.
    housing_train_binary <- tbl(con, "housing_train_binary")
    td_xgboost_out1 <- td_xgboost_mle(data=housing_train_binary,
              id.column='sn',
              formula = ( homestyle ~ driveway + recroom + fullbase + gashw + airco + prefarea + price + lotsize + bedrooms + bathrms + stories + garagepl ),
              num.boosted.trees=2,
              loss.function='binomial',
              prediction.type='classification',
              reg.lambda=1,
              shrinkage.factor=0.1,
              iter.num=10,
              min.node.size=1,
              max.depth=10
              )
              
    
    #Example 2: Multiple-Class Classification
    iris_train <- tbl(con,"iris_train")
    td_xgboost_out2 <- td_xgboost_mle(data=iris_train,
                                  id.column='id',
                                  formula = ( species ~ sepal_length + sepal_width + petal_length + petal_width ),
                                  num.boosted.trees=2,
                                  loss.function='softmax',
                                  reg.lambda=1,
                                  shrinkage.factor=0.1,
                                  iter.num=10,
                                  min.node.size=1,
                                  max.depth=10)
                                  
    # Example 3: Sparse Input Format. response.column argument is specified instead of formula.
    sparse_iris_train <- tbl(con,"sparse_iris_train")
    sparse_iris_attribute <- tbl(con,"sparse_iris_attribute")
    
    td_xgboost_out3 <- td_xgboost_mle(data=sparse_iris_train,
               attribute.table=sparse_iris_attribute,
               id.column='id',
               attribute.name.column='attribute',
               attribute.value.column='value_col',
               response.column="species",
               loss.function='SOFTMAX',
               reg.lambda=1,
               num.boosted.trees=2,
               shrinkage.factor=0.1,
               column.subsampling=1.0,
               iter.num=10,
               min.node.size=1,
               max.depth=10,
               variance=0,
               seed=1
               )
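
The examples above either omit "seed" or use its default of 1, which according to the argument description does not ensure repeatability. A variant that asks for repeatable results might look like the sketch below. It is wrapped in a hypothetical function, fit_repeatable, so that defining it does not contact the database; the column names assume the iris_train example table, and whether this combination is sufficient in a given environment has not been verified here.

```r
# Sketch only: `train_tbl` is assumed to be a tbl_teradata with the
# iris_train columns (id, sepal_length, sepal_width, petal_length,
# petal_width, species).
fit_repeatable <- function(train_tbl) {
  td_xgboost_mle(
    data = train_tbl,
    id.column = "id",
    formula = species ~ sepal_length + sepal_width + petal_length + petal_width,
    loss.function = "softmax",
    num.boosted.trees = 2,
    iter.num = 10,
    max.depth = 10,
    seed = 42,                   # any value greater than 1
    data.sequence.column = "id"  # column(s) that uniquely identify each row
  )
}
```

Calling fit_repeatable(tbl(con, "iris_train")) twice would then be expected to train the same model, under the assumptions stated above.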