Teradata Package for R Function Reference | 17.00 - XGBoost

XGBoost

Description

The XGBoost function takes a training data set and uses gradient boosting to create a strong classifying model that can be input to the function XGBoostPredict (td_xgboost_predict_mle). The function supports input tables in both dense and sparse format.

Usage

  td_xgboost_mle (
      formula = NULL,
      data = NULL,
      id.column = NULL,
      loss.function = "SOFTMAX",
      prediction.type = "CLASSIFICATION",
      reg.lambda = 1,
      shrinkage.factor = 0.1,
      iter.num = 10,
      min.node.size = 1,
      max.depth = 5,
      variance = 0,
      seed = NULL,
      attribute.name.column = NULL,
      num.boosted.trees = NULL,
      attribute.table = NULL,
      attribute.value.column = NULL,
      column.subsampling = 1.0,
      response.column = NULL,
      data.sequence.column = NULL,
      attribute.table.sequence.column = NULL
  )

Arguments

`formula`	Required Argument when input data is in dense format. Specifies an object of class "formula" that describes the model to be fitted. Only basic formulas of the form (col1 ~ col2 + col3 + ...) are supported, and all variables must come from the same tbl_teradata object. The response should be a column of type real, numeric, integer or boolean. This argument is not supported for sparse format; for sparse data, provide this information using the "attribute.table" argument. Note: This argument should not be specified along with "response.column".
`data`	Required Argument. Specifies the tbl_teradata object containing the input data set. If the input data set is in dense format, the td_xgboost_mle function requires only "data".
`id.column`	Optional Argument. Specifies the name of the partitioning column of input tbl_teradata. This column is used as a row identifier to distribute data among different vworkers for parallel boosted trees. Types: character
`loss.function`	Optional Argument. Specifies the learning task and corresponding learning objective. Default Value: "SOFTMAX" Permitted Values: BINOMIAL, SOFTMAX Types: character
`prediction.type`	Optional Argument. Specifies whether the function predicts the result from the number of classes ('classification') or from a continuous response variable ('regression'). The function supports only 'classification'. Default Value: "CLASSIFICATION" Permitted Values: CLASSIFICATION Types: character
`reg.lambda`	Optional Argument. Specifies the L2 regularization that the loss function uses while boosting trees. The higher the lambda, the stronger the regularization effect. Default Value: 1 Types: numeric
`shrinkage.factor`	Optional Argument. Specifies the learning rate (weight) of a learned tree in each boosting step. After each boosting step, the algorithm multiplies the learner by the shrinkage factor to make the boosting process more conservative. The shrinkage factor is a DOUBLE PRECISION value in the range (0, 1]. The value 1 specifies no shrinkage. Default Value: 0.1 Types: numeric
`iter.num`	Optional Argument. Specifies the number of iterations to boost the weak classifiers, which is also the number of weak classifiers in the ensemble (T). The number must be an integer in the range [1, 100000]. Default Value: 10 Types: integer
`min.node.size`	Optional Argument. Specifies the minimum size of any particular node within each decision tree. Default Value: 1 Types: integer
`max.depth`	Optional Argument. Specifies the maximum depth of the tree. The "max.depth" must be in the range [1, 100000]. Default Value: 12 Types: integer
`variance`	Optional Argument. Specifies a decision tree stopping criterion. If the variance within any node dips below this value, the algorithm stops looking for splits in the branch. Default Value: 0 Types: numeric
`seed`	Optional Argument. Specifies the random seed the algorithm uses for repeatable results. If you omit this argument or specify its default value 1, the function uses a faster algorithm but does not ensure repeatability. This argument must be greater than or equal to 1. To ensure repeatability, specify a value greater than 1. Default Value: 1 Types: numeric
`attribute.name.column`	Optional Argument. Required if the input data set is in sparse format. Specifies the column containing the attributes in the input data set. Types: character
`num.boosted.trees`	Optional Argument. Specifies the number of boosted trees to be trained. By default, the number of boosted trees equals the number of vworkers available for the function. Types: integer
`attribute.table`	Optional Argument. Required argument for sparse data format. Specifies the name of the tbl_teradata containing the features in the input data. If the input data set is in sparse format, the function requires both "data" and "attribute.table" arguments.
`attribute.value.column`	Optional Argument. Required if the input data set is in sparse format. Specifies the column containing the attribute values in the input tbl_teradata. Types: character
`column.subsampling`	Optional Argument. Specifies the fraction of features to subsample during boosting. Default Value: 1.0 (no subsampling) Types: numeric
`response.column`	Required Argument when "formula" is not specified. Specifies the name of the input tbl_teradata column that contains the response variable for each data point in the training data set. Types: character
`data.sequence.column`	Optional Argument. Specifies the vector of column(s) that uniquely identifies each row of the input argument "data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run. Types: character OR vector of Strings (character)
`attribute.table.sequence.column`	Optional Argument. Specifies the vector of column(s) that uniquely identifies each row of the input argument "attribute.table". The argument is used to ensure deterministic results for functions which produce results that vary from run to run. Types: character OR vector of Strings (character)
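
The interplay between "shrinkage.factor" and "iter.num" can be illustrated with a small base R sketch. It is not the function's internal algorithm: it assumes an idealized weak learner that fits the current residual exactly, and the helper name remaining_residual is hypothetical. Under that assumption, each boosting step removes only a fraction of the residual equal to the shrinkage factor, which is why smaller values make boosting more conservative and usually call for more iterations.

```r
# Hypothetical helper (not part of the Teradata Package for R).
# Assumes an idealized weak learner that reproduces the current residual exactly.
remaining_residual <- function(initial, shrinkage.factor, iter.num) {
  residual <- initial
  for (step in seq_len(iter.num)) {
    learner <- residual                          # idealized weak learner output
    residual <- residual - shrinkage.factor * learner
  }
  residual
}

# Default-like settings: a shrinkage factor of 0.1 over 10 iterations.
conservative <- remaining_residual(1, shrinkage.factor = 0.1, iter.num = 10)

# A shrinkage factor of 1 (no shrinkage) removes the whole residual in one step.
no_shrinkage <- remaining_residual(1, shrinkage.factor = 1, iter.num = 10)

print(c(conservative = conservative, no_shrinkage = no_shrinkage))
```

In this idealized setting the residual after T steps is the initial residual times (1 - shrinkage.factor)^T, so about 35 percent of it remains after 10 steps at 0.1.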

Value

The function returns an object of class "td_xgboost_mle", which is a named list containing objects of class "tbl_teradata".
Named list members can be referenced directly with the "$" operator using the following names:

model.table
output

Examples

  
    # Get the current context/connection
    con <- td_get_context()$connection

    # Load example data.
    loadExampleData("xgboost_example", "housing_train_binary", "iris_train", "sparse_iris_train",
                    "sparse_iris_attribute")

    # Example 1: Binary Classification
    # Create object(s) of class "tbl_teradata".
    housing_train_binary <- tbl(con, "housing_train_binary")
    td_xgboost_out1 <- td_xgboost_mle(data=housing_train_binary,
              id.column='sn',
              formula = (homestyle ~ driveway + recroom + fullbase + gashw + airco + prefarea +
                                     price + lotsize + bedrooms + bathrms + stories + garagepl),
              num.boosted.trees=2,
              loss.function='binomial',
              prediction.type='classification',
              reg.lambda=1,
              shrinkage.factor=0.1,
              iter.num=10,
              min.node.size=1,
              max.depth=10
              )


    # Example 2: Multiple-Class Classification
    iris_train <- tbl(con,"iris_train")
    td_xgboost_out2 <- td_xgboost_mle(data=iris_train,
                                  id.column='id',
                                  formula = (species ~ sepal_length + sepal_width +
                                                       petal_length + petal_width),
                                  num.boosted.trees=2,
                                  loss.function='softmax',
                                  reg.lambda=1,
                                  shrinkage.factor=0.1,
                                  iter.num=10,
                                  min.node.size=1,
                                  max.depth=10)

    # Example 3: Sparse Input Format. "response.column" argument is specified instead of formula.
    sparse_iris_train <- tbl(con,"sparse_iris_train")
    sparse_iris_attribute <- tbl(con,"sparse_iris_attribute")

    td_xgboost_out3 <- td_xgboost_mle(data=sparse_iris_train,
               attribute.table=sparse_iris_attribute,
               id.column='id',
               attribute.name.column='attribute',
               attribute.value.column='value_col',
               response.column="species",
               loss.function='SOFTMAX',
               reg.lambda=1,
               num.boosted.trees=2,
               shrinkage.factor=0.1,
               column.subsampling=1.0,
               iter.num=10,
               min.node.size=1,
               max.depth=10,
               variance=0,
               seed=1
               )
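
Example 3 relies on input that is already in sparse format. As a rough illustration of how the two formats relate, the base R sketch below reshapes a tiny dense data frame into one row per (id, attribute) pair. The column names id, attribute, value_col and species are taken from Example 3; the toy values, the attribute_lookup frame and the exact layout of the real sparse_iris_train and sparse_iris_attribute tables are assumptions made for illustration only.

```r
# Toy dense data: one row per observation, one column per feature (made-up values).
dense <- data.frame(
  id = c(1L, 2L),
  sepal_length = c(5.1, 6.2),
  petal_length = c(1.4, 4.5),
  species = c(1L, 2L)
)
feature_cols <- c("sepal_length", "petal_length")

# Sparse (long) layout: one row per (id, attribute) pair,
# with the response repeated on every row of the same id.
sparse <- do.call(rbind, lapply(feature_cols, function(col) {
  data.frame(
    id = dense$id,
    attribute = col,
    value_col = dense[[col]],
    species = dense$species,
    stringsAsFactors = FALSE
  )
}))
sparse <- sparse[order(sparse$id, sparse$attribute), ]
rownames(sparse) <- NULL

# Hypothetical lookup of feature names, similar in role to "attribute.table".
attribute_lookup <- data.frame(attribute = feature_cols, stringsAsFactors = FALSE)

print(sparse)
```

With sparse input, "attribute.name.column" and "attribute.value.column" would point at the attribute and value_col columns, and "response.column" at species, as in Example 3.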