
Teradata® Package for R Function Reference

Deployment
VantageCloud, VantageCore
Edition
Enterprise, IntelliFlex, VMware
Product
Teradata Package for R
Release Number
17.20
Published
March 2024
Product Category
Teradata Vantage

XGBoost

Description

The td_xgboost_sqle() function implements eXtreme Gradient Boosting (XGBoost), a gradient boosted decision tree algorithm designed for speed and performance that has recently come to dominate applied machine learning.

In gradient boosting, each iteration fits a model to the residuals (errors) of the previous iteration to correct the errors made by the existing models. The predicted residual is multiplied by a learning rate (the shrinkage factor) and added to the previous prediction. Models are added sequentially until no further improvement can be made.
It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.
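
The update in each boosting step can be sketched in a few lines of plain R. This is a conceptual illustration only, not the in-database implementation; "shrinkage" here plays the role of the "shrinkage.factor" argument described below.

    y          <- c(10, 20, 30)    # observed target values
    prediction <- c(12, 18, 33)    # current ensemble prediction
    residual   <- y - prediction   # errors the next weak learner fits
    shrinkage  <- 0.1              # learning rate
    # Suppose the new weak learner predicts the residuals exactly:
    prediction <- prediction + shrinkage * residual
    print(prediction)              # 11.8 18.2 32.7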

Gradient boosting involves three elements:

  • A loss function to be optimized.

  • A weak learner to make predictions.

  • An additive model to add weak learners to minimize the loss function.

The loss function used depends on the type of problem being solved. For example, regression may use a squared error loss and binary classification may use a binomial loss. A benefit of gradient boosting is that a new boosting algorithm does not have to be derived for each loss function; instead, the framework is generic enough that any differentiable loss function can be used. The td_xgboost_sqle() function supports both regression and classification predictive modeling problems. The model that it creates is used by the td_xgboost_predict_sqle() function for making predictions.

The td_xgboost_sqle() function supports the following features.

  • Regression

  • Multi-class and binary classification

Notes:

  • When a dataset is small, best practice is to distribute the data to one AMP. To do this, create an identifier column as the primary index and use the same value for each row (see the sketch after these notes).

  • For Classification (softmax), a maximum of 500 classes is supported.

  • For Classification, when creating the tbl_teradata for the function input, the tbl_teradata columns must have deterministic output. Otherwise, the function may not run successfully or may not return the correct output.

  • The processing time is controlled by (proportional to):

      • The number of boosted trees (controlled by "num.boosted.trees", "tree.size", and "coverage.factor").

      • The number of iterations (sub-trees) in each boosted tree (controlled by "iter.num").

      • The complexity of an iteration (controlled by "max.depth", "min.node.size", "column.sampling", and "min.impurity").

    A careful choice of these parameters can be used to control the processing time. For example, changing "coverage.factor" from 1.0 to 2.0 doubles the number of boosted trees, which roughly doubles the execution time.
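
A minimal sketch of the small-data note above ("con" is the connection from the Examples section; the "primary.index" argument of tdplyr's copy_to() is assumed to be available in your version):

    # Add a constant identifier column and use it as the primary index
    # so that every row lands on the same AMP.
    library(dplyr)
    small_local <- data.frame(x1 = runif(50), x2 = runif(50), y = runif(50))
    small_local$amp_id <- 1L       # same value for each row
    small_tbl <- copy_to(con, small_local, name = "small_train",
                         primary.index = "amp_id", overwrite = TRUE)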

Usage

  td_xgboost_sqle (
      formula = NULL,
      data = NULL,
      input.columns = NULL,
      response.column = NULL,
      max.depth = 5,
      min.node.size = 1,
      seed = 1,
      model.type = 'REGRESSION',
      coverage.factor = 1.0,
      min.impurity = 0.0,
      lambda1 = 100000,
      shrinkage.factor = 0.1,
      column.sampling = 1.0,
      iter.num = 10,
      num.boosted.trees = -1,
      tree.size = -1,
      ...
  )

Arguments

formula

Required Argument when "input.columns" and "response.column" are not provided; optional otherwise.
Specifies a string containing the "formula" for the model to be fitted.
Only basic formulas of the "col1 ~ col2 + col3 + ..." form are supported, and all variables must come from the input tbl_teradata.
Notes:

  • The function only accepts numeric features. Users must convert categorical features to numeric values before passing them to the formula.

  • If categorical features are passed to the formula, they are ignored and only numeric features are considered.

  • Provide either "formula" argument or "input.columns" and "response.column" arguments.

Types: character

data

Required Argument.
Specifies the tbl_teradata containing the input data.
Types: tbl_teradata

input.columns

Required Argument when "formula" is not provided, optional otherwise.
Specifies the name(s) of the tbl_teradata column(s) that need to be used for training the model (predictors, features, or independent variables).
Note:

  • Input column names with double quotation marks are not allowed for this function.

Types: character OR vector of Strings (character)

response.column

Required Argument when "formula" is not provided, optional otherwise.
Specifies the name of the column that contains the class label for classification or target value (dependent variable) for regression.
Types: character

max.depth

Optional Argument.
Specifies a decision tree stopping criterion. If the tree reaches a depth past this value, the algorithm stops looking for splits.
Decision trees can grow up to (2^(max.depth + 1) - 1) nodes; for example, the default value of 5 allows at most 63 nodes per tree. This stopping criterion has the greatest effect on the performance of the function.
The maximum value is 2147483647.
Note:

  • The "max.depth" must be in the range [1, 2147483647].

Default Value: 5
Types: integer

num.boosted.trees

Optional Argument.
Specifies the number of parallel boosted trees. Each boosted tree operates on a sample of data that fits in an AMP's memory. By default, the value is chosen equal to the number of AMPs with data. If "num.boosted.trees" is greater than the number of AMPs with data, each boosting operates on a sample of the input data, and the function estimates the sample size (number of rows) with this formula:
sample_size = total_number_of_input_rows / number_of_trees
For example, with 1,000,000 input rows and "num.boosted.trees" set to 10, each tree operates on a sample of roughly 100,000 rows.
The sample_size must fit in an AMP's memory. The function always uses a sample size (or tree size) that fits in an AMP's memory to build tree models and ignores rows that cannot fit in memory. A higher "num.boosted.trees" value may improve function run time but may decrease prediction accuracy.
Note:

  • The "num_boosted_trees" must be in the range [-1, 10000]

Default Value: -1
Types: integer

min.node.size

Optional Argument.
Specifies a decision tree stopping criterion, which is the minimum size of any node within each decision tree.
Note:

  • The "min.node.size" must be in the range [1, 2147483647].

Default Value: 1
Types: integer

seed

Optional Argument.
Specifies an integer value to use in determining the random seed for column sampling.
Note:

  • The "seed" must be in the range [-2147483648, 2147483647].

Default Value: 1
Types: integer

model.type

Optional Argument.
Specifies whether the analysis is a regression (continuous response variable) or a multi-class classification (predicting the result from a number of classes).
Default Value: Regression
Permitted Values:

  • Regression

  • Classification

Types: character

coverage.factor

Optional Argument.
Specifies the level of coverage for the dataset while boosting trees, in percentage (for example, 1.25 = 125%). It can only be used when "num.boosted.trees" is not supplied. When "num.boosted.trees" is specified, coverage depends on the value of "num.boosted.trees".
If "num.boosted.trees" is not specified, it is chosen to achieve the level of coverage specified by "coverage.factor".
Note:

  • The "seed" must be in the range (0, 10.0].

Default Value: 1.0
Types: float OR integer

min.impurity

Optional Argument.
Specifies the minimum impurity at which the tree stops splitting further down.
For regression, a squared error criterion is used, whereas for classification, Gini impurity is used.
Note:

  • The "min.impurity" must be in the range [0.0, 1.79769313486231570815e+308].

Default Value: 0.0
Types: float OR integer

lambda1

Optional Argument.
Specifies the L2 regularization that the loss function uses while boosting trees.
The higher the lambda, the stronger the regularization effect.
Notes:

  • The "lambda1" must be in the range [0, 100000].

  • The value 0 specifies no regularization.

Default Value: 100000
Types: float OR integer

shrinkage.factor

Optional Argument.
Specifies the learning rate (weight) of a learned tree in each boosting step.
After each boosting step, the algorithm multiplies the learner by the shrinkage factor to make the boosting process more conservative.
Notes:

  • The "shrinkage.factor" is a DOUBLE PRECISION value in the range (0, 1].

  • The value 1 specifies no shrinkage.

Default Value: 0.1
Types: float

column.sampling

Optional Argument.
Specifies the fraction of features to sample during boosting.
Note:

  • The "column.sampling" must be in the range (0, 1].

Default Value: 1.0
Types: float

iter.num

Optional Argument.
Specifies the number of iterations (rounds) to boost the weak classifiers.
Note:

  • The "iter.num" must be in the range [1, 100000].

Default Value: 10
Types: integer

tree.size

Optional Argument.
Specifies the number of rows that each tree uses as its input data set.
The function builds a tree using either the number of rows given by the "tree.size" argument or, by default, the smaller of the number of rows on an AMP and the number of rows that fit into the AMP's memory.
Note:

  • The "tree_size" must be in the range [-1, 2147483647].

Default Value: -1
Types: integer

...

Specifies the generic keyword arguments that SQLE functions accept. The generic keyword arguments are:

persist:
Optional Argument.
Specifies whether to persist the results of the function in a table or not. When set to TRUE, results are persisted in a table; otherwise, results are garbage collected at the end of the session.
Default Value: FALSE
Types: logical

volatile:
Optional Argument.
Specifies whether to put the results of the function in a volatile table or not. When set to TRUE, results are stored in a volatile table; otherwise, they are not.
Default Value: FALSE
Types: logical

The function allows the user to partition, hash, order, or local-order the input data. These generic arguments are available for each argument that accepts tbl_teradata as input and can be accessed as:

  • "<input.data.arg.name>.partition.column" accepts character or vector of character (Strings)

  • "<input.data.arg.name>.hash.column" accepts character or vector of character (Strings)

  • "<input.data.arg.name>.order.column" accepts character or vector of character (Strings)

  • "local.order.<input.data.arg.name>" accepts logical

Note:
These generic arguments are supported by tdplyr if the underlying SQLE Engine function supports them; otherwise, an exception is raised.
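
For example, a hedged sketch combining two of these generic arguments (assuming the underlying SQLE function supports them; "passenger" is the row identifier column in the titanic example data used in the Examples section):

    XGBoost_persisted <- td_xgboost_sqle(
            data = titanic,
            input.columns = c("age", "survived", "pclass"),
            response.column = "fare",
            persist = TRUE,
            data.order.column = "passenger")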

Value

Function returns an object of class "td_xgboost_sqle", which is a named list containing objects of class "tbl_teradata".
Named list member(s) can be referenced directly with the "$" operator using the name(s):

  1. result

  2. output.data

Examples

  
    
    # Get the current context/connection.
    con <- td_get_context()$connection
    
    # Load the example data.
    loadExampleData("tdplyr_example", "titanic", "iris_input")
    
    # Create tbl_teradata object.
    titanic <- tbl(con, "titanic")
    iris_input <- tbl(con, "iris_input")
    
    # Check the list of available analytic functions.
    display_analytic_functions()
    
    # Example 1: Train the model using features 'age', 'survived' and 'pclass'
    #            with 'fare' as the target value.
    XGBoost_out_1 <- td_xgboost_sqle(
                            data=titanic,
                            input.columns=c("age", "survived", "pclass"),
                            response.column = 'fare',
                            max.depth=3,
                            lambda1 = 1000.0,
                            model.type='Regression',
                            seed=-1,
                            shrinkage.factor=0.1,
                            iter.num=2)
    
    # Print the result.
    print(XGBoost_out_1$result)
    print(XGBoost_out_1$output.data)
    
    # Example 2: Improve the function run time by specifying a "num.boosted.trees"
    #            value greater than the number of AMPs.
    XGBoost_out_2 <- td_xgboost_sqle(
                            data=titanic,
                            input.columns=c("age", "survived", "pclass"),
                            response.column = 'fare',
                            max.depth=3,
                            lambda1 = 1000.0,
                            model.type='Regression',
                            seed=-1,
                            shrinkage.factor=0.1,
                            num.boosted.trees=10,
                            iter.num=2)
    
    # Print the result.
    print(XGBoost_out_2$result)
    print(XGBoost_out_2$output.data)
    
    # Example 3: Train the model using the titanic input and the provided "formula".
    formula <- fare ~ age + survived + pclass
    XGBoost_out_3 <- td_xgboost_sqle(
                            data=titanic,
                            formula=formula,
                            max.depth=3,
                            lambda1 = 10000.0,
                            model.type='Regression',
                            seed=-1,
                            shrinkage.factor=0.1,
                            iter.num=2)
    
    # Print the result.
    print(XGBoost_out_3$result)
    print(XGBoost_out_3$output.data)
    
    # Example 4: Train the model using features 'sepal_length', 'sepal_width',
    #            'petal_length' and 'petal_width', with 'species' as the target
    #            value and model type as 'Classification'.
    XGBoost_out_4 <- td_xgboost_sqle(
                            data=iris_input,
                            input.columns=c('sepal_length', 'sepal_width',
                                            'petal_length', 'petal_width'),
                            response.column = 'species',
                            max.depth=3,
                            lambda1 = 10000.0,
                            model.type='Classification',
                            seed=-1,
                            shrinkage.factor=0.1,
                            iter.num=2)
    
    # Print the result.
    print(XGBoost_out_4$result)
    print(XGBoost_out_4$output.data)
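    
    # Example 5 (sketch): Score data with the trained model using
    # td_xgboost_predict_sqle(). The argument names shown ("newdata",
    # "object", "id.column") follow the pattern of other tdplyr SQLE
    # predict functions and are assumptions; check them against your
    # tdplyr version.
    XGBoost_pred <- td_xgboost_predict_sqle(
                            newdata = iris_input,
                            object = XGBoost_out_4,
                            id.column = "id",
                            model.type = 'Classification')
    
    # Print the prediction result.
    print(XGBoost_pred$result)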