XGBoost
Description
The td_xgboost_sqle()
function, also known as eXtreme Gradient Boosting, is an implementation
of the gradient boosted decision tree algorithm designed for speed and performance.
It has become one of the most widely used algorithms in applied machine learning.
In gradient boosting, each iteration fits a model to the residuals (errors) of the
previous iteration to correct the errors made by the existing models. The predicted
residual is multiplied by a learning rate (the shrinkage factor) and then added to the
previous prediction. Models are added sequentially until no further improvements can
be made. It is called gradient boosting because it uses a gradient descent algorithm
to minimize the loss when adding new models.
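The following plain-R sketch (an illustration only, not part of the td_xgboost_sqle() API) shows this additive update; the made-up data and the use of rpart decision stumps as the weak learner are assumptions for the sketch:
# Illustration: one boosted tree built by repeatedly fitting stumps to
# residuals and applying a shrinkage (learning) rate.
library(rpart)
df <- data.frame(x = 1:20, y = (1:20)^1.5 + rnorm(20))
pred <- rep(mean(df$y), nrow(df))   # initial prediction
shrinkage <- 0.1                    # learning rate
for (i in 1:10) {
  df$resid <- df$y - pred           # residuals of the current model
  stump <- rpart(resid ~ x, data = df,
                 control = rpart.control(maxdepth = 1))
  pred <- pred + shrinkage * predict(stump, df)  # additive update
}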
Gradient boosting involves three elements:
A loss function to be optimized.
A weak learner to make predictions.
An additive model to add weak learners to minimize the loss function.
The loss function used depends on the type of problem being solved. For example, regression
may use squared error and binary classification may use a binomial loss. A benefit of gradient
boosting is that a new boosting algorithm does not have to be derived for each loss function.
Instead, it provides a framework generic enough that any differentiable loss function can be
used. The td_xgboost_sqle() function supports both regression and classification predictive
modeling problems. The model that it creates is used in the td_xgboost_predict_sqle() function
for making predictions.
The td_xgboost_sqle() function supports the following features:
Regression
Multi-class and binary classification
Notes:
When a dataset is small, best practice is to distribute the data to one AMP. To do this, create an identifier column as a primary index, and use the same value for each row (see the sketch after these notes).
For Classification (softmax), a maximum of 500 classes are supported.
For Classification, while creating the tbl_teradata for the function input, the tbl_teradata columns must have deterministic output. Otherwise, the function may not run successfully or return the correct output.
The processing time is controlled by (proportional to):
* The number of boosted trees (controlled by "num.boosted.trees", "tree.size", and "coverage.factor").
* The number of iterations (sub-trees) in each boosted tree (controlled by "iter.num").
* The complexity of an iteration (controlled by "max.depth", "min.node.size", "column.sampling", and "min.impurity").
A careful choice of these parameters can be used to control the processing time. For example, changing "coverage.factor" from 1.0 to 2.0 doubles the number of boosted trees, which roughly doubles the execution time.
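The one-AMP layout described in the first note can be sketched with plain Teradata SQL through DBI; the table name small_train and the constant identifier column part_id are hypothetical:
# Hypothetical sketch: place a small table on a single AMP by adding a
# constant identifier column and making it the primary index.
DBI::dbExecute(con, "
  CREATE TABLE small_train AS (
    SELECT t.*, 1 AS part_id FROM titanic t
  ) WITH DATA PRIMARY INDEX (part_id);")
small_train <- tbl(con, "small_train")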
Usage
td_xgboost_sqle (
formula = NULL,
data = NULL,
input.columns = NULL,
response.column = NULL,
max.depth = 5,
num.boosted.trees = -1,
min.node.size = 1,
seed = 1,
model.type = 'REGRESSION',
coverage.factor = 1.0,
min.impurity = 0.0,
lambda1 = 100000,
shrinkage.factor = 0.1,
column.sampling = 1.0,
iter.num = 10,
tree.size = -1,
...
)
Arguments
formula |
Required Argument when "input.columns" and "response.column" are not provided,
optional otherwise.
Types: character |
data |
Required Argument. |
input.columns |
Required Argument when "formula" is not provided, optional otherwise.
Types: character OR vector of Strings (character) |
response.column |
Required Argument when "formula" is not provided, optional otherwise. |
max.depth |
Optional Argument.
Default Value: 5 |
num.boosted.trees |
Optional Argument.
Default Value: -1 |
min.node.size |
Optional Argument.
Default Value: 1 |
seed |
Optional Argument.
Default Value: 1 |
model.type |
Optional Argument.
Types: character |
coverage.factor |
Optional Argument.
Default Value: 1.0 |
min.impurity |
Optional Argument.
Default Value: 0.0 |
lambda1 |
Optional Argument.
Default Value: 100000 |
shrinkage.factor |
Optional Argument.
Default Value: 0.1 |
column.sampling |
Optional Argument.
Default Value: 1.0 |
iter.num |
Optional Argument.
Default Value: 10 |
tree.size |
Optional Argument.
Default Value: -1 |
... |
Specifies the generic keyword arguments SQLE functions accept. Below are the generic keyword arguments:
volatile: Specifies whether to put the results of the function in a volatile table.
The function allows the user to partition, hash, order, or local order the input data. These generic arguments are available for each argument that accepts tbl_teradata as input and can be accessed as:
* "<input.data.arg.name>.partition.column"
* "<input.data.arg.name>.hash.column"
* "<input.data.arg.name>.order.column"
* "local.order.<input.data.arg.name>"
Note: These generic arguments are supported if the underlying SQL Engine function supports them; otherwise an exception is raised. |
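As a hedged illustration of passing one of these generic arguments (here "volatile", named in the text above; the remaining arguments mirror Example 1 below):
# Illustration: put the function results in a volatile table.
XGBoost_volatile <- td_xgboost_sqle(
                      data=titanic,
                      input.columns=c("age", "survived", "pclass"),
                      response.column='fare',
                      volatile=TRUE)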
Value
Function returns an object of class "td_xgboost_sqle"
which is a named list containing objects of class "tbl_teradata".
Named list member(s) can be referenced directly with the "$" operator
using the name(s):
result
output.data
Examples
# Get the current context/connection.
con <- td_get_context()$connection
# Load the example data.
loadExampleData("tdplyr_example", "titanic", "iris_input")
# Create tbl_teradata object.
titanic <- tbl(con, "titanic")
iris_input <- tbl(con, "iris_input")
# Check the list of available analytic functions.
display_analytic_functions()
# Example 1: Train the model using features 'age', 'survived' and 'pclass'
# with 'fare' as the target value.
XGBoost_out_1 <- td_xgboost_sqle(
data=titanic,
input.columns=c("age", "survived", "pclass"),
response.column = 'fare',
max.depth=3,
lambda1 = 1000.0,
model.type='Regression',
seed=-1,
shrinkage.factor=0.1,
iter.num=2)
# Print the result.
print(XGBoost_out_1$result)
print(XGBoost_out_1$output.data)
# Example 2: Improve the function run time by specifying a "num.boosted.trees"
# value greater than the number of AMPs.
XGBoost_out_2 <- td_xgboost_sqle(
data=titanic,
input.columns=c("age", "survived", "pclass"),
response.column = 'fare',
max.depth=3,
lambda1 = 1000.0,
model.type='Regression',
seed=-1,
shrinkage.factor=0.1,
num.boosted.trees=10,
iter.num=2)
# Print the result.
print(XGBoost_out_2$result)
print(XGBoost_out_2$output.data)
# Example 3: Train the model using the titanic input and the provided "formula".
formula <- fare ~ age + survived + pclass
XGBoost_out_3 <- td_xgboost_sqle(
data=titanic,
formula=formula,
max.depth=3,
lambda1 = 10000.0,
model.type='Regression',
seed=-1,
shrinkage.factor=0.1,
iter.num=2)
# Print the result.
print(XGBoost_out_3$result)
print(XGBoost_out_3$output.data)
# Example 4: Train the model using features 'sepal_length', 'sepal_width',
# 'petal_length' and 'petal_width', with 'species' as the target value
# and model type as 'Classification'.
XGBoost_out_4 <- td_xgboost_sqle(
data=iris_input,
input.columns=c('sepal_length', 'sepal_width',
'petal_length', 'petal_width'),
response.column = 'species',
max.depth=3,
lambda1 = 10000.0,
model.type='Classification',
seed=-1,
shrinkage.factor=0.1,
iter.num=2)
# Print the result.
print(XGBoost_out_4$result)
print(XGBoost_out_4$output.data)
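# The model trained in Example 4 can be scored with td_xgboost_predict_sqle().
# Sketch only: the argument names used here ("newdata", "object", "id.column",
# "model.type") and the 'id' column in iris_input are assumptions; check them
# against the td_xgboost_predict_sqle() documentation.
XGBoost_pred <- td_xgboost_predict_sqle(
                  newdata=iris_input,
                  object=XGBoost_out_4,
                  id.column='id',
                  model.type='Classification')
# Print the predictions.
print(XGBoost_pred$result)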