Description
The XGBoost function takes a training data set and uses gradient
boosting to create a strong classifying model that can be input
to the function XGBoostPredict. The function supports input
tables in both dense and sparse format.
Usage
td_xgboost_mle (
formula = NULL,
data = NULL,
id.column = NULL,
loss.function = "SOFTMAX",
prediction.type = "CLASSIFICATION",
reg.lambda = 1,
shrinkage.factor = 0.1,
iter.num = 10,
min.node.size = 1,
max.depth = 12,
variance = 0,
seed = 1,
attribute.name.column = NULL,
num.boosted.trees = NULL,
attribute.table = NULL,
attribute.value.column = NULL,
column.subsampling = 1.0,
response.column = NULL,
data.sequence.column = NULL,
attribute.table.sequence.column = NULL
)
Arguments
formula |
Required Argument when input data is in dense format.
Specifies an object of class "formula" that describes the model to be fitted.
Only basic formulas of the form (col1 ~ col2 + col3 + ...) are supported, and
all variables must be from the same tbl_teradata object. The
response should be a column of type real, numeric, integer or boolean.
This argument is not supported for sparse format. For sparse data format,
provide this information using the "attribute.table" argument.
Note: This argument should not be specified along with "response.column".
|
data |
Required Argument.
Specifies the tbl_teradata object containing the input data set.
If the input data set is in dense format, the td_xgboost_mle function requires only "data".
|
id.column |
Optional Argument.
Specifies the name of the partitioning column of the input table. This
column is used as a row identifier to distribute data among different
vworkers for parallel boosted trees.
|
loss.function |
Optional Argument.
Specifies the learning task and corresponding learning objective.
Default Value: "SOFTMAX"
Permitted Values: BINOMIAL, SOFTMAX
|
prediction.type |
Optional Argument.
Specifies whether the function predicts the result from the number of classes
('classification') or from a continuous response variable ('regression').
The function supports only 'classification'.
Default Value: "CLASSIFICATION"
Permitted Values: CLASSIFICATION
|
reg.lambda |
Optional Argument.
Specifies the L2 regularization that the loss function uses
while boosting trees. The higher the lambda, the stronger the
regularization effect.
Default Value: 1
|
shrinkage.factor |
Optional Argument.
Specifies the learning rate (weight) of a learned tree in each boosting step.
After each boosting step, the algorithm multiplies the learner by shrinkage
to make the boosting process more conservative. The shrinkage is a
DOUBLE PRECISION value in the range [0, 1].
The value 1 specifies no shrinkage.
Default Value: 0.1
|
iter.num |
Optional Argument.
Specifies the number of iterations to boost the weak classifiers,
which is also the number of weak classifiers in the ensemble (T). The
number must be numeric and in the range [1, 100000].
Default Value: 10
|
min.node.size |
Optional Argument.
Specifies the minimum size of any particular node within each
decision tree. The min.node.size must be numeric.
Default Value: 1
|
max.depth |
Optional Argument.
Specifies the maximum depth of the tree. The max.depth must be
numeric and in the range [1, 100000].
Default Value: 12
|
variance |
Optional Argument.
Specifies a decision-tree stopping criterion: the minimum variance for any node.
If the variance within a node falls below this value, the algorithm stops
looking for splits. This argument is a nonnegative DOUBLE PRECISION value.
Default Value: 0
|
seed |
Optional Argument.
Specifies the random seed the algorithm uses for repeatable results.
If you omit this argument or specify its default value 1, the function
uses a faster algorithm but does not ensure repeatability. This argument
must have a LONG value greater than or equal to 1. To ensure repeatability,
specify a value greater than 1.
Default Value: 1
|
attribute.name.column |
Optional Argument.
Required if the input data set is in sparse format.
Specifies the name of the input table column that contains the names of the
attributes of the input data set.
|
num.boosted.trees |
Optional Argument.
Specifies the number of boosted trees to be trained. By default, the
number of boosted trees equals the number of vworkers available for
the function.
|
attribute.table |
Optional Argument.
Required argument for sparse data format.
Specifies the name of the tbl_teradata containing the features in the input
data.
If the input data set is in sparse format, the function requires both "data"
and "attribute.table" arguments.
|
attribute.value.column |
Optional Argument.
Required if the input data set is in sparse format.
Specifies the name of the input table column that contains the values of the
attributes of the input data set.
|
column.subsampling |
Optional Argument.
Specifies the fraction of features to subsample during boosting.
Default Value: 1.0 (no subsampling)
|
response.column |
Required Argument when "formula" is not specified.
Specifies the name of the input table column that contains the response variable
for each data point in the training data set.
|
data.sequence.column |
Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row
of the input argument "data". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
|
attribute.table.sequence.column |
Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row
of the input argument "attribute.table". The argument is used to
ensure deterministic results for functions which produce results that
vary from run to run.
|
Value
The function returns an object of class "td_xgboost_mle", which is a named
list containing Teradata tbl objects.
Named list members can be referenced directly with the "$" operator
using the following names:
model.table
output
Examples
# Get the current context/connection
con <- td_get_context()$connection
# Load example data.
loadExampleData("xgboost_example", "housing_train_binary", "iris_train", "sparse_iris_train", "sparse_iris_attribute")
# Example 1: Binary Classification
# Create remote tibble objects.
housing_train_binary <- tbl(con, "housing_train_binary")
td_xgboost_out1 <- td_xgboost_mle(data=housing_train_binary,
id.column='sn',
formula = ( homestyle ~ driveway + recroom + fullbase + gashw + airco + prefarea + price + lotsize + bedrooms + bathrms + stories + garagepl ),
num.boosted.trees=2,
loss.function='binomial',
prediction.type='classification',
reg.lambda=1,
shrinkage.factor=0.1,
iter.num=10,
min.node.size=1,
max.depth=10
)
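# The returned object is a named list (see Value). Its members can be
# inspected with the "$" operator, continuing from td_xgboost_out1 above.
td_xgboost_out1$model.table
td_xgboost_out1$output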
# Example 2: Multiple-Class Classification
iris_train <- tbl(con,"iris_train")
td_xgboost_out2 <- td_xgboost_mle(data=iris_train,
id.column='id',
formula = ( species ~ sepal_length + sepal_width + petal_length + petal_width ),
num.boosted.trees=2,
loss.function='softmax',
reg.lambda=1,
shrinkage.factor=0.1,
iter.num=10,
min.node.size=1,
max.depth=10)
# Example 3: Sparse Input Format. The response.column argument is specified instead of formula.
sparse_iris_train <- tbl(con,"sparse_iris_train")
sparse_iris_attribute <- tbl(con,"sparse_iris_attribute")
td_xgboost_out3 <- td_xgboost_mle(data=sparse_iris_train,
attribute.table=sparse_iris_attribute,
id.column='id',
attribute.name.column='attribute',
attribute.value.column='value_col',
response.column="species",
loss.function='SOFTMAX',
reg.lambda=1,
num.boosted.trees=2,
shrinkage.factor=0.1,
column.subsampling=1.0,
iter.num=10,
min.node.size=1,
max.depth=10,
variance=0,
seed=1
)
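# Example 4: Illustrative sketch of dense versus sparse layout (local base R only,
# not a call to td_xgboost_mle).
# Assumption: a sparse input holds one row per (id, attribute) pair, using the
# 'attribute' and 'value_col' column names from Example 3. The data frame and its
# values below are made up for illustration; the actual sparse_iris_train and
# sparse_iris_attribute tables may be laid out differently.
dense_df <- data.frame(id = 1:2,
                       sepal_length = c(5.1, 4.9),
                       petal_length = c(1.4, 1.5))
feature_cols <- setdiff(names(dense_df), "id")
# Stack each feature column into (id, attribute, value_col) rows.
sparse_df <- do.call(rbind, lapply(feature_cols, function(col) {
  data.frame(id = dense_df$id,
             attribute = col,
             value_col = dense_df[[col]],
             stringsAsFactors = FALSE)
}))
# Order by id so that each observation's attributes sit together.
sparse_df <- sparse_df[order(sparse_df$id), ]
rownames(sparse_df) <- NULL
print(sparse_df)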