Description
The XGBoost function takes a training data set and uses gradient
boosting to create a strong classification model that can be input
to the function XGBoostPredict (td_xgboost_predict_mle).
The function supports input tables in both dense and sparse format.
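The difference between the two layouts can be sketched in base R. This is only an illustration of the general idea, not the exact schema the function expects: the values are made-up toy data, and the column names "id", "attribute" and "value_col" are borrowed from Example 3 as assumptions.

```r
# Dense layout: one row per observation, one column per attribute.
# Toy values for illustration only.
dense <- data.frame(
  id = c(1, 2),
  sepal_length = c(5.1, 6.2),
  petal_length = c(1.4, 4.5),
  species = c("setosa", "versicolor"),
  stringsAsFactors = FALSE
)

# Attributes to unpivot (everything except the row id and the response).
attrs <- c("sepal_length", "petal_length")

# Sparse layout: one row per (id, attribute) pair, holding that attribute's value.
sparse <- do.call(rbind, lapply(attrs, function(a) {
  data.frame(id = dense$id, attribute = a, value_col = dense[[a]],
             stringsAsFactors = FALSE)
}))

# Group the rows of each observation together and reset the row names.
sparse <- sparse[order(sparse$id), ]
rownames(sparse) <- NULL
print(sparse)
```

In the dense layout each attribute is its own column; in the sparse layout the attribute name and its value are stored in two columns, which is what "attribute.name.column" and "attribute.value.column" point to.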
Usage
td_xgboost_mle (
formula = NULL,
data = NULL,
id.column = NULL,
loss.function = "SOFTMAX",
prediction.type = "CLASSIFICATION",
reg.lambda = 1,
shrinkage.factor = 0.1,
iter.num = 10,
min.node.size = 1,
max.depth = 5,
variance = 0,
seed = NULL,
attribute.name.column = NULL,
num.boosted.trees = NULL,
attribute.table = NULL,
attribute.value.column = NULL,
column.subsampling = 1.0,
response.column = NULL,
data.sequence.column = NULL,
attribute.table.sequence.column = NULL
)
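The "formula" argument in the signature above takes a standard R formula object. When the predictor list is long, one option is to assemble the formula from character vectors with base R's reformulate(). The sketch below uses column names borrowed from Example 1; the variable names predictors, response and xgb_formula are illustrative only.

```r
# Column names taken from the housing example; adjust to the actual table.
predictors <- c("price", "lotsize", "bedrooms", "bathrms")
response <- "homestyle"

# reformulate() builds `response ~ term1 + term2 + ...` from character vectors.
xgb_formula <- reformulate(termlabels = predictors, response = response)

print(xgb_formula)
```

The resulting object can then be passed as formula = xgb_formula in place of a formula typed out by hand.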
Arguments
formula |
Required Argument when input data is in dense format.
Specifies an object of class "formula" that describes the model to be fitted. Only
basic formulas of the form (col1 ~ col2 + col3 + ...) are supported, and
all variables must be from the same tbl_teradata object. The
response should be a column of type real, numeric, integer or boolean.
This argument is not supported for sparse format. For sparse format,
provide this information using the "attribute.table" argument.
Note: This argument should not be specified along with "response.column".
|
data |
Required Argument.
Specifies the tbl_teradata object containing the input data set.
If the input data set is in dense format, the td_xgboost_mle function requires only "data".
|
id.column |
Optional Argument.
Specifies the name of the partitioning column of the input tbl_teradata. This
column is used as a row identifier to distribute data among different
vworkers for parallel boosted trees.
Types: character
|
loss.function |
Optional Argument.
Specifies the learning task and corresponding learning objective.
Default Value: "SOFTMAX"
Permitted Values: BINOMIAL, SOFTMAX
Types: character
|
prediction.type |
Optional Argument.
Specifies whether the function predicts the result from the number of classes
('classification') or from a continuous response variable ('regression').
The function supports only 'classification'.
Default Value: "CLASSIFICATION"
Permitted Values: CLASSIFICATION
Types: character
|
reg.lambda |
Optional Argument.
Specifies the L2 regularization that the loss function uses
while boosting trees. The higher the lambda, the stronger the
regularization effect.
Default Value: 1
Types: numeric
|
shrinkage.factor |
Optional Argument.
Specifies the learning rate (weight) of a learned tree in each boosting step.
After each boosting step, the algorithm multiplies the learner by shrinkage
to make the boosting process more conservative. The shrinkage is a
DOUBLE PRECISION value in the range [0, 1].
The value 1 specifies no shrinkage.
Default Value: 0.1
Types: numeric
|
iter.num |
Optional Argument.
Specifies the number of iterations to boost the weak classifiers,
which is also the number of weak classifiers in the ensemble (T). The
number must be an integer in the range [1, 100000].
Default Value: 10
Types: integer
|
min.node.size |
Optional Argument.
Specifies the minimum size of any particular node within each
decision tree.
Default Value: 1
Types: integer
|
max.depth |
Optional Argument.
Specifies the maximum depth of the tree. The "max.depth" must be
in the range [1, 100000].
Default Value: 5
Types: integer
|
variance |
Optional Argument.
Specifies a decision tree stopping criterion. If the variance within
any node dips below this value, the algorithm stops looking for splits
in the branch.
Default Value: 0
Types: numeric
|
seed |
Optional Argument.
Specifies the random seed the algorithm uses for repeatable results.
If you omit this argument or specify its default value 1, the function
uses a faster algorithm but does not ensure repeatability. This argument
must be greater than or equal to 1. To ensure repeatability,
specify a value greater than 1.
Default Value: 1
Types: numeric
|
attribute.name.column |
Optional Argument.
Required if the input data set is in sparse format.
Specifies the column containing the attributes in the
input data set.
Types: character
|
num.boosted.trees |
Optional Argument.
Specifies the number of boosted trees to be trained. By default, the
number of boosted trees equals the number of vworkers available for
the function.
Types: integer
|
attribute.table |
Optional Argument.
Required argument for sparse data format.
Specifies the name of the tbl_teradata containing the features in the input
data.
If the input data set is in sparse format, the function requires both "data"
and "attribute.table" arguments.
|
attribute.value.column |
Optional Argument.
Required if the input data set is in sparse format.
Specifies the column containing the attribute values in the
input tbl_teradata.
Types: character
|
column.subsampling |
Optional Argument.
Specifies the fraction of features to subsample during boosting.
Default Value: 1.0 (no subsampling)
Types: numeric
|
response.column |
Required Argument when "formula" is not specified.
Specifies the name of the input tbl_teradata column that contains the response variable
for each data point in the training data set.
Types: character
|
data.sequence.column |
Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row
of the input argument "data". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: character OR vector of Strings (character)
|
attribute.table.sequence.column |
Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row
of the input argument "attribute.table". The argument is used to
ensure deterministic results for functions which produce results that
vary from run to run.
Types: character OR vector of Strings (character)
|
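To illustrate how "shrinkage.factor", "iter.num" and "reg.lambda" interact, the following toy sketch runs a generic gradient-boosting update for squared-error loss in base R. It is not the algorithm td_xgboost_mle executes (which trains classification trees in-database). The function name boost_constant and the single-leaf "tree" are hypothetical simplifications, and the regularized leaf weight sum(residual) / (n + lambda) is the standard XGBoost formulation for squared error, assumed here for illustration.

```r
# Toy boosting loop: each "weak learner" is a single constant leaf.
boost_constant <- function(y, iter.num = 10, shrinkage.factor = 0.1, reg.lambda = 1) {
  pred <- rep(0, length(y))
  for (t in seq_len(iter.num)) {
    # Negative gradient of the squared-error loss at the current prediction.
    residual <- y - pred
    # L2-regularized leaf weight: a larger reg.lambda pulls it toward zero.
    leaf <- sum(residual) / (length(y) + reg.lambda)
    # Scale the new learner by the shrinkage factor before adding it.
    pred <- pred + shrinkage.factor * leaf
  }
  pred
}

y <- c(1, 2, 3)
slow <- boost_constant(y, iter.num = 10, shrinkage.factor = 0.1)
fast <- boost_constant(y, iter.num = 10, shrinkage.factor = 1)
print(c(slow = slow[1], fast = fast[1]))
```

With the same number of iterations, the smaller shrinkage factor moves the prediction toward the target more slowly, so a lower "shrinkage.factor" generally needs a higher "iter.num" to reach a comparable fit.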
Value
Function returns an object of class "td_xgboost_mle" which is a named
list containing objects of class "tbl_teradata".
Named list members can be referenced directly with the "$" operator
using the following names:
model.table
output
Examples
# Get the current context/connection
con <- td_get_context()$connection
# Load example data.
loadExampleData("xgboost_example", "housing_train_binary","iris_train","sparse_iris_train","sparse_iris_attribute")
#Example 1: Binary Classification
# Create object(s) of class "tbl_teradata".
housing_train_binary <- tbl(con, "housing_train_binary")
td_xgboost_out1 <- td_xgboost_mle(data=housing_train_binary,
id.column='sn',
formula = ( homestyle ~ driveway + recroom + fullbase + gashw + airco + prefarea + price + lotsize + bedrooms + bathrms + stories + garagepl ),
num.boosted.trees=2,
loss.function='binomial',
prediction.type='classification',
reg.lambda=1,
shrinkage.factor=0.1,
iter.num=10,
min.node.size=1,
max.depth=10
)
#Example 2: Multiple-Class Classification
iris_train <- tbl(con,"iris_train")
td_xgboost_out2 <- td_xgboost_mle(data=iris_train,
id.column='id',
formula = ( species ~ sepal_length + sepal_width + petal_length + petal_width ),
num.boosted.trees=2,
loss.function='softmax',
reg.lambda=1,
shrinkage.factor=0.1,
iter.num=10,
min.node.size=1,
max.depth=10)
#Example 3: Sparse Input Format. "response.column" argument is specified instead of formula.
sparse_iris_train <- tbl(con,"sparse_iris_train")
sparse_iris_attribute <- tbl(con,"sparse_iris_attribute")
td_xgboost_out3 <- td_xgboost_mle(data=sparse_iris_train,
attribute.table=sparse_iris_attribute,
id.column='id',
attribute.name.column='attribute',
attribute.value.column='value_col',
response.column="species",
loss.function='SOFTMAX',
reg.lambda=1,
num.boosted.trees=2,
shrinkage.factor=0.1,
column.subsampling=1.0,
iter.num=10,
min.node.size=1,
max.depth=10,
variance=0,
seed=1
)