| |
Methods defined here:
- __init__(self, formula=None, data=None, id_column=None, loss_function='SOFTMAX', prediction_type='CLASSIFICATION', reg_lambda=1.0, shrinkage_factor=0.1, iter_num=10, min_node_size=1, max_depth=5, variance=0.0, seed=None, attribute_name_column=None, num_boosted_trees=None, attribute_table=None, attribute_value_column=None, column_subsampling=1.0, response_column=None, data_sequence_column=None, attribute_table_sequence_column=None, output_accuracy=False)
- DESCRIPTION:
The XGBoost function takes a training data set and uses gradient
boosting to create a strong classifying model that can be input to
the function XGBoostPredict. The function supports input tables in
both dense and sparse format.
PARAMETERS:
formula:
Required Argument when input data is in dense format.
A string consisting of "formula". Specifies the model to be fitted.
Only basic formulas of the form "col1 ~ col2 + col3 +..." are
supported, and all variables must be from the same teradataml
DataFrame object. The response should be a column of type float, int or
bool. This argument is not supported for sparse format. For sparse data
format, provide this information using "attribute_table" argument.
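To make the basic formula form concrete, the following minimal sketch splits a formula string into its response and predictor columns. The helper `split_formula` is hypothetical and is not part of teradataml; it only mirrors the "col1 ~ col2 + col3" shape described above.

```python
def split_formula(formula):
    """Split a basic "response ~ a + b + c" formula into (response, predictors).

    Hypothetical helper for illustration only; not part of teradataml.
    """
    if formula.count("~") != 1:
        raise ValueError("expected exactly one '~' in the formula")
    response, rhs = (part.strip() for part in formula.split("~"))
    predictors = [term.strip() for term in rhs.split("+")]
    if not response or not all(predictors):
        raise ValueError("formula has an empty response or predictor")
    return response, predictors


# Column names taken from Example 1 of this docstring.
response, predictors = split_formula("homestyle ~ price + lotsize + bedrooms")
print(response, predictors)
```

The left side of "~" names the response column and the right side lists the predictor columns, all of which must exist in the DataFrame passed as "data".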
data:
Required Argument.
Specifies the teradataml DataFrame containing the input data set.
If the input data set is in dense format, the XGBoost function requires only "data".
id_column:
Optional Argument.
Specifies the name of the partitioning column of the input table. This
column is used as a row identifier to distribute data among different
vworkers for parallel boosted trees.
Types: str OR list of Strings (str)
loss_function:
Optional Argument.
Specifies the learning task and corresponding learning objective.
Default Value: "SOFTMAX"
Permitted Values: BINOMIAL, SOFTMAX, MSE
Note:
Permitted value 'MSE' is supported when teradataml is connected to Vantage 1.3
or later versions.
Types: str
prediction_type:
Optional Argument.
Specifies whether the function predicts the result from a set of classes
('classification') or from a continuous response variable ('regression').
On Vantage versions earlier than 1.3, the function supports only 'classification'.
Default Value: "CLASSIFICATION"
Permitted Values: CLASSIFICATION, REGRESSION
Note:
Permitted value 'REGRESSION' is supported when teradataml is connected to Vantage 1.3
or later versions.
Types: str
reg_lambda:
Optional Argument.
Specifies the L2 regularization that the loss function uses
while boosting trees. The higher the lambda, the stronger the
regularization effect.
Default Value: 1.0
Types: float
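The effect of a larger lambda can be illustrated with the leaf-weight formula from the general XGBoost algorithm, w = -G / (H + lambda), where G and H are the sums of the gradients and hessians in a leaf. This is a sketch under the assumption that the Vantage implementation follows that standard formula, which this documentation does not confirm; `leaf_weight` is a hypothetical helper.

```python
def leaf_weight(gradients, hessians, reg_lambda=1.0):
    """Leaf weight under L2 regularization: w = -G / (H + lambda).

    Assumption: the standard XGBoost formula; not confirmed for Vantage.
    """
    total_gradient = sum(gradients)
    total_hessian = sum(hessians)
    return -total_gradient / (total_hessian + reg_lambda)


# Squared-error loss: gradient = prediction - target, hessian = 1.
gradients = [-2.0, -4.0]
hessians = [1.0, 1.0]
for reg_lambda in (0.0, 1.0, 10.0):
    print(reg_lambda, leaf_weight(gradients, hessians, reg_lambda))
```

As lambda grows, the leaf weight is pulled toward zero, which is the stronger regularization effect described above.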
shrinkage_factor:
Optional Argument.
Specifies the learning rate (weight) of a learned tree in each boosting step.
After each boosting step, the algorithm multiplies the learner by shrinkage
to make the boosting process more conservative. The shrinkage is a
float value in the range [0.0, 1.0].
The value 1.0 specifies no shrinkage.
Default Value: 0.1
Types: float
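A toy illustration of shrinkage: each boosting step below fits the simplest possible weak learner (the mean of the current residuals) and adds it to the prediction after multiplying it by the shrinkage factor. The function `boost_constant` is hypothetical and greatly simplified; it is not how Vantage builds trees.

```python
def boost_constant(targets, shrinkage_factor=0.1, iter_num=10):
    """Boost a constant prediction toward the targets.

    Illustrative only: each "weak learner" is the mean residual.
    """
    prediction = 0.0
    for _ in range(iter_num):
        residuals = [target - prediction for target in targets]
        learner = sum(residuals) / len(residuals)
        # Multiply the learner by the shrinkage factor before adding it.
        prediction += shrinkage_factor * learner
    return prediction


targets = [8.0, 12.0]
print(boost_constant(targets, shrinkage_factor=0.1, iter_num=10))
print(boost_constant(targets, shrinkage_factor=1.0, iter_num=1))
```

With a shrinkage factor of 1.0 the full learner is applied in a single step, while smaller values move the prediction toward the targets more conservatively and need more iterations.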
iter_num:
Optional Argument.
Specifies the number of iterations to boost the weak classifiers,
which is also the number of weak classifiers in the ensemble (T). The
number must be an int in the range [1, 100000].
Default Value: 10
Types: int
min_node_size:
Optional Argument.
Specifies the minimum size of any particular node within each
decision tree. The min_node_size must be an int.
Default Value: 1
Types: int
max_depth:
Optional Argument.
Specifies the maximum depth of the tree. The max_depth must be an int in
the range [1, 100000].
Default Value: 5
Types: int
variance:
Optional Argument.
Specifies a decision tree stopping criterion. If the variance within
any node dips below this value, the algorithm stops looking for splits
in the branch.
Default Value: 0.0
Types: float
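The stopping rule can be sketched as a simple check on the response values that fall into a node. The helper `should_stop_splitting` is hypothetical, and the use of population variance is an assumption; the documentation does not say which variance estimator Vantage uses.

```python
from statistics import pvariance


def should_stop_splitting(node_values, variance=0.0):
    """Return True when the node's variance dips below the threshold.

    Assumption: population variance; the actual estimator is not documented.
    """
    return pvariance(node_values) < variance


print(should_stop_splitting([1.0, 1.1, 0.9], variance=0.5))
print(should_stop_splitting([0.0, 10.0], variance=0.5))
print(should_stop_splitting([1.0, 1.0, 1.0], variance=0.0))
```

Under this strict "below" reading, the default value 0.0 never triggers the criterion, because a variance cannot be negative.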
seed:
Optional Argument.
Specifies the seed to use to create a random number.
If you omit this argument or specify its default value 1, the function
uses a faster algorithm but does not ensure repeatability.
This argument must have an int value greater than or equal to 1. To ensure
repeatability, specify a value greater than 1.
Types: int
attribute_name_column:
Optional Argument.
Required for sparse data format. If the data set is in sparse format,
this argument indicates the column containing the attributes in the
input data set.
Types: str OR list of Strings (str)
num_boosted_trees:
Optional Argument.
Specifies the number of boosted trees to be trained. By default, the
number of boosted trees equals the number of vworkers available for
the function.
Types: int
attribute_table:
Optional Argument.
Required if the input data set is in sparse format.
Specifies the teradataml DataFrame containing the features in the input
data.
attribute_value_column:
Optional Argument.
Required if the input data set is in sparse format.
If the data is in the sparse format, this argument indicates the
column containing the attribute values in the input table.
Types: str OR list of Strings (str)
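To show how dense and sparse layouts relate, the sketch below turns dense rows into a label table plus an attribute table with one (id, attribute, value) row per feature. The helper `dense_to_sparse` is hypothetical, and the two-table layout and the column names 'attribute' and 'value_col' are assumptions inferred from Example 3 of this docstring, not a documented schema.

```python
def dense_to_sparse(rows, id_column, response_column):
    """Split dense rows into (labels, attributes) in a sparse layout.

    Hypothetical helper; the layout is an assumption based on Example 3.
    """
    labels, attributes = [], []
    for row in rows:
        labels.append({id_column: row[id_column],
                       response_column: row[response_column]})
        for name, value in row.items():
            if name not in (id_column, response_column):
                attributes.append({id_column: row[id_column],
                                   "attribute": name,
                                   "value_col": value})
    return labels, attributes


dense_rows = [
    {"id": 1, "sepal_length": 5.1, "petal_length": 1.4, "species": "setosa"},
]
labels, attributes = dense_to_sparse(dense_rows, "id", "species")
print(labels)
print(attributes)
```

In this layout, "attribute_name_column" would name the column holding the feature names and "attribute_value_column" the column holding their values.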
column_subsampling:
Optional Argument.
Specifies the fraction of features to subsample during boosting.
Default Value: 1.0 (no subsampling)
Types: float
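Column subsampling can be pictured as drawing a random fraction of the feature columns. The helper `subsample_columns` is hypothetical; the accepted range (0.0, 1.0] and the rounding rule are assumptions, since the documentation only states that 1.0 means no subsampling.

```python
import random


def subsample_columns(columns, column_subsampling=1.0, seed=None):
    """Pick a random fraction of the feature columns.

    Assumptions: fraction in (0.0, 1.0], rounded, and at least one column kept.
    """
    if not 0.0 < column_subsampling <= 1.0:
        raise ValueError("column_subsampling must be in (0.0, 1.0]")
    if column_subsampling == 1.0:
        return list(columns)  # no subsampling
    keep = max(1, round(column_subsampling * len(columns)))
    return random.Random(seed).sample(list(columns), keep)


features = ["price", "lotsize", "bedrooms", "bathrms"]
print(subsample_columns(features, column_subsampling=0.5, seed=7))
print(subsample_columns(features, column_subsampling=1.0))
```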
response_column:
Optional Argument.
Specifies the name of the teradataml DataFrame column that
contains the responses (labels) of the data.
Types: str OR list of Strings (str)
data_sequence_column:
Optional Argument.
Specifies the list of column(s) that uniquely identifies each row of
the input argument "data". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: str OR list of Strings (str)
attribute_table_sequence_column:
Optional Argument.
Specifies the list of column(s) that uniquely identifies each row of
the input argument "attribute_table". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: str OR list of Strings (str)
output_accuracy:
Optional Argument.
Specifies whether to show training accuracy over iterations in the
output model_table DataFrame.
Note:
The argument 'output_accuracy' is available when teradataml is connected to Vantage 1.3
or later versions.
Default Value: False
Types: bool
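The documented ranges and permitted values can be checked before calling the function. The following is a minimal sketch; `check_xgboost_args` is a hypothetical helper, not part of teradataml, and it covers only the constraints stated in the parameter descriptions above. Treating the permitted values as case-insensitive is an assumption based on the examples, which pass both 'binomial' and 'SOFTMAX'.

```python
PERMITTED_LOSS = {"BINOMIAL", "SOFTMAX", "MSE"}
PERMITTED_PREDICTION = {"CLASSIFICATION", "REGRESSION"}


def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def check_xgboost_args(loss_function="SOFTMAX", prediction_type="CLASSIFICATION",
                       shrinkage_factor=0.1, iter_num=10, max_depth=5, seed=None):
    """Return a list of problems; an empty list means the values look valid.

    Hypothetical pre-check based only on the documented ranges.
    """
    problems = []
    if loss_function.upper() not in PERMITTED_LOSS:
        problems.append("loss_function must be BINOMIAL, SOFTMAX or MSE")
    if prediction_type.upper() not in PERMITTED_PREDICTION:
        problems.append("prediction_type must be CLASSIFICATION or REGRESSION")
    if not 0.0 <= shrinkage_factor <= 1.0:
        problems.append("shrinkage_factor must be in [0.0, 1.0]")
    if not (_is_int(iter_num) and 1 <= iter_num <= 100000):
        problems.append("iter_num must be an int in [1, 100000]")
    if not (_is_int(max_depth) and 1 <= max_depth <= 100000):
        problems.append("max_depth must be an int in [1, 100000]")
    if seed is not None and not (_is_int(seed) and seed >= 1):
        problems.append("seed must be an int greater than or equal to 1")
    return problems


print(check_xgboost_args(loss_function="binomial", max_depth=10))
print(check_xgboost_args(iter_num=0, seed=0))
```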
RETURNS:
Instance of XGBoost.
Output teradataml DataFrames can be accessed using attribute
references, such as XGBoostObj.<attribute_name>.
Output teradataml DataFrame attribute names are:
1. model_table
2. output
RAISES:
TeradataMlException
EXAMPLES:
# Load example data.
load_example_data("xgboost", ["housing_train_binary","iris_train","sparse_iris_train","sparse_iris_attribute"])
# Example 1: Binary classification on the housing data. Build a model using training data
# that contains two labels (responses), classic and eclectic, specifying the style of a house,
# based on the 12 other attributes of the house, such as bedrooms, stories and price.
# Create teradataml DataFrame objects.
housing_train_binary = DataFrame.from_table("housing_train_binary")
XGBoost_out1 = XGBoost(data=housing_train_binary,
id_column='sn',
formula = "homestyle ~ driveway + recroom + fullbase + gashw + airco + prefarea + price + lotsize + bedrooms + bathrms + stories + garagepl",
num_boosted_trees=2,
loss_function='binomial',
prediction_type='classification',
reg_lambda=1.0,
shrinkage_factor=0.1,
iter_num=10,
min_node_size=1,
max_depth=10
)
# Print the results.
print(XGBoost_out1)
# Example 2: Multiple-Class Classification
# Use the XGBoost classification algorithm on the well-known Iris data set.
# The Iris data set contains 50 samples from each of three species of Iris flower: setosa, virginica and versicolor.
# Each data point contains measurements of length and width of sepals and petals.
iris_train = DataFrame.from_table("iris_train")
XGBoost_out2 = XGBoost(data=iris_train,
id_column='id',
formula = "species ~ sepal_length + sepal_width + petal_length + petal_width",
num_boosted_trees=2,
loss_function='softmax',
reg_lambda=1.0,
shrinkage_factor=0.1,
iter_num=10,
min_node_size=1,
max_depth=10)
# Print the results.
print(XGBoost_out2)
# Example 3: Sparse Input Format. response_column argument is specified instead of formula.
sparse_iris_train = DataFrame.from_table("sparse_iris_train")
sparse_iris_attribute = DataFrame.from_table("sparse_iris_attribute")
XGBoost_out3 = XGBoost(data=sparse_iris_train,
attribute_table=sparse_iris_attribute,
id_column='id',
attribute_name_column='attribute',
attribute_value_column='value_col',
response_column="species",
loss_function='SOFTMAX',
reg_lambda=1.0,
num_boosted_trees=2,
shrinkage_factor=0.1,
column_subsampling=1.0,
iter_num=10,
min_node_size=1,
max_depth=10,
variance=0.0,
seed=1
)
# Print the results.
print(XGBoost_out3)
# Example 4: Use optional argument 'output_accuracy'.
# We will use the teradataml DataFrames created in Example 3.
# Note:
#     This example works only when teradataml is connected to Vantage 1.3
#     or later versions.
XGBoost_out4 = XGBoost(data=sparse_iris_train,
attribute_table=sparse_iris_attribute,
id_column='id',
attribute_name_column='attribute',
attribute_value_column='value_col',
response_column="species",
loss_function='SOFTMAX',
reg_lambda=1.0,
num_boosted_trees=2,
shrinkage_factor=0.1,
column_subsampling=1.0,
iter_num=10,
min_node_size=1,
max_depth=10,
variance=0.0,
seed=1,
output_accuracy=True
)
# Print the results.
print(XGBoost_out4)
- __repr__(self)
- Returns the string representation for an XGBoost class instance.
- get_build_time(self)
- Function to return the build time of the algorithm in seconds.
When model object is created using retrieve_model(), then the value returned is
as saved in the Model Catalog.
- get_prediction_type(self)
- Function to return the Prediction type of the algorithm.
When model object is created using retrieve_model(), then the value returned is
as saved in the Model Catalog.
- get_target_column(self)
- Function to return the Target Column of the algorithm.
When model object is created using retrieve_model(), then the value returned is
as saved in the Model Catalog.
- show_query(self)
- Function to return the underlying SQL query.
When model object is created using retrieve_model(), then None is returned.
|