The TD_XGBoost function implements eXtreme Gradient Boosting (XGBoost), a gradient boosted decision tree algorithm designed for speed and performance. In recent years, XGBoost has become one of the most widely used algorithms in applied machine learning.
In gradient boosting, each iteration fits a model to the residuals (errors) of the previous iteration to correct the errors made by the existing models. The predicted residual is multiplied by a learning rate and then added to the previous prediction. Models are added sequentially until no further improvement can be made. The technique is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.
Gradient boosting involves three elements:
- A loss function to be optimized.
- A weak learner to make predictions.
- An additive model to add weak learners to minimize the loss function.
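The iterative procedure described above can be sketched in a few lines of Python. This is an illustration only, not the TD_XGBoost implementation: the actual weak learner is a decision tree, whereas here it is replaced by a trivial model that predicts the mean residual, which is enough to show the additive, learning-rate-scaled update.

```python
# Sketch of gradient boosting for squared-error regression.
# Assumption: the "weak learner" is simplified to a constant model
# (the mean of the residuals) in place of a regression tree.

learning_rate = 0.1

def fit_weak_learner(residuals):
    # Stand-in for fitting a tree to the residuals: predict their
    # mean for every row.
    mean = sum(residuals) / len(residuals)
    return lambda: mean

def boost(targets, n_rounds=100):
    prediction = [0.0] * len(targets)
    for _ in range(n_rounds):
        # 1. Residuals: errors of the current ensemble.
        residuals = [t - p for t, p in zip(targets, prediction)]
        # 2. Fit a weak model to those residuals.
        learner = fit_weak_learner(residuals)
        # 3. Scale its output by the learning rate and add it on.
        prediction = [p + learning_rate * learner() for p in prediction]
    return prediction
```

With this constant learner, every prediction converges toward the mean of the targets; with real trees, each row converges toward its own target.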
- Regression: The prediction is a continuous value. XGBoost regression calculates the difference between the current prediction and the known correct target value; this difference is called the residual. XGBoost then trains a weak model that maps the features to that residual. The residual predicted by the weak model is added to the existing model's prediction, which nudges the model toward the correct target. Repeating this step improves the overall model prediction.
- Classification: XGBoost also uses regression trees for classification. In this case, the odds (the ratio between the number of events and non-events) are converted to a probability, and the probability is expressed as log odds to obtain the residuals. For example, if your data contains three spam emails and two non-spam emails, the odds of spam are 3:2, that is, 1.5 in decimal notation.
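The conversions in the spam example can be checked with a few lines of Python (illustrative arithmetic only, mirroring the 3:2 odds above):

```python
import math

# Three spam (events) and two non-spam (non-events) emails.
events, non_events = 3, 2

# Odds: ratio between events and non-events.
odds = events / non_events                    # 3:2 -> 1.5

# Probability: events over all outcomes, equivalently odds / (1 + odds).
probability = events / (events + non_events)  # 0.6

# Log odds (logit), the scale on which residuals are formed.
log_odds = math.log(odds)

# The two routes to the probability agree.
assert abs(probability - odds / (1 + odds)) < 1e-12
```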
- Regression
- Multiple class and binary classification
- When a dataset is small, best practice is to distribute the data to one AMP. To do this, create an identifier column as a primary index, and use the same value for each row.
- For Classification (softmax), a maximum of 500 classes is supported.
- For Classification, when using a SELECT statement as the function input, the SELECT statement must have a deterministic output. Otherwise, the function may not run successfully or may not return correct output. For example, the ON clause must not contain a non-deterministic query such as "SELECT TOP 500 * FROM table_t".
- The processing time is proportional to the following factors:
- The number of boosted trees (controlled by NumBoostedTrees, TreeSize, and CoverageFactor).
- The number of iterations (sub-trees) in each boosted tree (controlled by IterNum).
- The complexity of an iteration (controlled by MaxDepth, MinNodeSize, ColumnSampling, MinImpurity).
A careful choice of these parameters controls the processing time. For example, changing CoverageFactor from 1.0 to 2.0 doubles the number of boosted trees, which roughly doubles the execution time.
- Classification with imbalanced datasets: When dealing with classification tasks and encountering an imbalanced dataset, such as having 1% rows for minority class 1 and 99% rows labeled as majority class 0, it is advisable to reduce the number of AMPs that hold the input table rows. This redistribution increases the likelihood of each AMP containing minority class data rows, allowing for better training of sub-tree models on the minority class.
- Large cluster with smaller datasets: In situations where the cluster consists of a large number of AMPs (for example, 200 AMPs) and the dataset size (total number of input table rows) is relatively small (for example, 1000 or fewer rows), redistributing the data rows to fewer AMPs can improve the quality of the model. This redistribution to fewer AMPs enables better training of sub-tree models because each model trains on a more substantial training sample.