The TD_XGBoost function implements XGBoost (eXtreme Gradient Boosting), a gradient boosted decision tree algorithm designed for speed and performance. It has become one of the most widely used algorithms in applied machine learning.
In gradient boosting, each iteration fits a model to the residuals (errors) of the previous iteration to correct the errors made by the existing models. The predicted residual is multiplied by a learning rate and then added to the previous prediction. Models are added sequentially until no further improvement can be made. The method is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models. Gradient boosting involves three elements:
- A loss function to be optimized.
- A weak learner to make predictions.
- An additive model to add weak learners to minimize the loss function.
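The additive process above can be sketched in a few lines of Python. This is a minimal illustration for squared-error loss only: the function names (`fit_stump`, `boost`) are hypothetical, the weak learner is a one-split decision stump, and real XGBoost additionally uses regularization, second-order gradients, and full tree learners.

```python
def fit_stump(x, residuals):
    """Weak learner: a one-split decision stump minimizing squared error."""
    best = None
    for threshold in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        if not left or not right:
            continue
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def boost(x, y, n_rounds=20, learning_rate=0.3):
    pred = [sum(y) / len(y)] * len(y)  # initial prediction: mean of targets
    for _ in range(n_rounds):
        # Residuals are the negative gradient of squared-error loss.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        # Predicted residual, scaled by the learning rate, is added to
        # the previous prediction.
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    return pred

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.1, 3.9, 5.2]
pred = boost(x, y)
```

Each round shrinks the residuals a little, so the ensemble prediction moves steadily toward the targets.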
- Regression: The prediction is a continuous value. XGBoost regression calculates the difference between the current prediction and the known correct target value; this difference is called the residual. XGBoost then trains a weak model that maps features to that residual. The residual predicted by the weak model is added to the existing model's prediction, which nudges the model toward the correct target. Repeating this step improves the overall model prediction.
- Classification: Similar to regression, XGBoost uses regression trees for classification. In this case, the model's raw prediction is expressed as log odds (odds are the ratio between the number of events and non-events); the log odds are converted to a probability, and the residuals are computed from that probability. For example, if your data contains three spam emails and two non-spam emails, the odds of spam are 3:2, that is, 1.5 in decimal notation.
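The spam example above can be worked through numerically. This is a generic illustration of the odds / probability / log-odds relationships used in gradient boosted classification, not TD_XGBoost's internal computation.

```python
import math

# Worked example from the text: three spam and two non-spam emails.
events, non_events = 3, 2
odds = events / non_events   # 3:2 -> 1.5
prob = odds / (1 + odds)     # convert odds to probability -> 0.6
log_odds = math.log(odds)    # log odds serve as the model's raw score

# Converting back: the logistic function maps log odds to probability.
assert abs(1 / (1 + math.exp(-log_odds)) - prob) < 1e-12

# The residual for one observed spam email (label 1) at this probability:
residual = 1 - prob          # 0.4
```

A weak model is then fit to residuals like this one, and its scaled output is added to the log-odds score.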
- Regression
- Multiclass and binary classification
Usage
- When a dataset is small, best practice is to distribute the data to one AMP. To do this, create an identifier column as a primary index, and use the same value for each row.
This can also be achieved using DataRedistributionColumn and MinRowsPerAMP.
- For Classification (softmax), a maximum of 500 classes is supported.
- For Classification, when using a SELECT statement for the function input, the SELECT statement must have a deterministic output. Otherwise, the function may not run successfully or may not return the correct output. For example, the ON clause must not use a nondeterministic query such as "SELECT TOP 500 * FROM table_t".
- The processing time is proportional to the following:
- The number of boosted trees (controlled by NumBoostedTrees, TreeSize, and CoverageFactor).
- The number of iterations (sub-trees) in each boosted tree (controlled by IterNum).
- The complexity of an iteration (controlled by MaxDepth, MinNodeSize, ColumnSampling, MinImpurity).
A careful choice of these parameters controls the processing time. For example, changing CoverageFactor from 1.0 to 2.0 doubles the number of boosted trees, which roughly doubles the execution time.
- Use DataRedistributionColumn, in combination with MinRowsPerAMP, to distribute data to any number of AMPs using a specified column. This can be helpful in the following scenarios:
- The dataset is small and needs to be distributed to fewer AMPs.
- The dataset is imbalanced and needs to be distributed based on a column uncorrelated with class labels so that all classes are present on each AMP for modeling.
- Data distribution needs to be changed to eliminate skewness and distribute data evenly among the AMPs.
DataRedistributionColumn specifies the column used to distribute the data among AMPs. The column values are hashed to particular AMPs for redistribution while maintaining the constraints specified in MinRowsPerAMP. The value specified in MinRowsPerAMP establishes how many AMPs the data should be redistributed to (provided there are enough unique values in the DataRedistributionColumn). Data redistribution can impact execution time when the number of unique values in DataRedistributionColumn is small compared to the number of AMPs in the system. When DataRedistributionColumn is used, a column is added to the input table as part of the data redistribution, so TD_XGBoost supports at most 2047 columns in the input table. See TD_XGBoost using Data Redistribution.
- Classification with imbalanced datasets: When dealing with classification tasks and encountering an imbalanced dataset, such as having 1% rows for minority class 1 and 99% rows labeled as majority class 0, it is advisable to reduce the number of AMPs that hold the input table rows. This redistribution increases the likelihood of each AMP containing minority class data rows, allowing for better training of sub-tree models on the minority class.
- Large cluster with smaller datasets: In situations where the cluster consists of a large number of AMPs (for example, 200 AMPs) and the dataset size (total number of input table rows) is relatively small (for example, 1000 or fewer rows), redistributing the data rows to fewer AMPs can improve the quality of the model. This redistribution to fewer AMPs enables better training of sub-tree models because each model trains on a more substantial training sample.
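The hash-based redistribution described above can be illustrated with a toy model. This is a hypothetical sketch: the hash function and AMP assignment below are stand-ins, not Teradata's actual row-hashing algorithm. It shows why a DataRedistributionColumn with few unique values limits how many AMPs receive data.

```python
import hashlib

def amp_for(value, n_amps):
    """Hypothetical stand-in for hashing a column value to an AMP."""
    digest = hashlib.md5(str(value).encode()).hexdigest()
    return int(digest, 16) % n_amps

n_amps = 200  # a large system, as in the example scenarios above

# Only 3 unique values in the redistribution column: no matter how many
# rows there are, they can land on at most 3 of the 200 AMPs.
column_values = ["east", "west", "north"] * 1000
amps_used = {amp_for(v, n_amps) for v in column_values}
```

With only three unique values, `amps_used` contains at most three AMPs, so most of the 200 AMPs sit idle; a column with many unique values spreads the rows much more evenly.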