TD_XGBoost Function | Teradata Vantage - Analytics Database

Database Analytic Functions

Deployment: VantageCloud, VantageCore
Edition: VMware, Enterprise, IntelliFlex
Product: Analytics Database
Release Number: 17.20
Published: June 2022
Product Category: Teradata Vantage™

The TD_XGBoost function implements eXtreme Gradient Boosting (XGBoost), a gradient boosted decision tree algorithm designed for speed and performance. XGBoost has become one of the most widely used algorithms in applied machine learning.

In gradient boosting, each iteration fits a model to the residuals (errors) of the previous iteration to correct the mistakes made by the existing models. The predicted residual is multiplied by a learning rate and then added to the previous prediction. Models are added sequentially until no further improvement can be made. The technique is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.
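As a hedged illustration of the update step, the arithmetic can be checked with plain SQL. The numbers below are made up for illustration and are not produced by the function: with a known target of 100, a current prediction of 70, and a learning rate of 0.3, the residual is 30 and the next prediction is 70 + 0.3 × 30 = 79.

```sql
-- Illustrative only: one gradient boosting update step with made-up values.
-- residual = target - prediction
-- new prediction = prediction + learning_rate * residual
SELECT
    100 - 70              AS residual,        -- 30
    70 + 0.3 * (100 - 70) AS new_prediction;  -- 79.0
```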

Gradient boosting involves three elements:
  • A loss function to be optimized.
  • A weak learner to make predictions.
  • An additive model to add weak learners to minimize the loss function.
The loss function used depends on the type of problem being solved. For example, regression may use squared error, and binary classification may use binomial deviance. A benefit of gradient boosting is that a new boosting algorithm does not have to be derived for each loss function; instead, the framework is generic enough that any differentiable loss function can be used. The TD_XGBoost function supports both regression and classification predictive modeling problems. The model that it creates is used by the TD_XGBoostPredict function to make predictions.
  • Regression: The prediction is based on continuous values. XGBoost regression calculates the difference between the current prediction and the known correct target value. This difference is called the residual. XGBoost regression then trains a weak model that maps features to that residual. The residual predicted by the weak model is added to the existing model output, nudging the model toward the correct target. Repeating this step improves the overall model prediction.
  • Classification: Similar to regression, XGBoost uses regression trees for classification. In this case, the residual is computed by converting the odds (the ratio between the number of events and non-events) to a probability; the probability is expressed through log odds to obtain residuals. For example, if your data contains three spam emails and two non-spam emails, the odds are 3:2, that is, 1.5 in decimal notation.
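The odds-to-probability-to-log-odds conversion from the spam example can be checked with plain SQL arithmetic. This is an illustration of the math only, not a call to the function; LN is the standard natural-logarithm function.

```sql
-- Spam example: 3 events (spam) and 2 non-events (non-spam).
-- probability = odds / (1 + odds); log odds = LN(odds)
SELECT
    3.0 / 2.0                     AS odds,         -- 1.5
    (3.0 / 2.0) / (1 + 3.0 / 2.0) AS probability,  -- 0.6
    LN(3.0 / 2.0)                 AS log_odds;     -- ~0.405
```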
The TD_XGBoost function supports the following features:
  • Regression
  • Multiclass and binary classification
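A minimal regression call might look like the following sketch, which follows the general Vantage analytic-function syntax. The table and column names (housing_train, price, and so on) are hypothetical, and clause names other than those documented on this page (for example, ResponseColumn, InputColumns, and ModelType) are assumptions; check the syntax reference for your release before use.

```sql
-- Hypothetical example: train a regression model on a housing table.
SELECT * FROM TD_XGBoost (
    ON housing_train AS InputTable
    USING
        ResponseColumn ('price')           -- assumed clause name
        InputColumns ('sqft', 'bedrooms')  -- assumed clause name
        ModelType ('Regression')           -- assumed clause name
        MaxDepth (5)
        NumBoostedTrees (10)
) AS dt;
```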

Usage

  • When a dataset is small, the best practice is to distribute the data to one AMP. To do this, create an identifier column as the primary index, and use the same value for each row.

    This can also be achieved using DataRedistributionColumn and MinRowsPerAmp.

  • For classification (softmax), a maximum of 500 classes is supported.
  • For classification, when using a SELECT statement as the function input, the SELECT statement must have deterministic output; otherwise, the function may not run successfully or may return incorrect output. For example, the ON clause must not use a nondeterministic query such as "SELECT TOP 500 * FROM table_t", because TOP without an ORDER BY returns an arbitrary set of rows.
  • The processing time is proportional to:
    • The number of boosted trees (controlled by NumBoostedTrees, TreeSize, and CoverageFactor).
    • The number of iterations (sub-trees) in each boosted tree (controlled by IterNum).
    • The complexity of an iteration (controlled by MaxDepth, MinNodeSize, ColumnSampling, MinImpurity).

    Careful choice of these parameters controls the processing time. For example, changing CoverageFactor from 1.0 to 2.0 doubles the number of boosted trees, which roughly doubles the execution time.

  • Use DataRedistributionColumn, in combination with MinRowsPerAmp, to distribute data to any number of AMPs using a specified column. This can be helpful under the following scenarios:
    • The dataset is small and needs to be distributed to fewer AMPs.
    • The dataset is imbalanced and needs to be distributed based on a column uncorrelated with class labels so that all classes are present on each AMP for modeling.
    • Data distribution needs to be changed to eliminate skewness and distribute data evenly among the AMPs.
    DataRedistributionColumn specifies the column used to distribute the data among AMPs. The column values are hashed to particular AMPs for redistribution while maintaining the constraints specified in MinRowsPerAmp. The value specified in MinRowsPerAmp establishes how many AMPs the data is redistributed to (provided there are enough unique values in DataRedistributionColumn).
    Data redistribution can increase execution time when the number of unique values in DataRedistributionColumn is small compared to the number of AMPs in the system.
  • When DataRedistributionColumn is used, a column is added to the input table as part of the data redistribution; in this case, the input table supports at most 2047 columns. See TD_XGBoost using Data Redistribution.
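Putting the redistribution options together, a training call over a small or skewed dataset might look like the following sketch. The table and column names are hypothetical, and clause names not documented on this page (ResponseColumn, InputColumns, ModelType) are assumptions; only DataRedistributionColumn and MinRowsPerAmp are taken from this page.

```sql
-- Hypothetical example: redistribute a small or skewed dataset before training.
SELECT * FROM TD_XGBoost (
    ON small_train AS InputTable
    USING
        ResponseColumn ('label')             -- assumed clause name
        InputColumns ('f1', 'f2', 'f3')      -- assumed clause name
        ModelType ('Classification')         -- assumed clause name
        DataRedistributionColumn ('row_id')  -- values are hashed to spread rows over AMPs
        MinRowsPerAmp (100)                  -- lower bound on rows placed on each AMP
) AS dt;
```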
Teradata recommends redistributing the input table rows to fewer AMPs (possibly a single AMP) in the following cases for improved training results:
  1. Classification with imbalanced datasets: When dealing with classification tasks and encountering an imbalanced dataset, such as having 1% rows for minority class 1 and 99% rows labeled as majority class 0, it is advisable to reduce the number of AMPs that hold the input table rows. This redistribution increases the likelihood of each AMP containing minority class data rows, allowing for better training of sub-tree models on the minority class.
  2. Large cluster with smaller datasets: In situations where the cluster consists of a large number of AMPs (for example, 200 AMPs) and the dataset size (total number of input table rows) is relatively small (for example, 1000 or fewer rows), redistributing the data rows to fewer AMPs can improve the quality of the model. This redistribution to fewer AMPs enables better training of sub-tree models because each model trains on a more substantial training sample.
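For the small-dataset case, the single-AMP technique described under Usage can be sketched as follows. The table names are hypothetical; the idea is that a constant primary-index column hashes every row to the same AMP.

```sql
-- Hypothetical example: force all rows of a small training set onto one AMP.
CREATE TABLE small_train_one_amp AS (
    SELECT t.*, 1 AS amp_id  -- same identifier value for every row
    FROM small_train t
) WITH DATA
PRIMARY INDEX (amp_id);      -- constant index hashes all rows to a single AMP
```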