teradataml OpenSourceML exposes the lightGBM package through an interface object, td_lightgbm. Use this interface object to run all supported lightGBM functions with the same syntax and arguments as in lightGBM. Execution uses MPP capabilities, so the data is not pulled to the client.
teradataml OpenSourceML’s td_lightgbm trains and scores models using either a single-model approach or a distributed (multi-model) approach. Note the following when working with lightGBM in distributed model training:
- teradataml OpenSourceML introduces the partition_columns argument, which can be used with any lightGBM function.
- The partition_columns argument accepts the names of the columns used for partitioning.
- A separate model is generated for each unique partition.
- The specified column names must be present in the parent teradataml DataFrame from which the input teradataml DataFrames are derived.
- If the parent DataFrame does not contain the columns, teradataml raises an exception.
- After the fit() or train() method has generated distributed models (one per unique partition), partition_columns is optional in predict and other functions. If the argument is not provided, teradataml OpenSourceML picks up partition_columns from the trained model.
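The partitioning behavior described in the list above can be illustrated with a small, plain-Python sketch. This is not teradataml code and does not reflect how teradataml OpenSourceML is implemented: the class PartitionedMeanModel, its methods, and the sample rows are hypothetical, and a per-partition mean stands in for a real lightGBM model. It only shows the three behaviors noted above: one model per unique partition, an error when a partition column is missing, and reuse of the stored partition columns at prediction time.

```python
from collections import defaultdict
from statistics import mean


class PartitionedMeanModel:
    """Hypothetical toy stand-in for per-partition training; not teradataml code."""

    def __init__(self):
        self.partition_columns = None
        self.models = {}  # partition key -> fitted "model" (here, a mean)

    def fit(self, rows, target, partition_columns):
        # Partition columns must exist in the input data, otherwise raise.
        missing = [c for c in partition_columns if any(c not in r for r in rows)]
        if missing:
            raise ValueError(f"partition columns not found: {missing}")
        groups = defaultdict(list)
        for r in rows:
            key = tuple(r[c] for c in partition_columns)
            groups[key].append(r[target])
        # One model is generated for each unique partition.
        self.models = {k: mean(v) for k, v in groups.items()}
        self.partition_columns = list(partition_columns)
        return self

    def predict(self, rows, partition_columns=None):
        # Fall back to the columns remembered from training when none are given.
        cols = partition_columns or self.partition_columns
        return [self.models[tuple(r[c] for c in cols)] for r in rows]


rows = [
    {"region": "east", "x": 1.0, "y": 10.0},
    {"region": "east", "x": 2.0, "y": 14.0},
    {"region": "west", "x": 1.5, "y": 3.0},
]
model = PartitionedMeanModel().fit(rows, target="y", partition_columns=["region"])

# partition_columns is omitted here, so the value stored during fit() is used.
predictions = model.predict([{"region": "west", "x": 9.0}, {"region": "east", "x": 0.0}])
```

In this sketch, each row passed to predict() is routed to the model trained on its own partition, which is the idea behind omitting partition_columns after training.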
The following sections detail how to use teradataml’s td_lightgbm to run the supported lightGBM functions (Dataset, Booster, train, cv, and all scikit-learn functions) to generate a single model or, through the partition_columns argument, distributed (multi-model) models. They also provide supportability information.