TD_SMOTE Function

Teradata® VantageCloud Lake

Deployment: VantageCloud
Edition: Lake
Product: Teradata Vantage
Published: January 2023
Last edition: 2024-12-11

Many real-world classification datasets are imbalanced: observations belonging to one class (the minority class) are far fewer than observations belonging to the other class (the majority class). The challenge with imbalanced datasets is that most machine learning techniques model the majority class well but perform poorly on the minority class, even though in many situations the minority class is the more important one.

One approach to addressing imbalanced datasets is to oversample the minority class. The simplest way to do this is to duplicate examples in the minority class, but duplicates add no new information to the model. Instead, new examples can be synthesized from the existing ones using the Synthetic Minority Oversampling Technique (SMOTE).

The TD_SMOTE function implements SMOTE and three variations:
  • SMOTE generates each synthetic sample by choosing one of the nearest neighbors of an original minority sample at random and interpolating linearly, at a random point, between the original sample and that neighbor.
  • Adaptive Synthetic Sampling Approach (ADASYN) concentrates sampling on minority samples whose neighborhoods contain a higher density of majority-class samples. See He, H., Bai, Y., Garcia, E. A., & Li, S. (2008, June). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence) (pp. 1322-1328).
  • Borderline-SMOTE samples from the border group, that is, the minority samples closest to the boundary with the majority class. See Han, H., Wang, W. Y., & Mao, B. H. (2005, August). Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing (pp. 878-887). Berlin, Heidelberg: Springer Berlin Heidelberg. TD_SMOTE implements the Borderline-2 algorithm described in that paper.
  • Synthetic Minority Over-sampling Technique-Nominal Continuous (SMOTE-NC) generalizes SMOTE to handle mixed datasets of continuous and nominal features. See Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.
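
The basic SMOTE step (random neighbor plus random linear interpolation) can be illustrated with a minimal Python sketch. This is not TD_SMOTE's implementation or syntax. The function names `k_nearest` and `smote`, the Euclidean distance, and the sample data are all assumptions made for illustration only.

```python
import math
import random


def k_nearest(point, candidates, k):
    """Return the k candidates closest to point (Euclidean distance).

    The point itself is excluded by identity, so it must be an element
    of candidates.
    """
    others = [c for c in candidates if c is not point]
    return sorted(others, key=lambda c: math.dist(point, c))[:k]


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority-class points.

    Hypothetical helper for illustration; requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)                         # original sample
        neighbor = rng.choice(k_nearest(base, minority, k))  # random nearest neighbor
        gap = rng.random()                                  # random position in [0, 1)
        # Linear interpolation between the original sample and its neighbor.
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Illustrative data: four minority-class points with two continuous features.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6, k=2)
print(new_points)
```

Because every synthetic point lies on a line segment between two existing minority points, the new points stay inside the region already occupied by the minority class rather than duplicating existing rows.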

TD_SMOTE can handle multiclass datasets. However, the function can oversample only one minority class at a time, and it treats all other classes as the majority class.
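
The one-minority-class-at-a-time view of a multiclass dataset can be sketched in Python as follows. This only illustrates the idea of grouping every other class into a single majority; it does not show how TD_SMOTE is invoked or how it works internally. The helper name `one_vs_rest` and the labels are hypothetical.

```python
from collections import Counter


def one_vs_rest(labels, minority_label):
    """Relabel a multiclass label list as minority versus majority.

    Hypothetical helper: every class other than minority_label is
    treated as part of the majority.
    """
    return ["minority" if y == minority_label else "majority" for y in labels]


# Illustrative three-class dataset in which class "c" is the rarest.
labels = ["a", "a", "a", "a", "b", "b", "c"]
binary = one_vs_rest(labels, "c")
counts = Counter(binary)
print(counts["minority"], counts["majority"])
```

Under this view, balancing several rare classes would take one oversampling pass per class, each with a different class designated as the minority.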