Optional Syntax Elements for TD_SMOTE - Teradata Vantage

Teradata® VantageCloud Lake

Deployment
VantageCloud
Edition
Lake
Product
Teradata Vantage
Published
January 2023
ft:lastEdition
2024-12-11
ON clause
Specifies the table name as an EncodingsTable.
Create the EncodingsTable with the TD_OrdinalEncodingFit function, using Approach ('AUTO') and omitting the DefaultValue argument.
  • EncodingsTable is the precomputed output of TD_OrdinalEncodingFit, with the categorical input columns as TargetColumns. Because the output must be precomputed, a query is not allowed in this ON clause; you must supply a table.
  • With 'smotenc', column aliasing is not allowed for either InputTable or EncodingsTable.
CategoricalInputColumns
Specifies the names of the input table columns to use for oversampling. Used only with the 'smotenc' sampling strategy.
Required for smotenc.
MedianStandardDeviation
Specifies the median of the standard deviations of the numerical input columns in the minority class. Used only with the 'smotenc' sampling strategy. The SMOTENC algorithm uses this value to encode nominal values as numerical values. You can obtain this value with a query such as the following:
SELECT StatValue AS MedianValue FROM TD_UnivariateStatistics (
  -- Inner call: standard deviation of each numerical input column, minority class rows only
  ON (
    SELECT * FROM TD_UnivariateStatistics (
      ON (SELECT * FROM mydatatable WHERE response_column = minority_class_value) AS InputTable
      USING
      TargetColumns ('input_Columns')
      Stats ('STANDARD DEVIATION')
    ) AS dt
  ) AS InputTable
  -- Outer call: median of the per-column standard deviations
  USING
  TargetColumns ('StatValue')
  Stats ('MED')
) AS dtu;
The rows passed as InputTable to the inner TD_UnivariateStatistics call are the samples in the minority class only. These rows have the same class label provided in ResponseColumn for TD_SMOTE with sampling strategy 'smotenc'.
The columns in TargetColumns are those to be used in the InputColumns argument for TD_SMOTE with sampling strategy 'smotenc'.
Required for smotenc.
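As an illustration of what the query above computes, the following is a minimal Python sketch, not Teradata code. The rows, the column names (age, income), and the helper median_standard_deviation are hypothetical, and the sketch assumes the sample standard deviation; check which definition TD_UnivariateStatistics applies before comparing numbers.

```python
from statistics import median, stdev

# Hypothetical minority-class rows; each dict holds the numerical input columns.
minority_rows = [
    {"age": 30.0, "income": 40.0},
    {"age": 34.0, "income": 44.0},
    {"age": 38.0, "income": 60.0},
]
numeric_columns = ["age", "income"]


def median_standard_deviation(rows, columns):
    """Return the median of the per-column sample standard deviations."""
    # Step 1 (inner query): one standard deviation per numerical column.
    per_column_std = [stdev([row[col] for row in rows]) for col in columns]
    # Step 2 (outer query): the median of those standard deviations.
    return median(per_column_std)


median_std = median_standard_deviation(minority_rows, numeric_columns)
print(median_std)
```

The resulting single number plays the role of the value passed in MedianStandardDeviation.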
OversamplingFactor
Specifies the factor for oversampling the minority class.
The value must be positive. A value of 1.0 generates as many synthetic samples as there are minority samples, a value of 3 generates three times the number of minority samples, and so on.
For example, specifying 0.5 for a minority class with 100 observations creates 50 synthetic samples. However, this may not hold for ADASYN, which computes a probability based on the neighbor density local to an AMP, so the resulting number of synthetic samples may not exactly match the number implied by OversamplingFactor.
Default: 5.
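The arithmetic behind OversamplingFactor can be sketched in a few lines of Python. The function name expected_synthetic_count is hypothetical, and rounding to the nearest integer is an assumption for fractional results; the sketch covers only the nominal count and does not model the ADASYN behavior described above.

```python
def expected_synthetic_count(minority_count: int, oversampling_factor: float) -> int:
    """Nominal number of synthetic samples for a given factor (illustrative only)."""
    if oversampling_factor <= 0:
        raise ValueError("OversamplingFactor must be positive")
    # Assumption: fractional products are rounded to the nearest integer.
    return round(minority_count * oversampling_factor)


print(expected_synthetic_count(100, 0.5))  # factor below 1: fewer synthetic than original
print(expected_synthetic_count(100, 1.0))  # as many synthetic samples as minority samples
print(expected_synthetic_count(100, 5))    # the documented default factor
```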
SamplingStrategy
Specifies the oversampling algorithm to use for creating synthetic samples.
Accepted values are 'smote', 'adasyn', 'borderline', and 'smotenc'.
For the 'borderline' and 'adasyn' sampling strategies, when the class imbalance is very large or the number of minority samples is very small, oversample in fractions of the required amount instead of using a very large OversamplingFactor value. This avoids generating duplicates.
Default: 'smote'.
FillSampleID
Specifies whether the function writes out the ID of the observation used to generate each new synthetic observation. If FillSampleID is false, the column indicated in IDColumn is empty (NULL values).
Default: true.
ValueForNonInputColumns
Specifies the value to put in a sample column for columns not specified as input columns.
Accepted values are: 'sample', 'neighbor', and 'null'.
  • If the value is 'sample', the function uses the corresponding column value from the minority sample in the original dataset from which the new synthetic sample is generated.
  • If the value is 'neighbor', the function uses the corresponding column value from the neighbor from which the new synthetic sample is generated.
  • If value is 'null', the sample column does not have a value.
Default: 'sample'.
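The three options can be pictured with a small Python sketch. The function fill_non_input_value and the example city values are hypothetical; this only mirrors the rules listed above and is not the function's implementation.

```python
def fill_non_input_value(mode, sample_value, neighbor_value):
    """Pick the value of a non-input column for a new synthetic sample."""
    if mode == "sample":
        # Copy from the original minority sample that seeded the synthetic row.
        return sample_value
    if mode == "neighbor":
        # Copy from the nearest neighbor used for interpolation.
        return neighbor_value
    if mode == "null":
        # Leave the column without a value.
        return None
    raise ValueError(f"unsupported mode: {mode!r}")


# Hypothetical non-input column 'city' on the seed sample and its neighbor.
for mode in ("sample", "neighbor", "null"):
    print(mode, fill_non_input_value(mode, "Austin", "Dallas"))
```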
NumberOfNeighbors
Specifies the number of nearest neighbors from which to choose the sample used in oversampling. NumberOfNeighbors must be a positive integer less than or equal to 100.
Choose the value of NumberOfNeighbors carefully. A larger value implies a larger neighborhood for a data point, which can include points belonging to other classes; this decreases the effectiveness of the synthetic data and also requires more computation. The optimal value depends on the amount of imbalance. Prefer a larger value only for significant imbalances. Typically, a value from 5 to 10 works reasonably well. Furthermore, because the current TD_KNN output includes the observation itself as the first nearest neighbor, NumberOfNeighbors=5 uses only 4 neighbors for random interpolation. To get the expected number of neighbors to sample from, add 1; for example, set NumberOfNeighbors=6 to sample from 5 neighbors.
Default: 5.
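The off-by-one effect described above can be reproduced with a plain Python nearest-neighbor search. This is a sketch with hypothetical points and a hypothetical helper nearest_neighbors, not TD_KNN itself; it only shows why a query point that belongs to the searched set appears as its own first neighbor.

```python
import math


def nearest_neighbors(points, query_index, k):
    """Return the indices of the k points closest to points[query_index].

    The query point is part of the searched set, so it is always returned
    first with distance 0.
    """
    query = points[query_index]
    ranked = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    return ranked[:k]


# Hypothetical minority-class points at distances 0, 1, 2, ... from the first point.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 0.0),
          (0.0, 4.0), (5.0, 0.0), (0.0, 6.0)]

for k in (5, 6):
    neighbors = nearest_neighbors(points, 0, k)
    usable = [i for i in neighbors if i != 0]  # drop the observation itself
    print(f"k={k}: {len(usable)} usable neighbors")
```

With k set to 5 only four other points remain for interpolation, while 6 leaves five.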
Seed
Specifies the seed to use for sampling, for randomly selecting a nearest neighbor, and for sampling a point in the feature space between a data point and its selected nearest neighbor using a convex combination. The seed must be a non-negative integer. Specifying a seed ensures deterministic results.
Default: 186006.
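The roles of the seed can be sketched in Python: one seeded generator picks a neighbor and then a point on the segment between the sample and that neighbor. The helpers synthesize and generate and the data are hypothetical, and Python's generator is unrelated to the one the function uses, so the same seed value does not reproduce TD_SMOTE output.

```python
import random


def synthesize(sample, neighbor, rng):
    """Return a convex combination of sample and neighbor."""
    gap = rng.random()  # weight in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]


def generate(seed):
    """Pick a random neighbor, then a random point between it and the sample."""
    rng = random.Random(seed)
    sample = [1.0, 2.0]
    neighbors = [[2.0, 4.0], [3.0, 0.0], [0.0, 1.0]]
    chosen = rng.choice(neighbors)
    return synthesize(sample, chosen, rng)


first = generate(186006)
second = generate(186006)
print(first == second)  # identical seeds give identical synthetic points
```

Because every coordinate is a weighted average of the sample and the chosen neighbor, the synthetic point always lies on the segment joining them.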