TD_OutlierFilterTransform Function - Teradata Vantage

Teradata® VantageCloud Lake

Deployment: VantageCloud
Edition: Lake
Product: Teradata Vantage
Published: January 2023
Language: English (United States)
Last Update: 2024-04-03

TD_OutlierFilterTransform filters outliers from the input table. The metrics for determining outliers come from TD_OutlierFilterFit output.

Outlier filtering is a technique for identifying and removing outliers from a dataset in machine learning pipelines. The simplest method is to calculate the 25th and 75th percentiles of the data and remove the values below the 25th percentile and above the 75th percentile. The positions of these percentiles in the sorted data are given by:

Position of the 25th percentile = 0.25*(N+1)

Position of the 75th percentile = 0.75*(N+1)

where

N = number of data points
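
As a minimal sketch of how the position formulas translate into a percentile value, the Python function below locates the position `fraction * (N + 1)` in the sorted data and interpolates linearly between the two neighboring values. The function name `percentile_by_position`, the linear interpolation, the clamping at the ends, and the sample data are assumptions for illustration. The formulas above only define the position, and TD_OutlierFilterFit may compute percentiles differently.

```python
import math


def percentile_by_position(values, fraction):
    """Percentile value at the 1-based position fraction * (N + 1).

    Linear interpolation between neighbors is an assumption of this sketch.
    """
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("values must not be empty")
    # Position from the formula, clamped to the valid rank range [1, n].
    position = min(max(fraction * (n + 1), 1.0), float(n))
    lower_rank = math.floor(position)
    weight = position - lower_rank  # fractional part of the position
    if lower_rank == n:
        return float(data[-1])
    below, above = data[lower_rank - 1], data[lower_rank]
    return below + weight * (above - below)


sample = [1, 2, 3, 4, 5, 6, 7]             # hypothetical data, N = 7
q1 = percentile_by_position(sample, 0.25)  # position 0.25 * 8 = 2
q3 = percentile_by_position(sample, 0.75)  # position 0.75 * 8 = 6
print(q1, q3)
```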

These percentiles describe where the central half of the data lies and how widely it is spread. While they are useful for summarizing the distribution, they do not indicate how far extreme values lie from the median, so cutting at the percentiles themselves also removes values that are not extreme. Therefore, use more advanced methods like Tukey and Carling. The following steps show how to apply these methods to filter outliers:

  1. Calculate the Inter-quartile range (IQR) of the dataset: The IQR is calculated by subtracting the 25th percentile from the 75th percentile of the data.
  2. Calculate the upper and lower bounds for outliers using the Tukey or Carling method:
    • For the Tukey method, the upper bound is calculated by adding 1.5 times the IQR to the 75th percentile, while the lower bound is calculated by subtracting 1.5 times the IQR from the 25th percentile.
    • For the Carling method, the upper bound is calculated by adding 3 times the IQR to the 75th percentile, while the lower bound is calculated by subtracting 3 times the IQR from the 25th percentile.
  3. Identify the outliers: Any data point that falls outside the upper and lower bounds is considered an outlier.
  4. Filter the outliers: Once the outliers have been identified, they can be removed from the dataset or replaced with another value, such as the mean or median.
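
The four steps above can be sketched in Python as follows. This is an illustration only, not the TD_OutlierFilterFit or TD_OutlierFilterTransform implementation: the names `MULTIPLIERS`, `outlier_bounds`, and `split_outliers` are hypothetical, the multipliers are the ones stated in step 2, and the quartiles are assumed to follow the (N+1) position rule, which is what `statistics.quantiles` computes with `method="exclusive"`.

```python
from statistics import quantiles

# IQR multipliers as described in step 2 above (assumption of this sketch).
MULTIPLIERS = {"tukey": 1.5, "carling": 3.0}


def outlier_bounds(values, method="tukey"):
    """Steps 1 and 2: compute the IQR, then the lower and upper bounds."""
    q1, _median, q3 = quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    k = MULTIPLIERS[method]
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(values, method="tukey"):
    """Steps 3 and 4: identify the outliers and separate them from the rest."""
    lower, upper = outlier_bounds(values, method)
    kept = [v for v in values if lower <= v <= upper]
    outliers = [v for v in values if v < lower or v > upper]
    return kept, outliers


sample = [3, 4, 5, 5, 6, 6, 7, 8, 50]  # hypothetical data
kept, outliers = split_outliers(sample, method="tukey")
print(outlier_bounds(sample, "tukey"), kept, outliers)
```

Here the outliers are removed. Replacing them instead would mean substituting a chosen value for each element of `outliers` rather than dropping it.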

Consider the following array of 10 data points:

[10, 12, 15, 17, 20, 22, 25, 30, 40, 100]

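  1. With N = 10, the 25th percentile sits at position 0.25 * 11 = 2.75, between the 2nd and 3rd values, which gives 12 + 0.75 * (15 - 12) = 14.25. The 75th percentile sits at position 0.75 * 11 = 8.25, between the 8th and 9th values, which gives 30 + 0.25 * (40 - 30) = 32.5. Therefore, the IQR is 32.5 - 14.25 = 18.25.
  2. Using the Tukey method, the upper bound is 1.5 times the IQR added to the 75th percentile, which is 32.5 + (1.5 * 18.25) = 59.875. The lower bound is 1.5 times the IQR subtracted from the 25th percentile, which is 14.25 - (1.5 * 18.25) = -13.125.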
  3. Any data point that falls outside the upper and lower bounds is considered an outlier. In this case, the data point 100 falls outside the upper bound and is therefore an outlier.
  4. The outlier can be removed from the dataset or replaced with the mean, the median, or another suitable value according to the type of data. In this case, remove the outlier to obtain the filtered dataset:

    [10, 12, 15, 17, 20, 22, 25, 30, 40]
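
The worked example can be re-checked with a short Python sketch. Quartile values depend on the interpolation convention, so the sketch tries two conventions offered by Python's `statistics.quantiles`: `"exclusive"`, which follows the (N+1) position rule, and `"inclusive"`. The helper name `tukey_filter` is hypothetical, and neither convention is claimed to be the one TD_OutlierFilterFit uses. The bounds differ between conventions, but 100 is the only value removed in both cases.

```python
from statistics import quantiles

data = [10, 12, 15, 17, 20, 22, 25, 30, 40, 100]


def tukey_filter(values, quantile_method):
    """Return the Tukey bounds (1.5 * IQR) and the values inside them."""
    q1, _median, q3 = quantiles(values, n=4, method=quantile_method)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    kept = [v for v in values if lower <= v <= upper]
    return (lower, upper), kept


results = {}
for quantile_method in ("exclusive", "inclusive"):
    results[quantile_method] = tukey_filter(data, quantile_method)
    print(quantile_method, *results[quantile_method])
```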

The choice between the Tukey and Carling methods depends on the specific characteristics of the dataset. With the multipliers described above, the Tukey method (1.5 times the IQR) sets narrower bounds and may identify more outliers, while the Carling method (3 times the IQR) sets wider bounds and may identify fewer outliers.
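
To see how the multiplier affects the result, the sketch below applies the two multipliers from the steps above (1.5 and 3) to a hypothetical sample in which the largest value lies between the two upper bounds. The function name `flagged` and the sample are assumptions for illustration, and the quartiles again use the (N+1) rule through `statistics.quantiles` with `method="exclusive"`.

```python
from statistics import quantiles


def flagged(values, multiplier):
    """Values outside [Q1 - multiplier * IQR, Q3 + multiplier * IQR]."""
    q1, _median, q3 = quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [v for v in values if v < lower or v > upper]


sample = [10, 12, 15, 17, 20, 22, 25, 30, 40, 70]  # hypothetical data
flagged_with_1_5 = flagged(sample, 1.5)
flagged_with_3 = flagged(sample, 3.0)
print(flagged_with_1_5, flagged_with_3)
```

On this sample the 1.5 multiplier flags 70 and the 3 multiplier flags nothing, because a larger multiplier always produces wider bounds around the same quartiles.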