TD_OutlierFilterTransform

Database Analytic Functions

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, VMware
Product: Analytics Database
Release Number: 17.20
Published: June 2022
Language: English (United States)
Last Update: 2024-10-04
Product Category: Teradata Vantage™

TD_OutlierFilterTransform filters outliers from the input table. The metrics for determining outliers come from TD_OutlierFilterFit output.

Outlier filtering is a technique used to identify and remove outliers from a dataset in machine learning pipelines. A simple method of filtering outliers is to calculate the 25th and 75th percentiles of the data and remove the values that fall below the 25th percentile or above the 75th percentile. The positions of these percentiles in the sorted data are given by:

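25th percentile position = 0.25*(N+1)

75th percentile position = 0.75*(N+1)

where

N = number of data points, and each position is a 1-based rank in the data sorted in ascending order. When a position is not a whole number, a common convention is to interpolate linearly between the two neighboring data points.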
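The position formulas above can be turned into percentile values as in the following minimal Python sketch. The helper name `percentile_by_position` and the sample data are hypothetical, and linear interpolation between neighboring ranks is an assumption; this is not necessarily the percentile method that TD_OutlierFilterFit applies.

```python
def percentile_by_position(values, p):
    """Return the percentile for fraction p (0 < p < 1) at rank p * (N + 1).

    Assumption: fractional ranks are resolved by linear interpolation.
    """
    data = sorted(values)
    n = len(data)
    pos = p * (n + 1)               # 1-based position in the sorted data
    pos = min(max(pos, 1.0), n)     # clamp to the valid range 1..N
    lo = int(pos)                   # rank of the lower neighbor
    frac = pos - lo                 # fractional distance to the upper neighbor
    if lo >= n:
        return float(data[-1])
    return data[lo - 1] + frac * (data[lo] - data[lo - 1])


sample = [3, 5, 7, 8, 12, 13, 14, 18]          # illustrative data, N = 8
print(percentile_by_position(sample, 0.25))     # position 2.25 -> 5.5
print(percentile_by_position(sample, 0.75))     # position 6.75 -> 13.75
```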

These percentiles describe where the middle half of the data lies and how spread out it is. While they are useful for summarizing the distribution of the data, they do not by themselves indicate which extreme values lie unusually far from the median. Therefore, use more advanced methods such as Tukey and Carling. The following steps show how to apply these methods to filter outliers:

  1. Calculate the interquartile range (IQR) of the dataset: The IQR is calculated by subtracting the 25th percentile from the 75th percentile of the data.
  2. Calculate the upper and lower bounds for outliers using the Tukey or Carling method:
    • For the Tukey method, the upper bound is calculated by adding 1.5 times the IQR to the 75th percentile, while the lower bound is calculated by subtracting 1.5 times the IQR from the 25th percentile.
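    • For the Carling method, the bounds are centered on the median rather than on the quartiles: the upper bound is calculated by adding c times the IQR to the median, while the lower bound is calculated by subtracting c times the IQR from the median, where c = (17.63*r - 23.64) / (7.74*r - 3.71) and r is the number of data points (c is approximately 2.3 for large r).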
  3. Identify the outliers: Any data point that falls below the lower bound or above the upper bound is considered an outlier.
  4. Filter the outliers: Once the outliers have been identified, they can be removed from the dataset or replaced with another value, such as the mean or median.
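The four steps can be sketched in Python for the Tukey case as follows. This is an illustration of the technique only, not the implementation of TD_OutlierFilterTransform; the function name `tukey_filter`, the multiplier parameter `k`, and the sample data are hypothetical. Quartiles are taken from the standard library's `statistics.quantiles`, whose default `exclusive` method uses (N+1)-based positions.

```python
from statistics import quantiles


def tukey_filter(values, k=1.5):
    """Return (kept, outliers, (lower, upper)) using fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)      # step 1: quartiles of the data
    iqr = q3 - q1                           # step 1: interquartile range
    lower = q1 - k * iqr                    # step 2: lower bound
    upper = q3 + k * iqr                    # step 2: upper bound
    outliers = [v for v in values if v < lower or v > upper]   # step 3
    kept = [v for v in values if lower <= v <= upper]          # step 4 (removal)
    return kept, outliers, (lower, upper)


sample = [2, 4, 4, 5, 6, 7, 8, 9, 50]       # illustrative data
kept, outliers, bounds = tukey_filter(sample)
print(bounds)     # (-2.75, 15.25)
print(outliers)   # [50]
print(kept)       # [2, 4, 4, 5, 6, 7, 8, 9]
```

Passing a larger `k` widens the bounds, so fewer points are flagged.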

Consider the following array of 10 data points:

[10, 12, 15, 17, 20, 22, 25, 30, 40, 100]

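  1. The 25th percentile position is 0.25 * 11 = 2.75, which interpolates between the 2nd and 3rd values to give 12 + 0.75 * (15 - 12) = 14.25. The 75th percentile position is 0.75 * 11 = 8.25, which gives 30 + 0.25 * (40 - 30) = 32.5. Therefore, the IQR is 32.5 - 14.25 = 18.25.
  2. Using the Tukey method, the upper bound is 1.5 times the IQR added to the 75th percentile, which is 32.5 + (1.5 * 18.25) = 59.875. The lower bound is 1.5 times the IQR subtracted from the 25th percentile, which is 14.25 - (1.5 * 18.25) = -13.125.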
  3. Any data point that falls below the lower bound or above the upper bound is considered an outlier. In this case, the data point 100 exceeds the upper bound and is therefore an outlier.
  4. The outlier can be removed from the dataset or replaced with the mean, median, or another suitable value, depending on the type of data. In this case, remove the outlier to obtain the filtered dataset:

    [10, 12, 15, 17, 20, 22, 25, 30, 40]
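Step 4 also mentions replacement as an alternative to removal. The following minimal Python sketch shows median replacement for the same array, given that step 3 identified 100 as the only outlier. The helper name `replace_with_median` is hypothetical, and computing the median from the remaining (non-outlier) values is an assumption made for illustration.

```python
from statistics import median


def replace_with_median(values, outliers):
    """Replace each flagged value with the median of the unflagged values."""
    flagged = set(outliers)
    inliers = [v for v in values if v not in flagged]
    fill = median(inliers)      # assumption: median of the non-outlier values
    return [fill if v in flagged else v for v in values]


data = [10, 12, 15, 17, 20, 22, 25, 30, 40, 100]
print(replace_with_median(data, outliers=[100]))
# [10, 12, 15, 17, 20, 22, 25, 30, 40, 20]
```

Unlike removal, replacement keeps the number of data points unchanged, which can matter when other columns of the same row must be retained.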

The choice between the Tukey and Carling methods depends on the specific characteristics of the dataset. The Carling bounds are typically wider than the Tukey bounds, so the Carling method tends to flag fewer data points as outliers, while the Tukey method tends to flag more.