OutlierFilter Syntax Elements - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product
Teradata Vantage
Release Number
8.10
1.1
Published
October 2019
Language
English (United States)
Last Update
2019-12-31
dita:mapPath
ima1540829771750.ditamap
dita:ditavalPath
jsj1481748799576.ditaval
dita:id
B700-4003
lifecycle
previous
Product Category
Teradata Vantage™
OutputTable
Specify the name of the table where the function stores the copy of the InputTable (including the PARTITION BY column) with the outliers either deleted (by default) or replaced (as specified by the ReplacementValue syntax element).
OutliersTable
[Optional] Specify the name of the table where the function outputs copies of the rows of the InputTable that contain outliers.
Default behavior: Function does not output an outlier table.
TargetColumns
Specify the names of the InputTable columns that contain numeric data to filter.
GroupByColumns
[Optional] Specify the names of the InputTable columns by which to group the data. If the data schema format is name:value, this list must include name.
Default behavior: Function does not group data.
OutlierMethod
[Optional] Specify one or more of the following methods of filtering outliers:
method Description
'percentile' (Default) Percentile.
'tukey' Tukey's test:

An outlier is defined as any observation smaller than V1 - k*(V3-V1) or larger than V3 + k*(V3-V1), where V1 and V3 are 25th and 75th percentiles of data and k is specified by IQRMultiplier syntax element.

'carling' Carling's modification to Tukey's test:

An outlier is defined as an observation outside the range V2 ± c*(V3 - V1), where V2 is median of data, V1 and V3 are 25th and 75th percentiles of data, and c is constant (which you cannot change).

For more information about Carling's modification, see:

Carling, Kenneth. "Resistant outlier rules and the non-Gaussian case." Computational Statistics and Data Analysis 33, no. 3 (2000): 249-258. Available at https://core.ac.uk/download/pdf/6559387.pdf.

'MAD-median' Median absolute deviation (MAD), median of absolute values of residuals.

For example, for i datapoints and median value of data M, MAD=median i (|x i -M |).

Specify either one method, which the function uses for all columns specified by TargetColumns, or specify a method for each column specified by TargetColumns.

ApproxPercentile
[Optional] Specify whether the function calculates the percentiles used as filter limits exactly. Approximate percentiles are typically faster, but can fail when the number of groups exceeds one million.
Default: 'false'
PercentileThreshold
[Optional] Specify the range of percentile values for 'percentile' filtering, [perc_lower, 100 -perc_lower].
Default: [5, 95]
PercentAccuracy
[Optional] Specify the accuracy of percentiles used for filtering. The value accuracy must be in the range [0.01, 50].
Default: 0.5%
IQRMultiplier
[Optional] Specify the multiplier of interquartile range for 'tukey' filtering.
Default: 1.5
RemoveTail
[Optional] Specify the side of the distribution to filter.
Default: 'both'
ReplacementValue
[Optional] Specify how the function handles outliers:
Option Description
'delete' (Default) Function does not copy row to output table.
'null' Function copies row to output table, replacing each outlier with value NULL.
'median' Function copies row to output table, replacing each outlier with median value for its group.
newval Function copies row to output table, replacing each outlier with newval, which must be a numeric value.
MadScaleConstant
[Optional] Specify the scale constant used with 'MAD-median' filtering; a DOUBLE PRECISION value.
Default: 1.4826, which means MAD = 1.4826 * median(|x - median(x)|)
MadThreshold
[Optional] Specify the threshold used with 'MAD-median' filtering; a DOUBLE PRECISION value.
Default: 3, which means that |x-median(x)|/MAD > 3 is flagged as an outlier