- Specify the name of the table where the function stores the copy of the input table (including the PARTITION BY column) with the outliers either deleted (by default) or replaced (as specified by the ReplacementValue argument).
- [Optional] Specify the name of the table where the function outputs copies of the rows of the input table that contain outliers.
- Default behavior: Function does not output an outlier table.
- Specify the names of the input table columns that contain numeric data to filter.
- [Optional] Specify the names of the input table columns by which to group the data. If the data schema format is name:value, this list must include name.
- Default behavior: Function does not group data.
- [Optional] Specify one or more of the following methods of filtering outliers:
method Description 'percentile' (Default) Percentile. 'tukey' Tukey's test:
An outlier is defined as any observation smaller than V1 - k*(V3-V1) or larger than V3 + k*(V3-V1), where V1 and V3 are 25th and 75th percentiles of data and k is specified by IQRMultiplier argument.
'carling' Carling's modification to Tukey's test:
An outlier is defined as an observation outside the range V2 ± c*(V3 - V1), where V2 is median of data, V1 and V3 are 25th and 75th percentiles of data, and c is constant (which you cannot change).
For more information about Carling's modification, see:
Carling, Kenneth. "Resistant outlier rules and the non-Gaussian case." Computational Statistics and Data Analysis 33, no. 3 (2000): 249-258. Available at https://core.ac.uk/download/pdf/6559387.pdf.
'MAD-median' Median absolute deviation (MAD), median of absolute values of residuals.
For example, for i datapoints and median value of data M, MAD=median i (|x i -M |).
Specify either one method, which the function uses for all columns specified by TargetColumns, or specify a method for each column specified by TargetColumns.
- [Optional] Specify whether the function calculates the percentiles used as filter limits exactly. Approximate percentiles are typically faster, but can fail when the number of groups exceeds one million.
- Default: 'false'
- [Optional] Specify the range of percentile values for 'percentile' filtering, [perc_lower, 100 -perc_lower].
- Default: [5, 95]
- [Optional] Specify the accuracy of percentiles used for filtering.
- Default: 0.5%
- [Optional] Specify the multiplier of interquartile range for 'tukey' filtering.
- Default: 1.5
- [Optional] Specify the side of the distribution to filter.
- Default: 'both'
- [Optional] Specify how the function handles outliers:
Option Description 'delete' (Default) Function does not copy row to output table. 'null' Function copies row to output table, replacing each outlier with value NULL. 'median' Function copies row to output table, replacing each outlier with median value for its group. newval Function copies row to output table, replacing each outlier with newval, which must be a numeric value.
- [Optional] Specify the scale constant used with 'MAD-median' filtering; a DOUBLE PRECISION value.
- Default: 1.4826, which means MAD = 1.4826 * median(|x - median(x)|)
- [Optional] Specify the threshold used with 'MAD-median' filtering; a DOUBLE PRECISION value.
- Default: 3, which means that |x-median(x)|/MAD > 3 is flagged as an outlier