Teradata® Package for R Function Reference

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, VMware
Product: Teradata Package for R
Release Number: 17.20
Published: March 2024
ft:locale: en-US
ft:lastEdition: 2024-05-03
dita:id: TeradataR_FxRef_Enterprise_1720
Product Category: Teradata Vantage

OutlierFilterFit

Description

The td_outlier_filter_fit_sqle() function calculates the lower percentile, upper percentile, count of rows, and median for each of the "target.columns" provided by the user. These per-column metrics help the function td_outlier_filter_transform_sqle() detect outliers in the input table. The function also stores the argument values in a FIT table, which is used during transformation.
Notes:

  • This function requires the UTF8 client character set for UNICODE data.

  • This function does not support Pass Through Characters (PTCs).

  • For information about PTCs, see Teradata Vantage™ - Analytics Database International Character Set Support.

  • This function does not support KanjiSJIS or Graphic data types.

Usage

  td_outlier_filter_fit_sqle (
      data = NULL,
      target.columns = NULL,
      group.columns = NULL,
      lower.percentile = 0.05,
      upper.percentile = 0.95,
      iqr.multiplier = 1.5,
      outlier.method = "PERCENTILE",
      replacement.value = "DELETE",
      remove.tail = "BOTH",
      percentile.method = "PERCENTILEDISC",
      ...
  )

Arguments

data

Required Argument.
Specifies the input tbl_teradata.
Types: tbl_teradata

target.columns

Required Argument.
Specifies the name(s) of the column(s) in "data" for which to compute the metrics.
Types: character OR vector of Strings (character)

group.columns

Optional Argument.
Specifies the column in "data" by which to group the rows when calculating the statistics.
Types: character

lower.percentile

Optional Argument.
Specifies the lower percentile bound used to decide whether a value is an outlier.
Default Value: 0.05
Types: float

upper.percentile

Optional Argument.
Specifies the upper percentile bound used to decide whether a value is an outlier.
Default Value: 0.95
Types: float

iqr.multiplier

Optional Argument.
Specifies the multiplier of the interquartile range for "TUKEY" filtering.
Default Value: 1.5
Types: float

outlier.method

Optional Argument.
Specifies the method for filtering the outliers.
Permitted Values:

  • PERCENTILE - [min_value, max_value], where min_value and max_value are the values at "lower.percentile" and "upper.percentile".

  • TUKEY - [Q1 - k*(Q3-Q1), Q3 + k*(Q3-Q1)] where:
    Q1 = 25th percentile (first quartile) of data
    Q3 = 75th percentile (third quartile) of data
    k = interquartile range multiplier (see "iqr.multiplier")

  • CARLING - Q2 ± c*(Q3-Q1) where:
    Q2 = median of data
    Q1 = 25th percentile (first quartile) of data
    Q3 = 75th percentile (third quartile) of data
    c = (17.63*r - 23.64) / (7.74*r - 3.71)
    r = count of rows in each group if you specify "group.columns", otherwise count of rows in "data"

Default Value: "PERCENTILE"
Types: character
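
The three interval rules can be tried out locally before running the function in the database. The following is a minimal base-R sketch, not the in-database implementation: outlier_bounds() is a hypothetical helper, the sample values are made up, quantile(type = 1) is only an assumed stand-in for the default "PERCENTILEDISC" method, and the Tukey upper bound uses the standard fence Q3 + k*(Q3-Q1).

```r
# Hypothetical helper: returns c(lower_bound, upper_bound) for one numeric vector.
# Local illustration only; the SQL Engine function may differ in details.
outlier_bounds <- function(x, method = "PERCENTILE",
                           lower = 0.05, upper = 0.95, k = 1.5) {
  x <- x[!is.na(x)]
  # Q1, Q2 (median), Q3 using a discrete percentile definition (assumption).
  q <- unname(quantile(x, probs = c(0.25, 0.50, 0.75), type = 1))
  iqr <- q[3] - q[1]
  switch(method,
    PERCENTILE = unname(quantile(x, probs = c(lower, upper), type = 1)),
    TUKEY = c(q[1] - k * iqr, q[3] + k * iqr),
    CARLING = {
      r <- length(x)                                    # row count
      cc <- (17.63 * r - 23.64) / (7.74 * r - 3.71)     # Carling constant c
      c(q[2] - cc * iqr, q[2] + cc * iqr)
    },
    stop("unknown method: ", method)
  )
}

# Made-up sample values, not taken from any Teradata example table.
fares <- c(7.25, 7.90, 8.05, 13.00, 26.55, 30.00, 71.28, 512.33)

tukey <- outlier_bounds(fares, "TUKEY")
carling <- outlier_bounds(fares, "CARLING")

# Values that fall outside the Tukey interval.
tukey_outliers <- fares[fares < tukey[1] | fares > tukey[2]]
print(tukey)
print(tukey_outliers)
```

With these sample values, Q1 is 7.90 and Q3 is 30.00, so the Tukey interval is [-25.25, 63.15] and the two largest values fall outside it.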

replacement.value

Optional Argument.
Specifies the method to handle outliers.
Permitted Values:

  • DELETE - Do not copy row to output tbl_teradata.

  • NULL - Copy row to output tbl_teradata, replacing each outlier with NULL.

  • MEDIAN - Copy row to output tbl_teradata, replacing each outlier with median value for its group.

  • REPLACEMENT VALUE - Copy row to output tbl_teradata, replacing each outlier with the specified replacement value, which must be numeric.

Default Value: "DELETE"
Types: character, integer, float

remove.tail

Optional Argument.
Specifies the tail of the distribution to remove.
Permitted Values:

  • LOWER - The lower tail.

  • UPPER - The upper tail.

  • BOTH - Both tails.

Default Value: "BOTH"
Types: character
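
To make the interaction between "replacement.value" and "remove.tail" concrete, here is a small local sketch in base R. It assumes the outlier interval is already known; handle_outliers() is a hypothetical helper written for illustration and is not part of tdplyr, and R's NA stands in for the database NULL. The median is taken over all non-missing input values, which is an assumption about how the stored median is computed.

```r
# Hypothetical helper: apply one outlier-handling option to a numeric vector.
# bounds is c(lower_bound, upper_bound); tail mirrors "remove.tail".
handle_outliers <- function(x, bounds, replacement = "DELETE", tail = "BOTH") {
  low  <- tail %in% c("LOWER", "BOTH") & x < bounds[1]
  high <- tail %in% c("UPPER", "BOTH") & x > bounds[2]
  is_out <- !is.na(x) & (low | high)

  if (is.numeric(replacement)) {
    x[is_out] <- replacement              # numeric replacement value
  } else if (replacement == "DELETE") {
    x <- x[!is_out]                       # drop the outlier rows
  } else if (replacement == "NULL") {
    x[is_out] <- NA                       # NA plays the role of NULL
  } else if (replacement == "MEDIAN") {
    x[is_out] <- median(x, na.rm = TRUE)  # median of the original values
  } else {
    stop("unknown replacement: ", replacement)
  }
  x
}

values <- c(1, 10, 11, 12, 13, 100)   # made-up data
bounds <- c(5, 50)                    # assumed outlier interval

deleted  <- handle_outliers(values, bounds, "DELETE")
nulled   <- handle_outliers(values, bounds, "NULL")
medianed <- handle_outliers(values, bounds, "MEDIAN")
zeroed   <- handle_outliers(values, bounds, 0)
upper    <- handle_outliers(values, bounds, "DELETE", tail = "UPPER")
```

With tail = "UPPER", the value 1 is kept even though it lies below the lower bound, because only the upper tail is removed.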

percentile.method

Optional Argument.
Specifies the Teradata percentile method to be used for calculating the upper and lower percentiles of the "target.columns".
Permitted Values:

  • PERCENTILECONT - Assumes a continuous distribution.

  • PERCENTILEDISC - Assumes a discrete distribution.

Default Value: "PERCENTILEDISC"
Types: character
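
The difference between the two methods is easiest to see on a tiny vector. The sketch below uses base R's quantile() as a local analogy: type = 1 returns an actual data value (discrete), while type = 7 interpolates linearly between neighboring values (continuous). Treating these as equivalents of PERCENTILEDISC and PERCENTILECONT is an assumption based on the usual SQL definitions, not something this reference states.

```r
x <- c(10, 20, 30, 40)   # made-up data

# Discrete: picks an existing value from the data.
disc <- unname(quantile(x, probs = 0.5, type = 1))   # 20

# Continuous: interpolates between the two middle values.
cont <- unname(quantile(x, probs = 0.5, type = 7))   # 25

print(c(discrete = disc, continuous = cont))
```

The discrete result is always one of the input values, which can matter when the target column holds integer codes or prices with fixed steps.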

...

Specifies the generic keyword arguments SQLE functions accept. Below are the generic keyword arguments:

persist:
Optional Argument.
Specifies whether to persist the results of the function in a table or not. When set to TRUE, results are persisted in a table; otherwise, results are garbage collected at the end of the session.
Default Value: FALSE
Types: logical

volatile:
Optional Argument.
Specifies whether to put the results of the function in a volatile table or not. When set to TRUE, results are stored in a volatile table; otherwise, they are not.
Default Value: FALSE
Types: logical

The function allows the user to partition, hash, order, or local order the input data. These generic arguments are available for each argument that accepts tbl_teradata as input and can be accessed as:

  • "<input.data.arg.name>.partition.column" accepts character or vector of character (Strings)

  • "<input.data.arg.name>.hash.column" accepts character or vector of character (Strings)

  • "<input.data.arg.name>.order.column" accepts character or vector of character (Strings)

  • "local.order.<input.data.arg.name>" accepts logical

Note:
These generic arguments are supported by tdplyr only if the underlying SQL Engine function supports them; otherwise, an exception is raised.

Value

The function returns an object of class "td_outlier_filter_fit_sqle", which is a named list containing objects of class "tbl_teradata".
Named list member(s) can be referenced directly with the "$" operator using the name(s):

  1. result

  2. output.data

Examples

  
    
    # Get the current context/connection.
    con <- td_get_context()$connection
    
    # Load the example data.
    loadExampleData("tdplyr_example", "titanic")
    
    # Create tbl_teradata object.
    titanic_data <- tbl(con, "titanic")
    
    # Check the list of available analytic functions.
    display_analytic_functions()
    
    # Example 1: Generating fit object to find outlier values in column "fare".
    OutlierFilterFit_out <- td_outlier_filter_fit_sqle(data = titanic_data,
                                                       target.columns = "fare")
    
    # Print the result.
    print(OutlierFilterFit_out$result)
    print(OutlierFilterFit_out$output.data)
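
A further variation on the example above, using only arguments listed under Usage and the generic "persist" argument. This is a sketch rather than a verified run: fit_fare_tukey() is a hypothetical wrapper name, and the call needs an active Vantage connection and a tbl_teradata with a numeric "fare" column, such as titanic_data.

```r
# Hypothetical wrapper: fit Tukey-based outlier metrics for column "fare".
# Nothing is sent to the database until the wrapper is called.
fit_fare_tukey <- function(data) {
  tdplyr::td_outlier_filter_fit_sqle(
    data = data,
    target.columns = "fare",
    outlier.method = "TUKEY",
    iqr.multiplier = 1.5,
    replacement.value = "MEDIAN",   # replace outliers with the median
    remove.tail = "UPPER",          # only treat the upper tail
    persist = FALSE                 # generic keyword argument
  )
}

# Usage, given the titanic_data object created in the example above:
# fit <- fit_fare_tukey(titanic_data)
# print(fit$result)
```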