Arguments - Aster Analytics

Teradata Aster Analytics Foundation User Guide

Product
Aster Analytics
Release Number
6.21
Published
November 2016
Language
English (United States)
Last Update
2018-04-14
Product Category
Software
Argument Category Description
IDColumn Required Specifies the names of the columns in the source and reference input tables that contain row identifiers. The function copies these columns to the output table.
NominalMatchColumns Optional* Specifies pairs of columns (attributes) to check for exact matching (a.columnX and b.columnY are column names). If any pair matches exactly, then their records are considered to be exact matches.

*Required if you omit FuzzyMatchColumns.

FuzzyMatchColumns Optional* Specifies pairs of columns (attributes) to check for fuzzy matching (a.columnX and b.columnY are column names) and the fuzzy matching parameters match_metric, match_weight, and synonym_file (whose descriptions follow). If any pair is a fuzzy match, then their records are considered to be fuzzy matches.

*Required if you omit NominalMatchColumns.

The parameter match_metric specifies the similarity metric, which is a function that returns the similarity score of two strings (a value between 0 and 1). The possible values of match_metric are:
  • EQUAL:

    If strings a and b are equal, then their similarity score is 1.0; otherwise it is 0.0.

  • LD:

    The similarity score of strings a and b is f(a,b)=LD(a,b)/max(len(a),len(b)), where LD(a,b) is the Levenshtein distance between a and b.

  • D-LD:

    Like LD, except that LD(a,b) is the Damerau–Levenshtein distance between a and b.

  • JARO:

    The similarity score of strings a and b is the Jaro distance between them.

  • JARO-WINKLER:

    The similarity score of strings a and b is the Jaro-Winkler distance between them.

  • NEEDLEMAN-WUNSCH:

    The similarity score of strings a and b is the Needleman-Wunsch distance between them.

  • JD:

    The similarity score of strings a and b is their Jaccard similarity. The function converts a and b to sets s and t by splitting each string on spaces, and then computes f(s,t)=|s∩t|/|s∪t|.

  • COSINE:

    The similarity score of strings a and b is calculated with their term frequency–inverse document frequency (TF-IDF) and cosine similarity.

    The function calculates IDF only on the input relation stored in memory.
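As an illustration of two of the metrics above, the following Python sketch computes the LD ratio and the JD formula exactly as they are written in this list. It is not the function's implementation; the helper names (levenshtein, ld_ratio, jaccard) and the handling of empty strings are assumptions made for the example. Note that the LD ratio as written is 0 for identical strings and grows as the strings diverge; whether the function reports this ratio directly or one minus it is not stated here.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein (edit) distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def ld_ratio(a: str, b: str) -> float:
    """LD(a,b) / max(len(a), len(b)), as written for the LD metric."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # assumption: two empty strings yield a ratio of 0
    return levenshtein(a, b) / longest


def jaccard(a: str, b: str) -> float:
    """|s ∩ t| / |s ∪ t| over space-separated tokens, as written for JD."""
    s, t = set(a.split(" ")), set(b.split(" "))
    return len(s & t) / len(s | t)


if __name__ == "__main__":
    print(ld_ratio("kitten", "sitting"))         # 3 edits over 7 characters
    print(jaccard("john a smith", "john smith"))  # 2 shared tokens out of 3
```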
The parameter match_weight specifies the weight (relative importance) of the attribute represented by a.columnX and b.columnY. The match_weight must be a positive number.

The function normalizes each match_weight to a value in the range [0, 1]. Given match_weight values w_1, w_2, ..., w_n, the normalized value of w_i is:

w_i/(w_1 + w_2 + ... + w_n)

For example, given two pairs of columns, whose match weights are 3 and 7, the function uses the weights 3/(3+7)=0.3 and 7/(3+7)=0.7 to compute the similarity score.
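The normalization can be sketched in a few lines of Python. The function names are hypothetical, and combining the per-pair scores as a weighted sum is an assumption: the text says only that the normalized weights are used to compute the similarity score.

```python
def normalize_weights(weights):
    """Divide each weight by the sum of all weights."""
    if any(w <= 0 for w in weights):
        raise ValueError("each match_weight must be a positive number")
    total = sum(weights)
    return [w / total for w in weights]


def weighted_score(similarities, weights):
    """Assumption: the overall score is the weighted sum of per-pair scores."""
    normalized = normalize_weights(weights)
    return sum(s * w for s, w in zip(similarities, normalized))


if __name__ == "__main__":
    print(normalize_weights([3, 7]))           # the weights from the example
    print(weighted_score([1.0, 0.5], [3, 7]))  # per-pair scores are made up
```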

The parameter synonym_file specifies the dictionary that the function uses to check the two strings for semantic equality. In the dictionary, each line is a comma-separated list of synonyms. You must install the dictionary before running the function.
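To make the dictionary format concrete, the sketch below parses lines of comma-separated synonyms and checks two strings for membership in the same group. The sample entries, the helper names, and the case-sensitive comparison are all assumptions; how the function applies the dictionary internally is not described here.

```python
def load_synonyms(lines):
    """Map each term to the set of terms that share a line with it."""
    groups = {}
    for line in lines:
        terms = {t.strip() for t in line.split(",") if t.strip()}
        for term in terms:
            groups.setdefault(term, set()).update(terms)
    return groups


def are_synonyms(groups, a, b):
    """True if a and b are identical or appear together on some line."""
    return a == b or b in groups.get(a, set())


if __name__ == "__main__":
    # Hypothetical dictionary content, one synonym list per line.
    sample = ["Bill,William,Will", "Bob,Robert"]
    groups = load_synonyms(sample)
    print(are_synonyms(groups, "Bill", "William"))
    print(are_synonyms(groups, "Bill", "Robert"))
```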
Accumulate Optional Specifies input table columns to copy to the output table.
Threshold Optional Specifies the threshold similarity score, a DOUBLE PRECISION value between 0 and 1. The default value is 0.5. The function outputs only the records whose similarity score exceeds this threshold.
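The effect of Threshold can be pictured as a strict filter on scored record pairs. The sketch below is illustrative only: the function name and the (pair, score) layout are hypothetical, and "exceeds" is read as strictly greater than.

```python
def filter_matches(scored_pairs, threshold=0.5):
    """Keep only the pairs whose score is strictly greater than threshold."""
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    return [(pair, score) for pair, score in scored_pairs if score > threshold]


if __name__ == "__main__":
    # Hypothetical (source id, reference id) pairs with made-up scores.
    scored = [((1, 10), 0.9), ((2, 11), 0.5), ((3, 12), 0.2)]
    print(filter_matches(scored))  # a score equal to the threshold is dropped
```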