IdentityMatch Arguments - Aster Analytics

Teradata Aster® Analytics Foundation User GuideUpdate 2

Product
Aster Analytics
Release Number
7.00.02
Published
September 2017
Language
English (United States)
Last Update
2018-04-17
dita:mapPath
uce1497542673292.ditamap
dita:ditavalPath
AA-notempfilter_pdf_output.ditaval
dita:id
B700-1022
lifecycle
previous
Product Category
Software
IDColumn
Specifies the names of the columns in the source and reference input tables that contain row identifiers. The function copies these columns to the output table.
NominalMatchColumns
[Required if you omit FuzzyMatchColumns.] Specifies pairs of columns (attributes) to check for exact matching (a.columnX and b.columnY are column names). If any pair matches exactly, their records are considered to be exact matches.
FuzzyMatchColumns
[Required if you omit NominalMatchColumns.] Specifies pairs of columns (attributes) to check for fuzzy matching (a.columnX and b.columnY are column names) and the fuzzy matching parameters match_metric, match_weight, and synonym_file (whose descriptions follow). If any pair is a fuzzy match, their records are considered to be fuzzy matches.
The parameter match_metric specifies the similarity metric, which is a function that returns the similarity score of two strings (a value between 0 and 1). The possible values of match_metric are:
  • EQUAL

    If strings a and b are equal, their similarity score is 1.0; otherwise it is 0.0.

  • LD

    The similarity score of strings a and b is LD(a,b)/max(len(a),len(b)), where LD(a,b) is the Levenshtein distance between a and b.

  • D-LD

    Like LD except that LD is the Damerau–Levenshtein distance between a and b.

  • JARO

    The similarity score of strings a and b is the Jaro distance between them.

  • JARO-WINKLER

    The similarity score of strings a and b is the Jaro-Winkler distance between them.

  • NEEDLEMAN-WUNSCH

    The similarity score of strings a and b is the Needleman-Wunsch distance between them.

  • JD

    The similarity score of strings a and b is the Jaccard distance between them. The function converts the strings a and b to sets s and t by splitting them by space and then uses the formula f(s,t)=|s∩t|/|s∪t|.

  • COSINE

    The similarity score of strings a and b is calculated with their term frequency-inverse document frequency (TF-IDF) and cosine similarity.

    The function calculates IDF only on the input relation stored in memory.
The parameter match_weight specifies the weight (relative importance) of the attribute represented by a.columnX and b.columnY. The match_weight must be a positive number.

The function normalizes each match_weight to a value in the range [0, 1]. Given match_weight values, w 1, w 2, ..., w n, the normalized value of w i is:

w i/(w 1 +w 2 + ...+ w n)

For example, given two pairs of columns, whose match weights are 3 and 7, the function uses the weights 3/(3+7)=0.3 and 7/(3+7)=0.7 to compute the similarity score.

The parameter synonym_file specifies the dictionary in which the function checks the two strings for semantic equality. In the dictionary, each line is a comma-separated list of synonyms. Default behavior: The function does not use a dictionary.
You must install the dictionary before running the function.
Accumulate
[Optional] Specifies input table columns to copy to the output table.
Threshold
[Optional] Specifies the threshold similarity score, a DOUBLE PRECISION value between 0 and 1. Default: 0.5. The function outputs only the records whose similarity score exceeds threshold.