7.00.02 - IdentityMatch Arguments - Aster Analytics

Teradata Aster® Analytics Foundation User GuideUpdate 2

Product
Aster Analytics
Release Number
7.00.02
Published
September 2017
Language
English (United States)
Last Update
2018-04-17
dita:mapPath
uce1497542673292.ditamap
dita:ditavalPath
AA-notempfilter_pdf_output.ditaval
dita:id
zuk1466006200888
IDColumn
Specifies the names of the columns in the source and reference input tables that contain row identifiers. The function copies these columns to the output table.
NominalMatchColumns
[Required if you omit FuzzyMatchColumns.] Specifies pairs of columns (attributes) to check for exact matching (a.columnX and b.columnY are column names). If any pair matches exactly, their records are considered to be exact matches.
FuzzyMatchColumns
[Required if you omit NominalMatchColumns.] Specifies pairs of columns (attributes) to check for fuzzy matching (a.columnX and b.columnY are column names) and the fuzzy matching parameters match_metric, match_weight, and synonym_file (whose descriptions follow). If any pair is a fuzzy match, their records are considered to be fuzzy matches.
The parameter match_metric specifies the similarity metric, which is a function that returns the similarity score of two strings (a value between 0 and 1). The possible values of match_metric are:
  • EQUAL

    If strings a and b are equal, their similarity score is 1.0; otherwise it is 0.0.

  • LD

    The similarity score of strings a and b is LD(a,b)/max(len(a),len(b)), where LD(a,b) is the Levenshtein distance between a and b.

  • D-LD

    Like LD except that LD is the Damerau–Levenshtein distance between a and b.

  • JARO

    The similarity score of strings a and b is the Jaro distance between them.

  • JARO-WINKLER

    The similarity score of strings a and b is the Jaro-Winkler distance between them.

  • NEEDLEMAN-WUNSCH

    The similarity score of strings a and b is the Needleman-Wunsch distance between them.

  • JD

    The similarity score of strings a and b is the Jaccard distance between them. The function converts the strings a and b to sets s and t by splitting them by space and then uses the formula f(s,t)=|s∩t|/|s∪t|.

  • COSINE

    The similarity score of strings a and b is calculated with their term frequency-inverse document frequency (TF-IDF) and cosine similarity.

    The function calculates IDF only on the input relation stored in memory.
The parameter match_weight specifies the weight (relative importance) of the attribute represented by a.columnX and b.columnY. The match_weight must be a positive number.

The function normalizes each match_weight to a value in the range [0, 1]. Given match_weight values, w 1, w 2, ..., w n, the normalized value of w i is:

w i/(w 1 +w 2 + ...+ w n)

For example, given two pairs of columns, whose match weights are 3 and 7, the function uses the weights 3/(3+7)=0.3 and 7/(3+7)=0.7 to compute the similarity score.

The parameter synonym_file specifies the dictionary in which the function checks the two strings for semantic equality. In the dictionary, each line is a comma-separated list of synonyms. Default behavior: The function does not use a dictionary.
You must install the dictionary before running the function.
Accumulate
[Optional] Specifies input table columns to copy to the output table.
Threshold
[Optional] Specifies the threshold similarity score, a DOUBLE PRECISION value between 0 and 1. Default: 0.5. The function outputs only the records whose similarity score exceeds threshold.