IdentityMatch Function Syntax Elements | Teradata Vantage - IdentityMatch Syntax Elements - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product
Teradata Vantage
Release Number
9.02
9.01
2.0
1.3
Published
February 2022
Language
English (United States)
Last Update
2022-02-10
dita:mapPath
rnn1580259159235.ditamap
dita:ditavalPath
ybt1582220416951.ditaval
dita:id
B700-4003
lifecycle
previous
Product Category
Teradata Vantage™
IDColumn
Specify the names of the columns in SourceTable and ReferenceTable that contain row identifiers. The function copies these columns to the output table.
NominalMatchColumns
[Required if you omit FuzzyMatchColumns.] Specify pairs of columns (attributes) to check for exact matching (a.columnX and b.columnY are column names). If any pair matches exactly, their records are considered exact matches, and the function does not compare the fuzzy match attributes.
FuzzyMatchColumns
[Required if you omit NominalMatchColumns.] Specify pairs of columns (attributes) to check for fuzzy matching (a.columnX and b.columnY are column names) and fuzzy matching parameters. If any pair is a fuzzy match, their records are considered fuzzy matches, and the function reports the similarity score of these attributes.
Fuzzy matching parameters:
match_metric
Specify the similarity metric, which is a function that returns the similarity score of two strings (a value between 0 and 1), as follows:
match_metric Similarity Score
EQUAL 1.0 if strings a and b are equal; otherwise 0.0.
LD LD(a,b)/max(len(a),len(b)), where LD(a,b) is the Levenshtein distance between a and b.
D-LD LD(a,b)/max(len(a),len(b)), where LD(a,b) is the Damerau-Levenshtein distance between a and b.
JARO Jaro distance between a and b.
JARO-WINKLER Jaro-Winkler distance between a and b.
NEEDLEMAN-WUNSCH Needleman-Wunsch distance between a and b.
JD Jaccard distance between a and b. The function converts the strings a and b to sets s and t by splitting them by space and then uses the formula f(s,t)=|s∩t|/|s∪t|.
COSINE Calculated with term frequency-inverse document frequency (TF-IDF) and cosine similarity of a and b.

The function calculates IDF only on the input relation stored in memory.

match_weight
Specify the weight (relative importance) of the attribute represented by a.columnX and b.columnY. The match_weight must be a positive number.

The function normalizes each match_weight to a value in the range [0, 1]. Given match_weight values, w1, w2, ..., wn, this is the formula for the normalized value of wi:

wi/(w1+w2+ ...+ wn)

For example, given two pairs of columns, whose match weights are 3 and 7, the function uses the weights 3/(3+7)=0.3 and 7/(3+7)=0.7 to compute the similarity score.

synonym_file
Specify the dictionary in which the function checks the two strings for semantic equality. In the dictionary, each line is a comma-separated list of synonyms.
You must install the dictionary on ML Engine before running the function.
Default behavior: The function does not use a dictionary.
NullHandling
[Optional] Specify function behavior for strings with null values:
Option Description
'mismatch' (default) If strings do not match, their similarity score is 0.
'match-if-null' If one string is null and the other is not, their similarity score is 1.
'match-if-both-null' If both strings are null, their similarity score is 1.
Accumulate
[Optional] Specify SourceTable and ReferenceTable columns to copy to the output table.
ThresholdScore
[Optional] Specify the threshold similarity score, a DOUBLE PRECISION value between 0 and 1. The function outputs only the records whose similarity score exceeds threshold. The higher the threshold, the higher the matching accuracy.
Default: 0.5