- IDColumn
- Specifies the names of the columns in the source and reference input tables that contain row identifiers. The function copies these columns to the output table.
- NominalMatchColumns
- [Required if you omit FuzzyMatchColumns.] Specifies pairs of columns (attributes) to check for exact matching (a.columnX and b.columnY are column names). If any pair matches exactly, their records are considered to be exact matches.
- FuzzyMatchColumns
- [Required if you omit NominalMatchColumns.] Specifies pairs of columns (attributes) to check for fuzzy matching (a.columnX and b.columnY are column names) and the fuzzy matching parameters match_metric, match_weight, and synonym_file (whose descriptions follow). If any pair is a fuzzy match, their records are considered to be fuzzy matches.The parameter match_metric specifies the similarity metric, which is a function that returns the similarity score of two strings (a value between 0 and 1). The possible values of match_metric are:The parameter match_weight specifies the weight (relative importance) of the attribute represented by a.columnX and b.columnY. The match_weight must be a positive number.
-
EQUAL
If strings a and b are equal, their similarity score is 1.0; otherwise it is 0.0.
-
LD
The similarity score of strings a and b is LD(a,b)/max(len(a),len(b)), where LD(a,b) is the Levenshtein distance between a and b.
-
D-LD
Like LD except that LD is the Damerau–Levenshtein distance between a and b.
-
JARO
The similarity score of strings a and b is the Jaro distance between them.
-
JARO-WINKLER
The similarity score of strings a and b is the Jaro-Winkler distance between them.
-
NEEDLEMAN-WUNSCH
The similarity score of strings a and b is the Needleman-Wunsch distance between them.
-
JD
The similarity score of strings a and b is the Jaccard distance between them. The function converts the strings a and b to sets s and t by splitting them by space and then uses the formula f(s,t)=|s∩t|/|s∪t|.
-
COSINE
The similarity score of strings a and b is calculated with their term frequency-inverse document frequency (TF-IDF) and cosine similarity.
The function calculates IDF only on the input relation stored in memory.
The function normalizes each match_weight to a value in the range [0, 1]. Given match_weight values, w 1, w 2, ..., w n, the normalized value of w i is:
w i/(w 1 +w 2 + ...+ w n)
For example, given two pairs of columns, whose match weights are 3 and 7, the function uses the weights 3/(3+7)=0.3 and 7/(3+7)=0.7 to compute the similarity score.
The parameter synonym_file specifies the dictionary in which the function checks the two strings for semantic equality. In the dictionary, each line is a comma-separated list of synonyms. Default behavior: The function does not use a dictionary.You must install the dictionary before running the function. -
EQUAL
- Accumulate
- [Optional] Specifies input table columns to copy to the output table.
- Threshold
- [Optional] Specifies the threshold similarity score, a DOUBLE PRECISION value between 0 and 1. Default: 0.5. The function outputs only the records whose similarity score exceeds threshold.