Argument | Category | Description |
---|---|---|
IDColumn | Required | Specifies the names of the columns in the source and reference input tables that contain row identifiers. The function copies these columns to the output table. |
NominalMatchColumns | Optional* | Specifies pairs of columns (attributes) to check for exact matching (a.columnX and b.columnY are column names). If any pair matches exactly, then their records are considered to be exact matches. *Required if you omit FuzzyMatchColumns. |
FuzzyMatchColumns | Optional* | Specifies pairs of columns (attributes) to check for fuzzy matching (a.columnX and b.columnY are column names) and the fuzzy matching parameters match_metric, match_weight, and synonym_file (whose descriptions follow). If any pair is a fuzzy match, then their records are considered to be fuzzy matches. *Required if you omit NominalMatchColumns.<br>The parameter match_metric specifies the similarity metric, which is a function that returns the similarity score of two strings (a value between 0 and 1). The possible values of match_metric are:<br>The function calculates IDF only on the input relation stored in memory.<br>The parameter match_weight specifies the weight (relative importance) of the attribute represented by a.columnX and b.columnY. The match_weight must be a positive number. The function normalizes each match_weight to a value in the range [0, 1]. Given match_weight values $w_1, w_2, \dots, w_n$, the normalized value of $w_i$ is $w_i / (w_1 + w_2 + \dots + w_n)$. For example, given two pairs of columns, whose match weights are 3 and 7, the function uses the weights 3/(3+7)=0.3 and 7/(3+7)=0.7 to compute the similarity score.<br>The parameter synonym_file specifies the dictionary that the function uses to check the two strings for semantic equality. In the dictionary, each line is a comma-separated list of synonyms. You must install the dictionary before running the function. |
Accumulate | Optional | Specifies input table columns to copy to the output table. |
Threshold | Optional | Specifies the threshold similarity score, a DOUBLE PRECISION value between 0 and 1. The default value is 0.5. The function outputs only the records whose similarity score exceeds this threshold. |
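
The weight normalization, synonym dictionary format, and Threshold filter described in the table can be illustrated with a minimal Python sketch. It is not the function's implementation: the helper names (`normalize_weights`, `combined_score`, `load_synonyms`, `are_synonyms`, `passes_threshold`) are hypothetical, and combining per-pair similarities as a weighted sum and comparing synonyms case-insensitively are assumptions made only for illustration.

```python
def normalize_weights(weights):
    """Scale positive match_weight values so that they sum to 1."""
    if any(w <= 0 for w in weights):
        raise ValueError("each match_weight must be a positive number")
    total = sum(weights)
    return [w / total for w in weights]


def combined_score(similarities, weights):
    """Assumed combination rule: weighted sum of per-pair similarity scores."""
    normalized = normalize_weights(weights)
    return sum(s * w for s, w in zip(similarities, normalized))


def load_synonyms(lines):
    """Build a term -> synonym-group map; each line is a comma-separated list."""
    groups = {}
    for line in lines:
        # Lowercasing is an assumption; the table does not state case handling.
        terms = {t.strip().lower() for t in line.split(",") if t.strip()}
        for term in terms:
            groups.setdefault(term, set()).update(terms)
    return groups


def are_synonyms(a, b, groups):
    """True if the two strings are equal or listed on the same dictionary line."""
    a, b = a.strip().lower(), b.strip().lower()
    return a == b or b in groups.get(a, set())


def passes_threshold(score, threshold=0.5):
    """Keep a record only if its score strictly exceeds the threshold."""
    return score > threshold


if __name__ == "__main__":
    # Match weights 3 and 7, as in the example in the table.
    print(normalize_weights([3, 7]))
    score = combined_score([1.0, 0.5], [3, 7])
    print(score, passes_threshold(score))
    synonyms = load_synonyms(["Bill, William, Will", "Bob, Robert"])
    print(are_synonyms("bill", "William", synonyms))
```

Because the filter uses a strict comparison, a record whose score equals the threshold exactly is not output, which matches the wording "exceeds" in the Threshold description.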