Arguments - Aster Analytics

Teradata Aster Analytics Foundation User Guide

Product: Aster Analytics
Release Number: 6.21
Published: November 2016
Language: English (United States)
Last Update: 2018-04-14
dita:mapPath: kiu1466024880662.ditamap
dita:ditavalPath: AA-notempfilter_pdf_output.ditaval
dita:id: B700-1021
lifecycle: previous
Product Category: Software

Argument	Category	Description
IDColumn	Required	Specifies the names of the columns in the source and reference input tables that contain row identifiers. The function copies these columns to the output table.
NominalMatchColumns	Optional*	Specifies pairs of columns (attributes) to check for exact matching (a.columnX and b.columnY are column names). If any pair matches exactly, then their records are considered to be exact matches. *Required if you omit FuzzyMatchColumns.
FuzzyMatchColumns	Optional*	Specifies pairs of columns (attributes) to check for fuzzy matching (a.columnX and b.columnY are column names) and the fuzzy matching parameters match_metric, match_weight, and synonym_file (whose descriptions follow). If any pair is a fuzzy match, then their records are considered to be fuzzy matches. *Required if you omit NominalMatchColumns. The parameter match_metric specifies the similarity metric, which is a function that returns the similarity score of two strings (a value between 0 and 1). The possible values of match_metric are: EQUAL: If strings a and b are equal, then their similarity score is 1.0; otherwise it 0.0. LD: The similarity score of strings a and b is f(a,b)=LD(a,b)/max(len(a),len(b)), where LD(a,b) is the Levenshtein distance between a with b. D-LD: Like LD except that LD is the Damerau–Levenshtein distance between a with b. JARO: The similarity score of strings a and b is the Jaro distance between them. JARO-WINKLER: The similarity score of strings a and b is the Jaro-Winkler distance between them. NEEDLEMAN-WUNSCH: The similarity score of strings a and b is the Needleman-Wunsch distance between them. JD: The similarity score of strings a and b is the Jaccard distance between them. The function converts the strings a and b to sets s and t by splitting them by space and then uses the formula f(s,t)=\|s∩t\|/\|s∪t\|. COSINE: The similarity score of strings a and b is calculated with their term frequency–inverse document frequency (TF-IDF) and cosine similarity. The function calculates IDF only on the input relation stored in memory.
		The parameter match_weight specifies the weight (relative importance) of the attribute represented by a.columnX and b.columnY. The match_weight must be a positive number. The function normalizes each match_weight to a value in the range [0, 1]. Given match_weight values, w 1 , w 2 , ..., w n, the normalized value of w i is: w i/(w 1 +w 2 + ...+ w n) For example, given two pairs of columns, whose match weights are 3 and 7, the function uses the weights 3/(3+7)=0.3 and 7/(3+7)=0.7 to compute the similarity score.
		The parameter synonym_file specifies the dictionary that the function uses to check the two strings for semantic equality. In the dictionary, each line is a comma-separated list of synonyms. You must install the dictionary before running the function.
Accumulate	Optional	Specifies input table columns to copy to the output table.
Threshold	Optional	Specifies the threshold similarity score, a DOUBLE PRECISION value between 0 and 1. The default value is 0.5. The function outputs only the records whose similarity score exceeds threshold.