StringSimilarity Example 1: Compare src_text1 to tar_text - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product
Teradata Vantage
Release Number
8.00
1.0
Published
May 2019
Language
English (United States)
Last Update
2019-11-22
dita:mapPath
blj1506016597986.ditamap
dita:ditavalPath
blj1506016597986.ditaval
dita:id
B700-4003
lifecycle
previous
Product Category
Teradata Vantageā„¢

Input

The input table, strsimilarity_input, has two source columns (src_text1 and src_text2) to which the function compares the target column (tar_text). The function calculates the similarity scores by the methods specified by the ComparisonColumnPairs argument (jaro, jaro-winkler, NGrams, Levenshtein Distance). For clarity, separate examples show the comparison of each source column with the target column. With some modifications, you can use the output of this function as input to the FellegiSunter functions.

strsimilarity_input
id src_text1 src_text2 tar_text
1 astre astter aster
2 hone fone phone
3 acqiese acquire acquiesce
4 AAAACCCCCGGGGA CCCGGGAACCAACC CCAGGGAAACCCAC
5 alice allen allies
6 angela angle angels
7 senter center centre
8 chef cheap chief
9 circus circle circuit
10 debt debut debris
11 deal dell lead
12 bare bear bear

SQL Call

SELECT * FROM StringSimilarity (
  ON strsimilarity_input PARTITION BY ANY
  USING
  ComparisonColumnPairs ('jaro (src_text1 , tar_text ) AS jaro1_sim',
                         'LD (src_text1 , tar_text, 2) AS ld1_sim',
                         'n_gram (src_text1 , tar_text, 2) AS ngram1_sim',
                         'jaro_winkler (src_text1 , tar_text, 2) AS jw1_sim'
  )
  CaseSensitive ('true')
  Accumulate ('id', 'src_text1', 'tar_text')
) AS dt ORDER BY id;

Output

id src_text1 tar_text jaro1_sim ld1_sim ngram1_sim jw1_sim
1 astre aster 0.933333333333333 0.6 0.5 0.953333333333333
2 hone phone 0.933333333333333 0.8 0.75 0.933333333333333
3 acqiese acquiesce 0.925925925925926 0.777777777777778 0.5 0.948148148148148
4 AAAACCCCCGGGGA CCAGGGAAACCCAC 0.824175824175824 0.214285714285714 0.384615384615385 0.824175824175824
5 alice allies 0.822222222222222 0.5 0.4 0.857777777777778
6 angela angels 0.888888888888889 0.833333333333333 0.8 0.933333333333333
7 senter centre 0.822222222222222 0.5 0.4 0.822222222222222
8 chef chief 0.933333333333333 0.8 0.5 0.946666666666667
9 circus circuit 0.849206349206349 0.714285714285714 0.666666666666667 0.90952380952381
10 debt debris 0.75 0.5 0.4 0.825
11 deal lead 0.666666666666667 0.5 0.333333333333333 0.666666666666667
12 bare bear 0.833333333333333 0.5 0.333333333333333 0.85