1.1 - 8.10 - StringSimilarity_MLE Example: Compare src_text1 to tar_text - Teradata Vantage

Teradata Vantage™ - Machine Learning Engine Analytic Function Reference

Product
Teradata Vantage
Release Number
1.1
8.10
Release Date
October 2019
Content Type
Programming Reference
Publication ID
B700-4003-079K
Language
English (United States)

Input

The input table, strsimilarity_input, has two source columns (src_text1 and src_text2) to which the function compares the target column (tar_text). The function calculates the similarity scores by the methods specified by the ComparisonColumnPairs syntax element (jaro, jaro-winkler, NGramSplitter_MLE, Levenshtein Distance). For clarity, separate examples show the comparison of each source column with the target column. With some modifications, you can use the output of this function as input to the FellegiSunter functions.

 id |   src_text1    |   src_text2    |    tar_text    
----+----------------+----------------+----------------
  2 | hone           | fone           | phone
  5 | alice          | allen          | allies
  7 | senter         | center         | centre
 10 | debt           | debut          | debris
  3 | acqiese        | acquire        | acquiesce
  6 | angela         | angle          | angels
  8 | chef           | cheap          | chief
 11 | deal           | dell           | lead
  4 | AAAACCCCCGGGGA | CCCGGGAACCAACC | CCAGGGAAACCCAC
  9 | circus         | circle         | circuit
 12 | bare           | bear           | bear
(11 rows)

SQL Call

SELECT * FROM StringSimilarity_MLE (
  ON strsimilarity_input PARTITION BY ANY
  USING
  ComparisonColumnPairs ('jaro (src_text1 , tar_text ) AS jaro1_sim',
                         'LD (src_text1 , tar_text, 2) AS ld1_sim',
                         'n_gram (src_text1 , tar_text, 2) AS ngram1_sim',
                         'jaro_winkler (src_text1 , tar_text) AS jw1_sim'
  )
  CaseSensitive ('true')
  Accumulate ('id', 'src_text1', 'tar_text')
) AS dt ORDER BY id;

Output

 id src_text1      tar_text       jaro1_sim          ld1_sim             ngram1_sim          jw1_sim            
 -- -------------- -------------- ------------------ ------------------- ------------------- ------------------ 
  2 hone           phone          0.9333333333333332                 0.8                0.75 0.9333333333333332
  3 acqiese        acquiesce      0.9259259259259259  0.7777777777777778                 0.5 0.9481481481481482
  4 aaaacccccgggga ccagggaaacccac 0.8241758241758242 0.21428571428571427 0.38461538461538464 0.8241758241758242
  5 alice          allies         0.8222222222222223                 0.5                 0.4 0.8577777777777779
  6 angela         angels          0.888888888888889  0.8333333333333334                 0.8 0.9333333333333333
  7 senter         centre         0.8222222222222223                 0.5                 0.4 0.8222222222222223
  8 chef           chief          0.9333333333333332                 0.8                 0.5 0.9466666666666665
  9 circus         circuit        0.8492063492063492  0.7142857142857143  0.6666666666666666 0.9095238095238095
 10 debt           debris                       0.75                 0.5                 0.4              0.825
 11 deal           lead           0.6666666666666666                 0.5  0.3333333333333333 0.6666666666666666
 12 bare           bear           0.8333333333333334                 0.5  0.3333333333333333 0.8500000000000001

Download a zip file of all examples and a SQL script file that creates their input tables from the attachment in the left sidebar.