StringSimilarity_MLE Example: Compare src_text1 to tar_text | Teradata Vantage - StringSimilarity_MLE Example: Compare src_text1 to tar_text - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product
Teradata Vantage
Release Number
9.02
9.01
2.0
1.3
Published
February 2022
Language
English (United States)
Last Update
2022-02-10
dita:mapPath
rnn1580259159235.ditamap
dita:ditavalPath
ybt1582220416951.ditaval
dita:id
B700-4003
lifecycle
previous
Product Category
Teradata Vantageā„¢

Input

The input table, strsimilarity_input, has two source columns (src_text1 and src_text2) to which the function compares the target column (tar_text). The function calculates the similarity scores by the methods specified by the ComparisonColumnPairs syntax element (jaro, jaro-winkler, NGramSplitter_MLE, Levenshtein Distance). For clarity, separate examples show the comparison of each source column with the target column. With some modifications, you can use the output of this function as input to the FellegiSunter functions.

 id |   src_text1    |   src_text2    |    tar_text    
----+----------------+----------------+----------------
  2 | hone           | fone           | phone
  5 | alice          | allen          | allies
  7 | senter         | center         | centre
 10 | debt           | debut          | debris
  3 | acqiese        | acquire        | acquiesce
  6 | angela         | angle          | angels
  8 | chef           | cheap          | chief
 11 | deal           | dell           | lead
  4 | AAAACCCCCGGGGA | CCCGGGAACCAACC | CCAGGGAAACCCAC
  9 | circus         | circle         | circuit
 12 | bare           | bear           | bear
(11 rows)

SQL Call

SELECT * FROM StringSimilarity_MLE (
  ON strsimilarity_input PARTITION BY ANY
  USING
  ComparisonColumnPairs ('jaro (src_text1 , tar_text ) AS jaro1_sim',
                         'LD (src_text1 , tar_text) AS ld1_sim',
                         'n_gram (src_text1 , tar_text, 2) AS ngram1_sim',
                         'jaro_winkler (src_text1 , tar_text) AS jw1_sim'
  )
  CaseSensitive ('true')
  Accumulate ('id', 'src_text1', 'tar_text')
) AS dt ORDER BY id;

Output

 id src_text1      tar_text       jaro1_sim          ld1_sim             ngram1_sim          jw1_sim            
 -- -------------- -------------- ------------------ ------------------- ------------------- ------------------ 
  2 hone           phone          0.9333333333333332                 0.8                0.75 0.9333333333333332
  3 acqiese        acquiesce      0.9259259259259259  0.7777777777777778                 0.5 0.9481481481481482
  4 aaaacccccgggga ccagggaaacccac 0.8241758241758242 0.21428571428571427 0.38461538461538464 0.8241758241758242
  5 alice          allies         0.8222222222222223                 0.5                 0.4 0.8577777777777779
  6 angela         angels          0.888888888888889  0.8333333333333334                 0.8 0.9333333333333333
  7 senter         centre         0.8222222222222223                 0.5                 0.4 0.8222222222222223
  8 chef           chief          0.9333333333333332                 0.8                 0.5 0.9466666666666665
  9 circus         circuit        0.8492063492063492  0.7142857142857143  0.6666666666666666 0.9095238095238095
 10 debt           debris                       0.75                 0.5                 0.4              0.825
 11 deal           lead           0.6666666666666666                 0.5  0.3333333333333333 0.6666666666666666
 12 bare           bear           0.8333333333333334                 0.5  0.3333333333333333 0.8500000000000001

Download a zip file of all examples and a SQL script file that creates their input tables.