SQL to generate FellegiSunterTrainer input from StringSimilarity function - Aster Analytics

Teradata Aster Analytics Foundation User Guide

Product
Aster Analytics
Release Number
6.21
Published
November 2016
Language
English (United States)
Last Update
2018-04-14
dita:mapPath
kiu1466024880662.ditamap
dita:ditavalPath
AA-notempfilter_pdf_output.ditaval
dita:id
B700-1021
lifecycle
previous
Product Category
Software
DROP TABLE IF EXISTS fspredict_input;

CREATE FACT TABLE fspredict_input (PARTITION KEY (id)) AS
SELECT * FROM StringSimilarity (
  ON strsimilarity_input PARTITION BY ANY
  ComparisonColumnPairs (
                  'jaro (src_text2 , tar_text ) AS jaro1_sim',
                  'LD (src_text2 , tar_text, 2) AS ld1_sim',
                  'n_gram (src_text2 , tar_text, 2) AS ngram1_sim',
                  'jaro_winkler (src_text2 , tar_text, 2) AS jw1_sim'
  )
  CaseSensitive ('true')
  Accumulate ('id','src_text2','tar_text')
);

SELECT * FROM fspredict_input ORDER BY 1;
FellegiSunterPredict Example Input Table fspredict_input (Columns 1-4)
id src_text2 tar_text jaro1_sim
1 astter aster 0.944444444444445
2 fone phone 0.783333333333333
3 acquire acquiesce 0.841269841269841
4 CCCGGGAACCAACC CCAGGGAAACCCAC 0.875457875457875
5 allen allies 0.822222222222222
6 angle angels 0.877777777777778
7 center centre 0.944444444444445
8 cheap chief 0.733333333333333
9 circle circuit 0.746031746031746
10 debut debris 0.7
11 dell lead 0.5
12 bear bear 1
FellegiSunterPredict Example Input Table fspredict_input (Columns 5-7)
ld1_sim ngram1_sim jw1_sim
0.833333333333333 0.8 0.961111111111111
0.6 0.5 0.783333333333333
0.666666666666667 0.5 0.904761904761905
0.714285714285714 0.692307692307692 0.9003663003663
0.666666666666667 0.4 0.875555555555556
0.666666666666667 0.4 0.914444444444445
0.666666666666667 0.6 0.966666666666667
0.4 0.25 0.786666666666667
0.571428571428571 0.5 0.847619047619048
0.5 0.4 0.79
0.25 0 0.5
1 1 1

The above input table compares the source column (src_txt2) with the reference column (tar_text) and gives the different similarity scores based on 'jaro', Levenshtein Distance, ngram and jaro-winkler metrics, as described in StringSimilarity. The tar_text column is the same as the input in FellegiSunterTrainer function, while 'src_txt2' is the new column required for prediction.