SQL to generate FellegiSunterTrainer input from StringSimilarity function

SQL to generate FellegiSunterTrainer input from StringSimilarity function - Aster Analytics

Teradata Aster Analytics Foundation User Guide

Product

Aster Analytics

Release Number

6.21

Published

November 2016

Language

English (United States)

Last Update

2018-04-14

dita:mapPath

kiu1466024880662.ditamap

dita:ditavalPath

AA-notempfilter_pdf_output.ditaval

dita:id

B700-1021

lifecycle

Product Category

Software

DROP TABLE IF EXISTS fspredict_input;

CREATE FACT TABLE fspredict_input (PARTITION KEY (id)) AS
SELECT * FROM StringSimilarity (
  ON strsimilarity_input PARTITION BY ANY
  ComparisonColumnPairs (
                  'jaro (src_text2 , tar_text ) AS jaro1_sim',
                  'LD (src_text2 , tar_text, 2) AS ld1_sim',
                  'n_gram (src_text2 , tar_text, 2) AS ngram1_sim',
                  'jaro_winkler (src_text2 , tar_text, 2) AS jw1_sim'
  )
  CaseSensitive ('true')
  Accumulate ('id','src_text2','tar_text')
);

SELECT * FROM fspredict_input ORDER BY 1;

FellegiSunterPredict Example Input Table fspredict_input (Columns 1-4)
id	src_text2	tar_text	jaro1_sim
1	astter	aster	0.944444444444445
2	fone	phone	0.783333333333333
3	acquire	acquiesce	0.841269841269841
4	CCCGGGAACCAACC	CCAGGGAAACCCAC	0.875457875457875
5	allen	allies	0.822222222222222
6	angle	angels	0.877777777777778
7	center	centre	0.944444444444445
8	cheap	chief	0.733333333333333
9	circle	circuit	0.746031746031746
10	debut	debris	0.7
11	dell	lead	0.5
12	bear	bear	1

FellegiSunterPredict Example Input Table fspredict_input (Columns 5-7)
ld1_sim	ngram1_sim	jw1_sim
0.833333333333333	0.8	0.961111111111111
0.6	0.5	0.783333333333333
0.666666666666667	0.5	0.904761904761905
0.714285714285714	0.692307692307692	0.9003663003663
0.666666666666667	0.4	0.875555555555556
0.666666666666667	0.4	0.914444444444445
0.666666666666667	0.6	0.966666666666667
0.4	0.25	0.786666666666667
0.571428571428571	0.5	0.847619047619048
0.5	0.4	0.79
0.25	0	0.5
1	1	1

The above input table compares the source column (src_txt2) with the reference column (tar_text) and gives the different similarity scores based on 'jaro', Levenshtein Distance, ngram and jaro-winkler metrics, as described in StringSimilarity. The tar_text column is the same as the input in FellegiSunterTrainer function, while 'src_txt2' is the new column required for prediction.