SQL to generate FellegiSunterTrainer input from output of StringSimilarity function

Teradata Aster Analytics Foundation User Guide

Product: Aster Analytics
Release Number: 6.21
Published: November 2016
Language: English (United States)
Last Update: 2018-04-14

-- Remove any earlier copy of the training table.
DROP TABLE IF EXISTS fstrainer_input;

-- Compute four similarity scores for each (src_text1, tar_text) pair
-- and store them, with the accumulated input columns, in fstrainer_input.
CREATE FACT TABLE fstrainer_input (PARTITION KEY (id)) AS
SELECT * FROM StringSimilarity (
  ON strsimilarity_input PARTITION BY ANY
  ComparisonColumnPairs (
    'jaro (src_text1 , tar_text ) AS jaro1_sim',
    'LD (src_text1 , tar_text, 2) AS ld1_sim',
    'n_gram (src_text1 , tar_text, 2) AS ngram1_sim',
    'jaro_winkler (src_text1 , tar_text, 2) AS jw1_sim'
  )
  CaseSensitive ('true')
  Accumulate ('id','src_text1','tar_text')
);

-- Add the label column that FellegiSunterTrainer reads.
ALTER TABLE fstrainer_input
ADD COLUMN match_tag varchar;

-- Label the pairs that are matches ('M').
UPDATE fstrainer_input SET match_tag = 'M' WHERE id IN (1, 2, 3, 6, 8, 9);

-- Label the pairs that are non-matches ('U').
UPDATE fstrainer_input SET match_tag = 'U' WHERE id IN (4, 5, 7, 10, 11, 12);

SELECT * FROM fstrainer_input ORDER BY 1;

FellegiSunterTrainer Example Input Table fstrainer_input (Columns 1-4)
id	src_text1	tar_text	jaro1_sim
1	astre	aster	0.933333333333333
2	hone	phone	0.933333333333333
3	acqiese	acquiesce	0.925925925925926
4	AAAACCCCCGGGGA	CCAGGGAAACCCAC	0.824175824175824
5	alice	allies	0.822222222222222
6	angela	angels	0.888888888888889
7	senter	centre	0.822222222222222
8	chef	chief	0.933333333333333
9	circus	circuit	0.849206349206349
10	debt	debris	0.75
11	deal	lead	0.666666666666667
12	bare	bear	0.833333333333333

FellegiSunterTrainer Example Input Table fstrainer_input (Columns 5-8)
ld1_sim	ngram1_sim	jw1_sim	match_tag
0.6	0.5	0.953333333333333	M
0.8	0.75	0.933333333333333	M
0.777777777777778	0.5	0.948148148148148	M
0.214285714285714	0.384615384615385	0.824175824175824	U
0.5	0.4	0.857777777777778	U
0.833333333333333	0.8	0.933333333333333	M
0.5	0.4	0.822222222222222	U
0.8	0.5	0.946666666666667	M
0.714285714285714	0.666666666666667	0.90952380952381	M
0.5	0.4	0.825	U
0.5	0.333333333333333	0.666666666666667	U
0.5	0.333333333333333	0.85	U

The input table above compares the source column (src_text1) with the reference column (tar_text) and reports one similarity score per metric: Jaro (jaro1_sim), Levenshtein distance (ld1_sim), n-gram (ngram1_sim), and Jaro-Winkler (jw1_sim), as described in StringSimilarity. The match_tag column labels each pair as a match ('M') or a non-match ('U').
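
Before passing the table to FellegiSunterTrainer, it can help to confirm that every pair carries a label and to see how the scores differ between the two classes. The query below is a minimal sketch in plain SQL and is not part of the documented example; the aliases n_pairs and avg_* are illustrative names, and it assumes fstrainer_input was built and labeled as shown above.

```sql
-- Sketch only: summarize the labeled training pairs by class.
-- A row with a NULL match_tag would indicate an unlabeled pair.
SELECT
  match_tag,
  COUNT(*)        AS n_pairs,      -- number of pairs with this label
  AVG(jaro1_sim)  AS avg_jaro,     -- mean Jaro similarity
  AVG(ld1_sim)    AS avg_ld,       -- mean Levenshtein-based similarity
  AVG(ngram1_sim) AS avg_ngram,    -- mean n-gram similarity
  AVG(jw1_sim)    AS avg_jw        -- mean Jaro-Winkler similarity
FROM fstrainer_input
GROUP BY match_tag
ORDER BY match_tag;
```

With the labels assigned above, the result should contain one row for 'M' and one for 'U', each covering six pairs.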