DROP TABLE IF EXISTS fspredict_input;
CREATE FACT TABLE fspredict_input (PARTITION KEY (id)) AS
SELECT * FROM StringSimilarity (
ON strsimilarity_input PARTITION BY ANY
ComparisonColumnPairs (
'jaro (src_text2 , tar_text ) AS jaro1_sim',
'LD (src_text2 , tar_text, 2) AS ld1_sim',
'n_gram (src_text2 , tar_text, 2) AS ngram1_sim',
'jaro_winkler (src_text2 , tar_text, 2) AS jw1_sim'
)
CaseSensitive ('true')
Accumulate ('id','src_text2','tar_text')
);
SELECT * FROM fspredict_input ORDER BY 1;
FellegiSunterPredict Example Input Table fspredict_input (Columns 1-4)
id |
src_text2 |
tar_text |
jaro1_sim |
1 |
astter |
aster |
0.944444444444445 |
2 |
fone |
phone |
0.783333333333333 |
3 |
acquire |
acquiesce |
0.841269841269841 |
4 |
CCCGGGAACCAACC |
CCAGGGAAACCCAC |
0.875457875457875 |
5 |
allen |
allies |
0.822222222222222 |
6 |
angle |
angels |
0.877777777777778 |
7 |
center |
centre |
0.944444444444445 |
8 |
cheap |
chief |
0.733333333333333 |
9 |
circle |
circuit |
0.746031746031746 |
10 |
debut |
debris |
0.7 |
11 |
dell |
lead |
0.5 |
12 |
bear |
bear |
1 |
FellegiSunterPredict Example Input Table fspredict_input (Columns 5-7)
ld1_sim |
ngram1_sim |
jw1_sim |
0.833333333333333 |
0.8 |
0.961111111111111 |
0.6 |
0.5 |
0.783333333333333 |
0.666666666666667 |
0.5 |
0.904761904761905 |
0.714285714285714 |
0.692307692307692 |
0.9003663003663 |
0.666666666666667 |
0.4 |
0.875555555555556 |
0.666666666666667 |
0.4 |
0.914444444444445 |
0.666666666666667 |
0.6 |
0.966666666666667 |
0.4 |
0.25 |
0.786666666666667 |
0.571428571428571 |
0.5 |
0.847619047619048 |
0.5 |
0.4 |
0.79 |
0.25 |
0 |
0.5 |
1 |
1 |
1 |
The above input table compares the source column (src_txt2) with the reference column (tar_text) and gives the different similarity scores based on 'jaro', Levenshtein Distance, ngram and jaro-winkler metrics, as described in StringSimilarity. The tar_text column is the same as the input in FellegiSunterTrainer function, while 'src_txt2' is the new column required for prediction.