The unsupervised model is created by specifying the user defined threshold values for the different metrics in the ComparisonFields argument. The initialization parameters InitialP, InitialM, InitialU, Lambda, and Mu are set to their default values. The column match_tag is not used for unsupervised learning.
Input
- InputTable: fstrainer_input, created by adding a match_tag column to the output of StringSimilarity Example 1: Compare src_text1 to tar_text:
DROP TABLE fstrainer_input; CREATE MULTISET TABLE fstrainer_input AS ( SELECT * FROM StringSimilarity ( ON strsimilarity_input PARTITION BY ANY USING ComparisonColumnPairs ( 'jaro (src_text1 , tar_text ) AS jaro1_sim', 'LD (src_text1 , tar_text, 2) AS ld1_sim', 'n_gram (src_text1 , tar_text, 2) AS ngram1_sim', 'jaro_winkler (src_text1 , tar_text, 2) AS jw1_sim' ) USING CaseSensitive ('true') Accumulate ('id','src_text1','tar_text') ) AS dt1 ) WITH DATA AS dt2 PARTITION BY id; ALTER TABLE fstrainer_input ADD column match_tag varchar; update fstrainer_input set match_tag= 'M' where id = 1; update fstrainer_input set match_tag= 'M' where id = 2; update fstrainer_input set match_tag= 'M' where id = 3; update fstrainer_input set match_tag= 'U' where id = 4; update fstrainer_input set match_tag= 'U' where id = 5; update fstrainer_input set match_tag= 'M' where id = 6; update fstrainer_input set match_tag= 'U' where id = 7; update fstrainer_input set match_tag= 'M' where id = 8; update fstrainer_input set match_tag= 'M' where id = 9; update fstrainer_input set match_tag= 'U' where id = 10; update fstrainer_input set match_tag= 'U' where id = 11; update fstrainer_input set match_tag= 'U' where id = 12; SELECT * FROM fstrainer_input ORDER BY 1;
The match_tag column is used for the supervised FellegiSunter function.
SQL Call
DROP TABLE fg_unsupervised_model; CREATE MULTISET TABLE "fg_unsupervised_model" AS ( SELECT * FROM FellegiSunter ( ON fstrainer_input AS InputTable USING ComparisonFields ('jaro1_sim: 0.8', 'ld1_sim:0.8', 'ngram1_sim:0.5', 'jw1_sim:0.8') InitialP (0.1) InitialM (0.9) InitialU (0.1) Lambda (0.9) Mu (0.9) ) AS dt ) WITH DATA;
Output
This query returns the following table:
SELECT * FROM fg_unsupervised_model ORDER BY 1;
_key | _value |
---|---|
comparison_field_cnt | 4 |
comparison_field_name_0 | jaro1_sim |
comparison_field_name_1 | ld1_sim |
comparison_field_name_2 | ngram1_sim |
comparison_field_name_3 | jw1_sim |
comparison_field_threshold_0 | 0.8 |
comparison_field_threshold_1 | 0.8 |
comparison_field_threshold_2 | 0.5 |
comparison_field_threshold_3 | 0.8 |
is_supervised | false |
lambda | 0.9 |
lower_bound | -13.7991041364018 |
m_0 | 0.9999999 |
m_1 | 0.333315282254649 |
m_2 | 0.999945434344193 |
m_3 | 0.9999999 |
mu | 0.9 |
p | 0.250013539042196 |
time_used | 271.020000 seconds |
u_0 | 0.777773766137301 |
u_1 | 1.0E-7 |
u_2 | 1.37483178043644E-7 |
u_3 | 0.88888688306865 |
upper_bound | 22.2091654088378 |