The unsupervised model is created by specifying the user defined threshold values for the different metrics in the ComparisonFields syntax element. The initialization parameters InitialP, InitialM, InitialU, Lambda, and Mu are set to their default values. The column match_tag is not used for unsupervised learning.
Input
- InputTable: fstrainer_input, created by adding a match_tag column to the output of StringSimilarity_MLE Example: Compare src_text1 to tar_text:
DROP TABLE fstrainer_input; CREATE MULTISET TABLE fstrainer_input AS ( SELECT * FROM StringSimilarity_MLE ( ON strsimilarity_input PARTITION BY ANY USING ComparisonColumnPairs ( 'jaro (src_text1 , tar_text ) AS jaro1_sim', 'LD (src_text1 , tar_text, 2) AS ld1_sim', 'n_gram (src_text1 , tar_text, 2) AS ngram1_sim', 'jaro_winkler (src_text1 , tar_text, 2) AS jw1_sim' ) USING CaseSensitive ('true') Accumulate ('id','src_text1','tar_text') ) AS dt1 ) WITH DATA AS dt2 PARTITION BY id; ALTER TABLE fstrainer_input ADD column match_tag varchar; update fstrainer_input set match_tag= 'M' where id = 1; update fstrainer_input set match_tag= 'M' where id = 2; update fstrainer_input set match_tag= 'M' where id = 3; update fstrainer_input set match_tag= 'U' where id = 4; update fstrainer_input set match_tag= 'U' where id = 5; update fstrainer_input set match_tag= 'M' where id = 6; update fstrainer_input set match_tag= 'U' where id = 7; update fstrainer_input set match_tag= 'M' where id = 8; update fstrainer_input set match_tag= 'M' where id = 9; update fstrainer_input set match_tag= 'U' where id = 10; update fstrainer_input set match_tag= 'U' where id = 11; update fstrainer_input set match_tag= 'U' where id = 12; SELECT * FROM fstrainer_input ORDER BY 1;
The match_tag column is used for the supervised FellegiSunter function.
SQL Call
DROP TABLE fg_unsupervised_model; CREATE MULTISET TABLE "fg_unsupervised_model" AS ( SELECT * FROM FellegiSunter ( ON fstrainer_input AS InputTable USING ComparisonFields ('jaro1_sim: 0.8', 'ld1_sim:0.8', 'ngram1_sim:0.5', 'jw1_sim:0.8') InitialP (0.1) InitialM (0.9) InitialU (0.1) Lambda (0.9) Mu (0.9) ) AS dt ) WITH DATA;
Output
This query returns the following table:
SELECT * FROM fg_unsupervised_model ORDER BY 1;
_key _value ---------------------------- ------------------ comparison_filed_cnt 4 comparison_filed_name_0 jaro1_sim comparison_filed_name_1 ld1_sim comparison_filed_name_2 ngram1_sim comparison_filed_name_3 jw1_sim comparison_filed_threshold_0 0.8 comparison_filed_threshold_1 0.8 comparison_filed_threshold_2 0.5 comparison_filed_threshold_3 0.8 is_supervised false lambda 0.9 lower_bound -13.7207908385366 mu 0.9 m_0 0.9999999 m_1 0.333314834570551 m_2 0.999944448092125 m_3 0.9999999 p 0.272742408918643 time_used 104.089000 seconds upper_bound 24.9296836401616 u_0 0.749994796825927 u_1 1.0E-7 u_2 1.0E-7 u_3 0.874997398412963
Download a zip file of all examples and a SQL script file that creates their input tables.