FellegiSunter Example: Unsupervised Learning | Teradata Vantage - FellegiSunter Example: Unsupervised Learning - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product
Teradata Vantage
Release Number
9.02
9.01
2.0
1.3
Published
February 2022
Language
English (United States)
Last Update
2022-02-10
dita:mapPath
rnn1580259159235.ditamap
dita:ditavalPath
ybt1582220416951.ditaval
dita:id
B700-4003
lifecycle
previous
Product Category
Teradata Vantageā„¢

The unsupervised model is created by specifying the user defined threshold values for the different metrics in the ComparisonFields syntax element. The initialization parameters InitialP, InitialM, InitialU, Lambda, and Mu are set to their default values. The column match_tag is not used for unsupervised learning.

Input

  • InputTable: fstrainer_input, created by adding a match_tag column to the output of StringSimilarity_MLE Example: Compare src_text1 to tar_text:
    DROP TABLE fstrainer_input;
    
    CREATE MULTISET TABLE fstrainer_input AS (
      SELECT * FROM StringSimilarity_MLE (
        ON strsimilarity_input PARTITION BY ANY
        USING
        ComparisonColumnPairs (
                    'jaro (src_text1 , tar_text ) AS jaro1_sim',
                    'LD (src_text1 , tar_text, 2) AS ld1_sim',
                    'n_gram (src_text1 , tar_text, 2) AS ngram1_sim',
                    'jaro_winkler (src_text1 , tar_text, 2) AS jw1_sim'
        )
        USING
        CaseSensitive ('true')
        Accumulate ('id','src_text1','tar_text')
      ) AS dt1
    ) WITH DATA AS dt2 PARTITION BY id;
    
    ALTER TABLE fstrainer_input
    ADD column match_tag varchar;
    
    update fstrainer_input set match_tag=  'M' where id = 1;
    update fstrainer_input set match_tag=  'M' where id = 2;
    update fstrainer_input set match_tag=  'M' where id = 3;
    update fstrainer_input set match_tag=  'U' where id = 4;
    update fstrainer_input set match_tag=  'U' where id = 5;
    update fstrainer_input set match_tag=  'M' where id = 6;
    update fstrainer_input set match_tag=  'U' where id = 7;
    update fstrainer_input set match_tag=  'M' where id = 8;
    update fstrainer_input set match_tag=  'M' where id = 9;
    update fstrainer_input set match_tag=  'U' where id = 10;
    update fstrainer_input set match_tag=  'U' where id = 11;
    update fstrainer_input set match_tag=  'U' where id = 12;
    
    SELECT * FROM fstrainer_input ORDER BY 1;

    The match_tag column is used for the supervised FellegiSunter function.

SQL Call

DROP TABLE fg_unsupervised_model;

CREATE MULTISET TABLE "fg_unsupervised_model" AS (
  SELECT * FROM FellegiSunter (
    ON fstrainer_input AS InputTable
    USING
    ComparisonFields ('jaro1_sim: 0.8', 'ld1_sim:0.8', 'ngram1_sim:0.5',
                      'jw1_sim:0.8')
    InitialP (0.1)
    InitialM (0.9)
    InitialU (0.1)
    Lambda (0.9)
    Mu (0.9)
  ) AS dt
) WITH DATA;

Output

This query returns the following table:

SELECT * FROM fg_unsupervised_model ORDER BY 1;
 _key                         _value             
 ---------------------------- ------------------ 
 comparison_filed_cnt         4                 
 comparison_filed_name_0      jaro1_sim         
 comparison_filed_name_1      ld1_sim           
 comparison_filed_name_2      ngram1_sim        
 comparison_filed_name_3      jw1_sim           
 comparison_filed_threshold_0 0.8               
 comparison_filed_threshold_1 0.8               
 comparison_filed_threshold_2 0.5               
 comparison_filed_threshold_3 0.8               
 is_supervised                false             
 lambda                       0.9               
 lower_bound                  -13.7207908385366 
 mu                           0.9               
 m_0                          0.9999999         
 m_1                          0.333314834570551 
 m_2                          0.999944448092125 
 m_3                          0.9999999         
 p                            0.272742408918643 
 time_used                    104.089000 seconds
 upper_bound                  24.9296836401616  
 u_0                          0.749994796825927 
 u_1                          1.0E-7            
 u_2                          1.0E-7            
 u_3                          0.874997398412963

Download a zip file of all examples and a SQL script file that creates their input tables.