The objective of the example is to find the credit scores of job applicants by matching the information in two tables.
The example looks for exact matches (NominalMatchColumns) to the email address and approximate matches (FuzzyMatchColumns) for lastname, firstname, zip code, city, and department, with different match metrics and match weights.
Input
- SourceTable: applicant_reference, which has hypothetical information from job applicants
- ReferenceTable: applicant_external, an external source table, which has missing and incomplete information, but includes credit scores
id | firstname | lastname | city | zipcode | department | gender | |
---|---|---|---|---|---|---|---|
1 | John | Dewey | john.dewey@corp-mark.com | Sugar Land | 77459 | Marketing | Male |
2 | Sarah | Anders | sarah.anders@corp-sales.com | Pearland | 77584 | Sales | Female |
3 | Elizabeth | Hall | elizabeth.hall@corp-eng.com | Galveston | 77550 | Engineering | Female |
4 | James | Nickson | james.nick@corp-it.com | Pasadena | 77501 | IT | Male |
5 | Kim | Lee | kim.lee@corp-sys.com | Clear Lake City | 77058 | Systems | Female |
6 | Jessica | Right | jessica.right@corp-mark.com | Sugar Land | 77459 | Marketing | Female |
id | firstname | lastname | city | zipcode | department | creditscore | |
---|---|---|---|---|---|---|---|
1 | John | Dewey | john.dewey@corp-mark.com | Sugar Land | 7774 | market | 700 |
2 | Hall | Galveston | 77550 | eng | 790 | ||
3 | Sarah | Anders | sarah.anders@corp-sales.com | pear | 77584 | sales | 650 |
4 | Jessica | right | Sugar Land | 77459 | Marketing | 690 | |
5 | James | Nickson | Pasadena | 7750 | IT | 620 | |
6 | Kim | 77058 | system | 570 |
SQL Call
SELECT * FROM IdentityMatch ( ON applicant_reference AS SourceTable PARTITION BY ANY ON applicant_external AS ReferenceTable DIMENSION USING IDColumn ('a.id: b.id') NominalMatchColumns ('a.email: b.email') FuzzyMatchColumns ('a.lastname: b.lastname, JARO-WINKLER, 3', 'a.firstname: b.firstname, JARO-WINKLER, 2', 'a.zipcode: b.zipcode, JD, 2', 'a.city: b.city, LD, 2', 'a.department: b.department, COSINE, 1') Accumulate ('a.firstname','a.lastname' ,'b.lastname', 'a.email', 'b.email','a.zipcode', 'b.zipcode', 'a.department', 'b.department','b.creditscore') ThresholdScore (0.5) ) AS dt;
Output
The output table has the matching information from both input tables and the matching score. If multiple row entries exist for the same id, the higher scoring entry gives the best match. For example, the first row (with score 1) is a better match than the second (with score 0.6036). The creditscore column contains applicant credit scores.
a.id a.firstname a.lastname a.email a.zipcode a.department b.id b.lastname b.email b.zipcode b.department b.creditscore score ---- ----------- ---------- --------------------------- --------- ------------ ---- ---------- --------------------------- --------- ------------ ------------- ------ 6 jessica right jessica.right@corp-mark.com 77459 marketing 4 right 77459 marketing 690.0 1 4 james nickson james.nick@corp-it.com 77501 it 5 nickson 7750 it 620.0 0.8 3 elizabeth hall elizabeth.hall@corp-eng.com 77550 engineering 2 hall 77550 eng 790.0 0.7 1 john dewey john.dewey@corp-mark.com 77459 marketing 4 right 77459 marketing 690.0 0.6036 1 john dewey john.dewey@corp-mark.com 77459 marketing 1 dewey john.dewey@corp-mark.com 7774 market 700.0 1 2 sarah anders sarah.anders@corp-sales.com 77584 sales 3 anders sarah.anders@corp-sales.com 77584 sales 650.0 1
Download a zip file of all examples and a SQL script file that creates their input tables.