The objective of the example is to find the credit scores of job applicants by matching the information in two tables.
The example looks for exact matches (NominalMatchColumns) to the email address and approximate matches (FuzzyMatchColumns) for lastname, firstname, zip code, city, and department, with different match metrics and match weights.
Input
- a: applicant_reference, which has hypothetical information from job applicants
- b: applicant_external, an external source table, which has missing and incomplete information, but includes credit scores
id | firstname | lastname | city | zipcode | department | gender | |
---|---|---|---|---|---|---|---|
1 | John | Dewey | john.dewey@corp-mark.com | Sugar Land | 77459 | Marketing | Male |
2 | Sarah | Anders | sarah.anders@corp-sales.com | Pearland | 77584 | Sales | Female |
3 | Elizabeth | Hall | elizabeth.hall@corp-eng.com | Galveston | 77550 | Engineering | Female |
4 | James | Nickson | james.nick@corp-it.com | Pasadena | 77501 | IT | Male |
5 | Kim | Lee | kim.lee@corp-sys.com | Clear Lake City | 77058 | Systems | Female |
6 | Jessica | Right | jessica.right@corp-mark.com | Sugar Land | 77459 | Marketing | Female |
id | firstname | lastname | city | zipcode | department | creditscore | |
---|---|---|---|---|---|---|---|
1 | John | Dewey | john.dewey@corp-mark.com | Sugar Land | 7774 | market | 700 |
2 | Hall | Galveston | 77550 | eng | 790 | ||
3 | Sarah | Anders | sarah.anders@corp-sales.com | pear | 77584 | sales | 650 |
4 | Jessica | right | Sugar Land | 77459 | Marketing | 690 | |
5 | James | Nickson | Pasadena | 7750 | IT | 620 | |
6 | Kim | 77058 | system | 570 |
SQL Call
SELECT * FROM IdentityMatch ( ON applicant_reference AS a PARTITION BY ANY ON applicant_external AS b DIMENSION USING IDColumn ('a.id: b.id') NominalMatchColumns ('a.email: b.email') FuzzyMatchColumns ('a.lastname: b.lastname, JARO-WINKLER, 3', 'a.firstname: b.firstname, JARO-WINKLER, 2', 'a.zipcode: b.zipcode, JD, 2', 'a.city: b.city, LD, 2', 'a.department: b.department, COSINE, 1') Accumulate ('a.firstname','a.lastname' ,'b.lastname', 'a.email', 'b.email','a.zipcode', 'b.zipcode', 'a.department', 'b.department','b.creditscore') ThresholdScore (0.5) ) AS dt ORDER BY "a.id", score DESC;
Output
The output table has the matching information from both input tables and the matching score. If multiple row entries exist for the same id, the higher scoring entry gives the best match. For example, the first row (with score 1) is a better match than the second (with score 0.6036). The creditscore column contains applicant credit scores.
a.id | a.firstname | a.lastname | a.email | a.zipcode | a.department | b.id | b.lastname | b.email | b.zipcode | b.department | b.creditscore | score |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | John | Dewey | john.dewey@corp-mark.com | 77459 | Marketing | 1 | Dewey | john.dewey@corp-mark.com | 7774 | market | 700 | 1.0000 |
1 | John | Dewey | john.dewey@corp-mark.com | 77459 | Marketing | 4 | right | 77459 | Marketing | 690 | 0.6036 | |
2 | Sarah | Anders | sarah.anders@corp-sales.com | 77584 | Sales | 3 | Anders | sarah.anders@corp-sales.com | 77584 | sales | 650 | 1.0000 |
3 | Elizabeth | Hall | elizabeth.hall@corp-eng.com | 77550 | Engineering | 2 | Hall | 77550 | eng | 790 | 0.7000 | |
4 | James | Nickson | james.nick@corp-it.com | 77501 | IT | 5 | Nickson | 7750 | IT | 620 | 0.8000 | |
6 | Jessica | Right | jessica.right@corp-mark.com | 77459 | Marketing | 4 | right | 77459 | Marketing | 690 | 1.0000 |