StringSimilarity Syntax Elements - Teradata® Database

Database Analytic Functions

Product
Teradata® Database
Release Number
17.10
Published
July 2021
Language
English (United States)
Last Update
2021-07-28
dita:mapPath
Teradata_Vantage™___Advanced_SQL_Engine_Analytic_Functions.withLogo_upload_July2021/wnd1589838592459.ditamap
dita:ditavalPath
Teradata_Vantage™___Advanced_SQL_Engine_Analytic_Functions.withLogo_upload_July2021/ayr1485454803741.ditaval
dita:id
B035-1206
lifecycle
previous
Product Category
Teradata Vantage™
ComparisonColumnPairs
Specify the names of the input table columns that contain strings to compare (column1 and column2), how to compare them (comparison_type), and (optionally) a constant and the name of the output column for their similarity (output_column). The similarity is a value in the range [0, 1].
For column1 and column2:
  • If column1 or column2 includes any special characters (that is, characters other than letters, digits, or underscore (_)), surround the column name with double quotation marks. For example, if column1 and column2 are c(col1) and c(col2), respectively, specify them as "c(col1)" and "c(col2)".
    If column1 or column2 includes double quotation marks, replace each double quotation mark with a pair of double quotation marks. For example, if column1 and column2 are c1"c and c2"c, respectively, specify them as "c1""c" and "c2""c".
    These rules do not apply to output_column. For example, this is valid syntax: ComparisonColumnPairs ('jaro ("c1""c", "c2""c") AS out"col')
  • If column1 or column2 supports more than 200 characters, you can cast it to VARCHAR(200), as in the following example; however, the string may be truncated. For information about the CAST operation, see Teradata Vantage™ - SQL Functions, Expressions, and Predicates, B035-1145.
    SELECT * FROM StringSimilarity (
      ON (
        SELECT id, CAST(a AS VARCHAR(200)) AS a, CAST(b AS VARCHAR(200)) AS b
        FROM max_varchar_strlen
      ) PARTITION BY ANY
      USING
      ComparisonColumnPairs ('ld(a,b) AS sim_fn')
      Accumulate ('id')
    ) AS dt ORDER BY 1;
    
For comparison_type, use one of these values:
comparison_type Description
'jaro' Jaro distance.
'jaro_winkler' Jaro-Winkler distance: 1 for an exact match, 0 otherwise. If you specify this comparison type, you can specify the value of factor p with constant. 0 ≤ p ≤ 0.25. Default: p = 0.1
'n_gram' N-gram similarity. If you specify this comparison type, you can specify the value of N with constant. Default: N = 2
'LD' Levenshtein distance: Number of edits needed to transform one string into the other. Edits are insertions, deletions, or substitutions of individual characters.
'LDWS' Levenshtein distance without substitution: Number of edits needed to transform one string into the other using only insertions or deletions of individual characters.
'OSA' Optimal string alignment distance: Number of edits needed to transform one string into the other. Edits are insertions, deletions, substitutions, or transpositions of characters. A substring can be edited only once.
'DL' Damerau-Levenshtein distance: Like 'OSA' except that a substring can be edited any number of times.
'hamming' Hamming distance: For strings of equal length, number of positions where corresponding characters differ (that is, minimum number of substitutions needed to transform one string into the other). For strings of unequal length, -1.
'LCS' Longest common substring: Length of longest substring common to both strings.
'jaccard' Jaccard indexed-based comparison.
'cosine' Cosine similarity.
'soundexcode' Only for English strings: -1 if either string has a non-English character; otherwise, 1 if their soundex codes are the same and 0 otherwise.
The function ignores constant for every comparison_type except 'jaro_winkler' and 'n_gram'.
You can specify a different comparison_type for every pair of columns.
Default: output_column is 'sim_i', where i is the sequence number of the column pair.
CaseSensitive
[Optional] Specify whether string comparison is case-sensitive. You can specify either one value for all pairs or one value for each pair. If you specify one value for each pair, the ith value applies to the ith pair.
Default: 'false'
Accumulate
[Optional] Specify the names of input table columns to copy to the output table.