Teradata R Package Function Reference - StringSimilarity - Teradata R Package - Look here for syntax, methods and examples for the functions included in the Teradata R Package.

Teradata® R Package Function Reference

Product
Teradata R Package
Release Number
16.20
Published
February 2020
Language
English (United States)
Last Update
2020-02-28
dita:id
B700-4007
lifecycle
previous
Product Category
Teradata Vantage

Description

The StringSimilarity function calculates the similarity between two strings, using either the Jaro, Jaro-Winkler, N-Gram, or Levenshtein distance. The similarity is a value in the range [0, 1].
Note: This function is only available when tdplyr is connected to Vantage 1.1 or later versions.

Usage

  td_string_similarity_sqle (
      data = NULL,
      comparison.columns = NULL,
      case.sensitive = NULL,
      accumulate = NULL,
      data.order.column = NULL
  )

Arguments

data

Required Argument.
Specifies the relation that contains the string pairs to be compared.

data.order.column

Optional Argument.
Specifies Order By columns for data. Values to this argument can be provided as vector, if multiple columns are used for ordering.
Types: character OR vector of Strings (character)

comparison.columns

Required Argument.
Specifies pairs of input tbl_teradata columns that contain strings to be compared (column1 and column2), how to compare them (comparison type), and (optionally) a constant and the name of the output column for their similarity (output column). The similarity is a value in the range [0, 1]. For comparison type, use one of these values:

  1. "jaro": Jaro distance.

  2. "jaro_winkler": Jaro-Winkler distance (1 for an exact match, 0 otherwise). If you specify this comparison type, you can specify the value of factor p with constant. 0 <= p <= 0.25.
    Default: p = 0.1

  3. "n_gram": N-gram similarity. If you specify this comparison type, you can specify the value of N with a constant.

  4. "LD": Levenshtein distance. The number of edits needed to transform one string into the other. Possible edits are insertions, deletions, or substitutions of individual characters.

  5. "LDWS": Levenshtein distance without substitution. Number of edits needed to transform one string into the other using only insertions or deletions of individual characters.

  6. "OSA": Optimal string alignment distance. Number of edits needed to transform one string into the other. Possible edits are insertions, deletions, substitutions, or transpositions of characters. A substring can be edited only once.

  7. "DL": Damerau-Levenshtein distance. Similar to OSA, except that a substring can be edited any number of times.

  8. "hamming": Hamming distance. For strings of equal length, it is the number of positions where corresponding characters differ (that is, minimum number of substitutions needed to transform one string into the other). For strings of unequal length, -1.

  9. "LCS": The longest common substring. Length of longest substring common to both strings.

  10. "jaccard": Jaccard indexed-based comparison.

  11. "cosine": Cosine similarity.

  12. "soundexcode": Only for English strings: -1 if either string has a non-English character; otherwise, 1 if their soundex codes are the same and 0 otherwise.

You can specify a different comparison type for every pair of columns. The default output_column is "sim_i", where i is the sequence number of the column pair.
Types: character OR vector of characters

case.sensitive

Optional Argument.
Specifies whether string comparison is case-sensitive. You can specify one value for all pairs or one value for each pair. If you specify one value for each pair, then the ith value applies to the ith pair.
Default Value: FALSE
Types: logical OR vector of logicals

accumulate

Optional Argument.
Specifies the names of input tbl_teradata columns to be copied to the output table.
Types: character OR vector of Strings (character)

Value

Function returns an object of class "td_string_similarity_sqle" which is a named list containing Teradata tbl object.
Named list member can be referenced directly with the "$" operator using name: result.

Examples

    # Get the current context/connection
    con <- td_get_context()$connection
    
    # Load example data.
    loadExampleData("stringsimilarity_example", "strsimilarity_input")
    
    # Create remote tibble objects.
    strsimilarity_input <- tbl(con, "strsimilarity_input")
    
    # Example 1 - Using "jaro" comparison type with a default output column
    td_string_similarity_sqle_out <- td_string_similarity_sqle(data = strsimilarity_input,
                                                               case.sensitive = TRUE,
                                                               comparison.columns = c("jaro (src_text2, tar_text) AS jaro2_case_sim"),
                                                               accumulate = c("id","src_text1","tar_text")
                                                               )
    
    # Example 2 - Using multiple comparison types and with custom output columns
    td_string_similarity_sqle_out2 <- td_string_similarity_sqle(data = strsimilarity_input,
                                                          comparison.columns = c("jaro (src_text1, tar_text) AS jaro1_sim",
                                                          "LD (src_text1, tar_text, 2) AS ld1_sim",
                                                          "n_gram (src_text1, tar_text, 2) AS ngram1_sim",
                                                          "jaro_winkler (src_text1, tar_text, 0.2) AS jw1_sim"),
                                                          case.sensitive = TRUE,
                                                          accumulate = c("id","src_text1","tar_text")
                                                         )