Teradata R Package Function Reference | 17.00 - 17.00 - StringSimilarity - Teradata R Package

Teradata® R Package Function Reference

prodname
Teradata R Package
vrm_release
17.00
created_date
September 2020
category
Programming Reference
featnum
B700-4007-090K

Description

The StringSimilarity function calculates the similarity between two strings, using either the Jaro, Jaro-Winkler, N-Gram, or Levenshtein distance. The similarity is a value in the range [0, 1].
Note: This function is only available when tdplyr is connected to Vantage 1.1 or later versions.

Usage

  td_string_similarity_sqle (
      data = NULL,
      comparison.columns = NULL,
      case.sensitive = NULL,
      accumulate = NULL,
      data.order.column = NULL
  )

Arguments

data

Required Argument.
Specifies the relation that contains the string pairs to be compared.

data.order.column

Optional Argument.
Specifies Order By columns for "data".
Values to this argument can be provided as a vector, if multiple columns are used for ordering.
Types: character OR vector of Strings (character)

comparison.columns

Required Argument.
Specifies pairs of input tbl_teradata columns that contain strings to be compared (column1 and column2), how to compare them (comparison type), and (optionally) a constant and the name of the output column for their similarity (output column). The similarity is a value in the range [0, 1]. For comparison type, use one of these values:

  1. "jaro": Jaro distance.

  2. "jaro_winkler": Jaro-Winkler distance (1 for an exact match, 0 otherwise). If you specify this comparison type, you can specify the value of factor p with constant. 0 ≤ p ≤ 0.25. Default: p = 0.1.

  3. "n_gram": N-gram similarity. If you specify this comparison type, you can specify the value of N with a constant.

  4. "LD": Levenshtein distance. (the number of edits needed to transform one string into the other, Possible edits are insertions, deletions, or substitutions of individual characters.

  5. "LDWS": Levenshtein distance without substitution. Number of edits needed to transform one string into the other using only insertions or deletions of individual characters.

  6. "OSA": Optimal string alignment distance. Number of edits needed to transform one string into the other. Possible edits are insertions, deletions, substitutions, or transpositions of characters. A substring can be edited only once.

  7. "DL": Damerau-Levenshtein distance. Similar to OSA, except that a substring can be edited any number of times.

  8. "hamming": Hamming distance. For strings of equal length, it is the number of positions where corresponding characters differ (that is, minimum number of substitutions needed to transform one string into the other). For strings of unequal length, -1.

  9. "LCS": The longest common substring. Length of longest substring common to both strings.

  10. "jaccard": Jaccard indexed-based comparison.

  11. "cosine": Cosine similarity.

  12. "soundexcode": Only for English strings: -1 if either string has a non-English character; otherwise, 1 if their soundex codes are the same and 0 otherwise.

You can specify a different comparison type for every pair of columns. The default output_column is "sim_i", where i is the sequence number of the column pair.
Types: character OR vector of characters

case.sensitive

Optional Argument.
Specifies whether string comparison is case-sensitive. You can specify one value for all pairs or one value for each pair. If you specify one value for each pair, then the ith value applies to the ith pair.
Default Value: FALSE
Types: logical OR vector of logicals

accumulate

Optional Argument.
Specifies the names of input tbl_teradata columns to be copied to the output tbl_teradata
Types: character OR vector of Strings (character)

Value

Function returns an object of class "td_string_similarity_sqle" which is a named list containing object of class "tbl_teradata".
Named list member can be referenced directly with the "$" operator using the name: result.

Examples

    # Get the current context/connection
    con <- td_get_context()$connection
    
    # Load example data.
    loadExampleData("stringsimilarity_example", "strsimilarity_input")
    
    # Create object(s) of class "tbl_teradata".
    strsimilarity_input <- tbl(con, "strsimilarity_input")
    
    # Example 1 - Using "jaro" comparison type with a default output column.
    td_string_similarity_sqle_out <- td_string_similarity_sqle(data = strsimilarity_input,
                                                               case.sensitive = TRUE,
                                                               comparison.columns = c("jaro (src_text2, tar_text) AS jaro2_case_sim"),
                                                               accumulate = c("id","src_text1","tar_text")
                                                               )
    
    # Example 2 - Using multiple comparison types and with custom output columns.
    td_string_similarity_sqle_out2 <- td_string_similarity_sqle(data = strsimilarity_input,
                                                          comparison.columns = c("jaro (src_text1, tar_text) AS jaro1_sim",
                                                          "LD (src_text1, tar_text, 2) AS ld1_sim",
                                                          "n_gram (src_text1, tar_text, 2) AS ngram1_sim",
                                                          "jaro_winkler (src_text1, tar_text, 0.2) AS jw1_sim"),
                                                          case.sensitive = TRUE,
                                                          accumulate = c("id","src_text1","tar_text")
                                                         )