Teradata Package for R Function Reference | 17.00 - StringSimilarity - Teradata Package for R - Look here for syntax, methods and examples for the functions included in the Teradata Package for R.

Teradata® Package for R Function Reference

Product
Teradata Package for R
Release Number
17.00
Published
July 2021
Language
English (United States)
Last Update
2023-08-08
dita:id
B700-4007
NMT
no
Product Category
Teradata Vantage
StringSimilarity

Description

The StringSimilarity function calculates the similarity between two strings, using either the Jaro, Jaro-Winkler, N-Gram, or Levenshtein distance. The similarity is a value in the range [0, 1].

Usage

  td_string_similarity_mle (
    data = NULL,
    comparison.columns = NULL,
    case.sensitive = NULL,
    accumulate = NULL,
    data.sequence.column = NULL,
    data.order.column = NULL
  )

Arguments

data

Required Argument.
Specifies the input tbl_teradatda object that contains the string pairs to be compared.

data.order.column

Optional Argument.
Specifies Order By columns for "data".
Values to this argument can be provided as a vector, if multiple columns are used for ordering.
Types: character OR vector of Strings (character)

comparison.columns

Required Argument.
Syntax: comparison_type (column1, column2 [, constant]) AS output_column.
Specifies pairs of columns (column1 and column2) in the input argument "data", that contain strings to be compared, how to compare them (comparison_type), and (optionally) a constant and the name of the output column for their similarity (output_column). The similarity is a value in the range [0, 1].
For comparison_type, use one of these values:

  1. "jaro": Jaro distance.

  2. "jaro_winkler": Jaro-Winkler distance (1 for an exact match, 0 otherwise).

  3. "n-gram": N-gram similarity. If you specify this comparison type, you can specify the value of N with constant.

  4. "LD": Levenshtein distance (the number of edits needed to transform one string into the other, where edits include insertions, deletions, or substitutions of individual characters).

  5. "LDWS": Levenshtein distance without substitution. Number of edits needed to transform one string into the other using only insertions or deletions of individual characters.

  6. "OSA": Optimal string alignment distance. Number of edits needed to transform one string into the other. Edits are insertions, deletions, substitutions, or transpositions of characters. A substring can be edited only once.

  7. "DL": Damerau-Levenshtein distance. Like OSA, except that a substring can be edited any number of times.

  8. "hamming": Hamming distance. For strings of equal length, number of positions where corresponding characters differ (that is, minimum number of substitutions needed to transform one string into the other). For strings of unequal length, -1.

  9. "LCS": Longest common substring. Length of longest substring common to both strings.

  10. "jaccard": Jaccard indexed-based comparison.

  11. "cosine": Cosine similarity.

  12. "soundexcode": Only for English strings: -1 if either string has a non-English character; otherwise, 1 if their soundex codes are the same and 0 otherwise.


You can specify a different comparison_type for every pair of columns.
The default output_column is "sim_i", where i is the sequence number of the column pair. Types: character OR vector of characters

case.sensitive

Optional Argument.
Specifies whether string comparison is case-sensitive. You can specify either one value for all pairs or one value for each pair. If you specify one value for each pair, then the ith value applies to the ith pair.
Default Value: FALSE
Types: logical OR vector of logicals

accumulate

Optional Argument.
SSpecifies the names of columns in the input argument "data" to be copied to the output tbl_teradata.
Types: character OR vector of Strings (character)

data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

Value

Function returns an object of class "td_string_similarity_mle" which is a named list containing object of class "tbl_teradata".
Named list member can be referenced directly with the "$" operator using name: result.

Examples

  
    # Get the current context/connection
    con <- td_get_context()$connection
    
    # Load example data.
    loadExampleData("stringsimilarity_example", "strsimilarity_input")

    # Create object(s) of class "tbl_teradata".
    strsimilarity_input <- tbl(con, "strsimilarity_input")

    # Using "jaro" comparison type with a default output column
    td_string_similarity_out1 <- td_string_similarity_mle(data = strsimilarity_input,
                                          comparison.columns = "jaro (src_text1, tar_text)",
                                          accumulate = c("id","src_text1","tar_text")
                                         )

    # Using multiple comparison types and with custom output columns
    comp.columns <- c("jaro (src_text1, tar_text) AS jaro1_sim", 
                      "LD (src_text1, tar_text, 2) AS ld1_sim",
                      "n_gram (src_text1, tar_text, 2) AS ngram1_sim",
                      "jaro_winkler (src_text1, tar_text, 2) AS jw1_sim")
    td_string_similarity_out2 <- td_string_similarity_mle(data = strsimilarity_input,
                                                          comparison.columns = comp.columns,
                                                          case.sensitive = TRUE,
                                                          accumulate = c("id","src_text1",
                                                                         "tar_text")
                                                          )


    # Using a vector for "case.sensitive" comparisons.
    # Note: The length of the "case.sensitive" vector must match the "comparison.columns" 
    # vector argument.
    comp.columns <- c("jaro (src_text2, tar_text) AS jaro2_case_sim",
                      "jaro (src_text2, tar_text) AS jaro2_nocase_sim")
    td_string_similarity_out3 <- td_string_similarity_mle(data = strsimilarity_input,
                                                          comparison.columns = comp.columns,
                                                          case.sensitive = c(TRUE, FALSE),
                                                          accumulate = c("id","src_text2",
                                                                         "tar_text")
                                                          )