Teradata R Package Function Reference - 16.20 - StringSimilarity - Teradata R Package

Teradata® R Package Function Reference

prodname
Teradata R Package
vrm_release
16.20
created_date
February 2020
category
Programming Reference
featnum
B700-4007-098K

Description

The StringSimilarity function calculates the similarity between two strings, using either the Jaro, Jaro-Winkler, N-Gram, or Levenshtein distance. The similarity is a value in the range [0, 1].

Usage

  td_string_similarity_mle (
      data = NULL,
      comparison.columns = NULL,
      case.sensitive = NULL,
      accumulate = NULL,
      data.sequence.column = NULL
  )

Arguments

data

Required Argument.
The relation that contains the string pairs to be compared.

comparison.columns

Required Argument. Syntax: comparison_type (column1, column2 [, constant]) AS output_column.
Specifies pairs of columns (column1 and column2) in the input argument "data", that contain strings to be compared, how to compare them (comparison_type), and (optionally) a constant and the name of the output column for their similarity (output_column). The similarity is a value in the range [0, 1].
For comparison_type, use one of these values:

  1. "jaro": Jaro distance

  2. "jaro_winkler": Jaro-Winkler distance (1 for an exact match, 0 otherwise)

  3. "n-gram": N-gram similarity. If you specify this comparison type, you can specify the value of N with constant.

  4. "LD": Levenshtein distance (the number of edits needed to transform one string into the other, where edits include insertions, deletions, or substitutions of individual characters).

You can specify a different comparison_type for every pair of columns.
The default output_column is "sim_i", where i is the sequence number of the column pair.

case.sensitive

Optional Argument.
Specifies whether string comparison is case-sensitive. You can specify either one value for all pairs or one value for each pair. If you specify one value for each pair, then the ith value applies to the ith pair. Default Value: FALSE

accumulate

Optional Argument.
Specifies the names of columns in the input argument "data" to be copied to the output table.

data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.

Value

Function returns an object of class "td_string_similarity_mle" which is a named list containing Teradata tbl object.
Named list member can be referenced directly with the "$" operator using name: result.

Examples

    # Get the current context/connection
    con <- td_get_context()$connection
    
    # Load example data.
    loadExampleData("stringsimilarity_example", "strsimilarity_input")
    
    # Create remote tibble objects.
    strsimilarity_input <- tbl(con, "strsimilarity_input")
    
    # Using "jaro" comparison type with a default output column
    td_string_similarity_out1 <- td_string_similarity_mle(data = strsimilarity_input,
                                          comparison.columns = "jaro (src_text1, tar_text)",
                                          accumulate = c("id","src_text1","tar_text")
                                         )

    # Using multiple comparison types and with custom output columns
    td_string_similarity_out2 <- td_string_similarity_mle(data = strsimilarity_input,
                                          comparison.columns = c("jaro (src_text1, tar_text) AS jaro1_sim",
                                          "LD (src_text1, tar_text, 2) AS ld1_sim",
                                          "n_gram (src_text1, tar_text, 2) AS ngram1_sim",
                                          "jaro_winkler (src_text1, tar_text, 2) AS jw1_sim"),
                                          case.sensitive = TRUE,
                                          accumulate = c("id","src_text1","tar_text")
                                         )


    # Using a vector for case.sensitive comparisons.
    # Note: The length of the case.sensitive vector must match the comparison.columns vector argument.
    td_string_similarity_out3 <- td_string_similarity_mle(data = strsimilarity_input,
                                          comparison.columns = c("jaro (src_text2, tar_text) AS jaro2_case_sim",
                                          "jaro (src_text2, tar_text) AS jaro2_nocase_sim"),
                                          case.sensitive = c(TRUE, FALSE),
                                          accumulate = c("id","src_text2","tar_text")
                                         )