Description
The StringSimilarity function calculates the similarity between two strings
using one of the following measures: Jaro, Jaro-Winkler, N-Gram, or Levenshtein distance.
The similarity is a value in the range [0, 1].
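As an illustration of how an edit distance can be mapped into the range [0, 1], the following minimal sketch normalizes the Levenshtein distance by the length of the longer string. The helper lev_similarity and its normalization are assumptions made for illustration only; they are not part of tdplyr and may not match the formula the database function uses. It relies only on adist() from base R's utils package.

```r
# Hypothetical helper (not part of tdplyr): normalized Levenshtein similarity.
# Assumption: similarity = 1 - distance / length of the longer string.
lev_similarity <- function(a, b, case.sensitive = FALSE) {
  if (!case.sensitive) {
    # Fold case so that "Teradata" and "teradata" compare as equal.
    a <- tolower(a)
    b <- tolower(b)
  }
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) {
    return(1)  # two empty strings are treated as identical
  }
  # utils::adist() returns a matrix of generalized Levenshtein distances.
  d <- utils::adist(a, b)[1, 1]
  1 - d / longest
}

lev_similarity("kitten", "sitting")                            # 1 - 3/7
lev_similarity("Teradata", "teradata")                         # 1
lev_similarity("Teradata", "teradata", case.sensitive = TRUE)  # 1 - 1/8
```

Identical strings score 1 and strings with no characters in common score 0, so the result always falls in [0, 1].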
Usage
td_string_similarity_mle (
data = NULL,
comparison.columns = NULL,
case.sensitive = NULL,
accumulate = NULL,
data.sequence.column = NULL,
data.order.column = NULL
)
Arguments
data                    Required Argument.
data.order.column       Optional Argument.
comparison.columns      Required Argument.
case.sensitive          Optional Argument.
accumulate              Optional Argument.
data.sequence.column    Optional Argument.
Value
The function returns an object of class "td_string_similarity_mle", which
is a named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator
using the name: result.
Examples
# Get the current context/connection
con <- td_get_context()$connection
# Load example data.
loadExampleData("stringsimilarity_example", "strsimilarity_input")
# Create object(s) of class "tbl_teradata".
strsimilarity_input <- tbl(con, "strsimilarity_input")
# Using "jaro" comparison type with a default output column
td_string_similarity_out1 <- td_string_similarity_mle(
  data = strsimilarity_input,
  comparison.columns = "jaro (src_text1, tar_text)",
  accumulate = c("id", "src_text1", "tar_text")
)
# Using multiple comparison types and with custom output columns
comp.columns <- c("jaro (src_text1, tar_text) AS jaro1_sim",
                  "LD (src_text1, tar_text, 2) AS ld1_sim",
                  "n_gram (src_text1, tar_text, 2) AS ngram1_sim",
                  "jaro_winkler (src_text1, tar_text, 2) AS jw1_sim")
td_string_similarity_out2 <- td_string_similarity_mle(
  data = strsimilarity_input,
  comparison.columns = comp.columns,
  case.sensitive = TRUE,
  accumulate = c("id", "src_text1", "tar_text")
)
# Using a vector for "case.sensitive" comparisons.
# Note: The length of the "case.sensitive" vector must match the length of the
# "comparison.columns" vector.
comp.columns <- c("jaro (src_text2, tar_text) AS jaro2_case_sim",
                  "jaro (src_text2, tar_text) AS jaro2_nocase_sim")
td_string_similarity_out3 <- td_string_similarity_mle(
  data = strsimilarity_input,
  comparison.columns = comp.columns,
  case.sensitive = c(TRUE, FALSE),
  accumulate = c("id", "src_text2", "tar_text")
)