Description
The StringSimilarity function calculates the similarity between two
strings, using one of the following measures: Jaro, Jaro-Winkler,
N-Gram, or Levenshtein distance. The similarity is a value in the
range [0, 1].
Note: This function is only available when tdplyr is connected to Vantage 1.1
or later.
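To make the [0, 1] range concrete, the following is a minimal local sketch of a normalized Levenshtein similarity in base R. It does not call Vantage. The helper name levenshtein_similarity and the normalization (1 minus the edit distance divided by the longer string's length) are assumptions made for illustration, and the function on Vantage may normalize differently.

```r
# Local, base-R sketch of a normalized Levenshtein similarity.
# Assumption: similarity = 1 - distance / max(length of a, length of b).
# levenshtein_similarity() is a hypothetical helper, not part of tdplyr.
levenshtein_similarity <- function(a, b, case.sensitive = FALSE) {
  if (!case.sensitive) {
    a <- tolower(a)
    b <- tolower(b)
  }
  # utils::adist() returns the generalized Levenshtein (edit) distance.
  d <- as.numeric(utils::adist(a, b))
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) return(1)  # two empty strings are treated as identical
  1 - d / longest
}

levenshtein_similarity("kitten", "sitting")                         # 1 - 3/7
levenshtein_similarity("Vantage", "vantage")                        # 1
levenshtein_similarity("Vantage", "vantage", case.sensitive = TRUE) # 1 - 1/7
```

Identical strings score 1, and strings with no characters in common score 0 under this normalization.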
Usage
td_string_similarity_sqle (
    data = NULL,
    comparison.columns = NULL,
    case.sensitive = NULL,
    accumulate = NULL,
    data.order.column = NULL
)
Arguments
data
    Required Argument.
data.order.column
    Optional Argument.
comparison.columns
    Required Argument.
    You can specify a different comparison type for every pair of
    columns. The default output_column is "sim_i", where i is the
    sequence number of the column pair.
case.sensitive
    Optional Argument.
accumulate
    Optional Argument.
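Each element of comparison.columns in the Examples follows the shape "type (column1, column2[, constant]) AS output_column". The sketch below builds such strings in plain R. The helper make_comparison is hypothetical (it is not part of tdplyr), and the string shape is inferred from the Examples in this page rather than from a formal grammar.

```r
# Hypothetical helper (not part of tdplyr): formats one comparison
# specification in the shape used by the Examples in this page.
make_comparison <- function(type, col1, col2, output, constant = NULL) {
  # The optional constant is appended only when it is supplied.
  args <- c(col1, col2, if (!is.null(constant)) as.character(constant))
  sprintf("%s (%s) AS %s", type, paste(args, collapse = ", "), output)
}

comparison.columns <- c(
  make_comparison("jaro", "src_text1", "tar_text", "jaro1_sim"),
  make_comparison("LD", "src_text1", "tar_text", "ld1_sim", constant = 2)
)
print(comparison.columns)
```

The resulting character vector can be passed as the comparison.columns argument in the same way as the literal strings in the Examples.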
Value
The function returns an object of class "td_string_similarity_sqle", which
is a named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator
using the name: result.
Examples
# Get the current context/connection
con <- td_get_context()$connection
# Load example data.
loadExampleData("stringsimilarity_example", "strsimilarity_input")
# Create object(s) of class "tbl_teradata".
strsimilarity_input <- tbl(con, "strsimilarity_input")
# Example 1 - Using the "jaro" comparison type with case-sensitive matching
# and a custom output column (jaro2_case_sim).
td_string_similarity_sqle_out <- td_string_similarity_sqle(
    data = strsimilarity_input,
    case.sensitive = TRUE,
    comparison.columns = c("jaro (src_text2, tar_text) AS jaro2_case_sim"),
    accumulate = c("id", "src_text1", "tar_text")
)
# Example 2 - Using multiple comparison types with custom output columns.
td_string_similarity_sqle_out2 <- td_string_similarity_sqle(
    data = strsimilarity_input,
    comparison.columns = c("jaro (src_text1, tar_text) AS jaro1_sim",
                           "LD (src_text1, tar_text, 2) AS ld1_sim",
                           "n_gram (src_text1, tar_text, 2) AS ngram1_sim",
                           "jaro_winkler (src_text1, tar_text, 0.2) AS jw1_sim"),
    case.sensitive = TRUE,
    accumulate = c("id", "src_text1", "tar_text")
)