Description
The StringSimilarity function calculates the similarity between two strings,
using either the Jaro, Jaro-Winkler, N-Gram, or Levenshtein distance.
The similarity is a value in the range [0, 1].
Usage
td_string_similarity_mle (
data = NULL,
comparison.columns = NULL,
case.sensitive = NULL,
accumulate = NULL,
data.sequence.column = NULL,
data.order.column = NULL
)
Arguments
data |
Required Argument.
Specifies the input tbl_teradatda object that contains the string pairs to be compared.
|
data.order.column |
Optional Argument.
Specifies Order By columns for "data".
Values to this argument can be provided as a vector, if multiple
columns are used for ordering.
Types: character OR vector of Strings (character)
|
comparison.columns |
Required Argument.
Syntax: comparison_type (column1, column2 [, constant]) AS output_column.
Specifies pairs of columns (column1 and column2) in the
input argument "data", that contain strings to be compared, how to compare them
(comparison_type), and (optionally) a constant and the name of the
output column for their similarity (output_column). The similarity is
a value in the range [0, 1].
For comparison_type, use one of these values:
"jaro": Jaro distance.
"jaro_winkler": Jaro-Winkler distance (1 for an exact match, 0 otherwise).
"n-gram": N-gram similarity. If you specify this comparison type, you
can specify the value of N with constant.
"LD": Levenshtein distance (the number of
edits needed to transform one string into the other, where edits
include insertions, deletions, or substitutions of individual
characters).
"LDWS": Levenshtein distance without substitution.
Number of edits needed to transform one string into the other using
only insertions or deletions of individual characters.
"OSA": Optimal string alignment distance. Number of edits needed to
transform one string into the other. Edits are insertions, deletions,
substitutions, or transpositions of characters. A substring can be
edited only once.
"DL": Damerau-Levenshtein distance. Like OSA, except that a
substring can be edited any number of times.
"hamming": Hamming distance. For strings of equal length, number
of positions where corresponding characters differ (that is, minimum number of
substitutions needed to transform one string into the other). For
strings of unequal length, -1.
"LCS": Longest common substring. Length of longest substring common to
both strings.
"jaccard": Jaccard indexed-based comparison.
"cosine": Cosine similarity.
"soundexcode": Only for English strings: -1 if either string has a
non-English character; otherwise, 1 if their soundex codes are the
same and 0 otherwise.
You can specify a different comparison_type for every
pair of columns.
The default output_column is "sim_i", where i is the
sequence number of the column pair.
Types: character OR vector of characters
|
case.sensitive |
Optional Argument.
Specifies whether string comparison is case-sensitive. You can specify either
one value for all pairs or one value for each pair. If you specify one value for
each pair, then the ith value applies to the ith pair.
Default Value: FALSE
Types: logical OR vector of logicals
|
accumulate |
Optional Argument.
SSpecifies the names of columns in the input argument "data" to be copied to the
output tbl_teradata.
Types: character OR vector of Strings (character)
|
data.sequence.column |
Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row
of the input argument "data". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: character OR vector of Strings (character)
|
Value
Function returns an object of class "td_string_similarity_mle" which
is a named list containing object of class "tbl_teradata".
Named list member can be referenced directly with the "$" operator
using name: result.
Examples
# Get the current context/connection
con <- td_get_context()$connection
# Load example data.
loadExampleData("stringsimilarity_example", "strsimilarity_input")
# Create object(s) of class "tbl_teradata".
strsimilarity_input <- tbl(con, "strsimilarity_input")
# Using "jaro" comparison type with a default output column
td_string_similarity_out1 <- td_string_similarity_mle(data = strsimilarity_input,
comparison.columns = "jaro (src_text1, tar_text)",
accumulate = c("id","src_text1","tar_text")
)
# Using multiple comparison types and with custom output columns
comp.columns <- c("jaro (src_text1, tar_text) AS jaro1_sim",
"LD (src_text1, tar_text, 2) AS ld1_sim",
"n_gram (src_text1, tar_text, 2) AS ngram1_sim",
"jaro_winkler (src_text1, tar_text, 2) AS jw1_sim")
td_string_similarity_out2 <- td_string_similarity_mle(data = strsimilarity_input,
comparison.columns = comp.columns,
case.sensitive = TRUE,
accumulate = c("id","src_text1",
"tar_text")
)
# Using a vector for "case.sensitive" comparisons.
# Note: The length of the "case.sensitive" vector must match the "comparison.columns"
# vector argument.
comp.columns <- c("jaro (src_text2, tar_text) AS jaro2_case_sim",
"jaro (src_text2, tar_text) AS jaro2_nocase_sim")
td_string_similarity_out3 <- td_string_similarity_mle(data = strsimilarity_input,
comparison.columns = comp.columns,
case.sensitive = c(TRUE, FALSE),
accumulate = c("id","src_text2",
"tar_text")
)