| |
Methods defined here:
- __init__(self, data=None, comparison_columns=None, case_sensitive=None, accumulate=None, data_sequence_column=None, data_order_column=None)
- DESCRIPTION:
The StringSimilarity function calculates the similarity between two
strings, using the Jaro, Jaro-Winkler, N-gram, or Levenshtein
distance. The similarity is a value in the range [0, 1].
PARAMETERS:
data:
Required Argument.
Specifies the teradataml DataFrame that contains the string pairs to be compared.
data_order_column:
Optional Argument.
Specifies Order By columns for data.
Values to this argument can be provided as a list, if multiple
columns are used for ordering.
Types: str OR list of Strings (str)
comparison_columns:
Required Argument.
Specifies pairs of input teradataml DataFrame columns that contain
strings to be compared (column1 and column2), how to compare them
(comparison_type), and (optionally) a constant and the name of the
output column for their similarity (output_column). The similarity is
a value in the range [0, 1].
For comparison_type, use one of these values:
• "jaro": Jaro distance
• "jaro_winkler": Jaro-Winkler distance (1 for an exact match, 0 for no similarity).
Note:
If you specify this comparison type when teradataml is
connected to Vantage 1.3, you can specify the value of
factor p with constant (0 ≤ p ≤ 0.25).
Default: p = 0.1
• "n_gram": N-gram similarity. If you specify this comparison type,
you can specify the value of N with constant.
• "LD": Levenshtein distance (the number of edits needed to
transform one string into the other, where edits include
insertions, deletions, or substitutions of individual
characters).
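The comparison types above can be illustrated with a minimal pure-Python
sketch of the textbook definitions. This is not the Vantage implementation:
the function names levenshtein, jaro, and jaro_winkler are hypothetical, the
prefix length cap of 4 is the conventional Jaro-Winkler choice and is assumed
here, and levenshtein returns the raw edit count because this text does not
state how the edit count is mapped into the range [0, 1].

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, or substitutions
    needed to turn a into b (textbook dynamic-programming form)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def jaro(a: str, b: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        low = max(0, i - window)
        high = min(len(b), i + window + 1)
        for j in range(low, high):
            if not b_matched[j] and b[j] == ch:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as
    # half a transposition each.
    a_seq = [c for c, m in zip(a, a_matched) if m]
    b_seq = [c for c, m in zip(b, b_matched) if m]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a)
            + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, p: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix, with factor p."""
    if not 0 <= p <= 0.25:
        raise ValueError("p must satisfy 0 <= p <= 0.25")
    sim = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):  # assumed cap of 4 prefix characters
        if x != y:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)
```

With at most 4 prefix characters counted, the bound p ≤ 0.25 keeps the
boosted value from exceeding 1.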
You can specify a different comparison_type for every pair of
columns. The default output_column is "sim_i", where i is the
sequence number of the column pair.
Types: str OR list of Strings (str)
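Each comparison_columns entry is a single string. The sketch below assembles
such strings in the form used by the examples in this docstring,
"comparison_type (column1, column2[, constant]) AS output_column". The helper
comparison_spec is hypothetical and not part of teradataml, and the layout is
inferred from those examples rather than from a formal grammar.

```python
def comparison_spec(comparison_type, column1, column2,
                    constant=None, output_column=None):
    """Build one comparison_columns entry (hypothetical helper)."""
    args = [column1, column2]
    if constant is not None:
        # The constant is the factor p for "jaro_winkler" or N for "n_gram".
        args.append(str(constant))
    spec = f"{comparison_type} ({', '.join(args)})"
    if output_column is not None:
        # Without an AS clause, the documented default name "sim_i" applies.
        spec += f" AS {output_column}"
    return spec


specs = [
    comparison_spec("jaro", "src_text1", "tar_text",
                    output_column="jaro1_sim"),
    comparison_spec("jaro_winkler", "src_text1", "tar_text",
                    constant=0.25, output_column="jw1_sim"),
]
```

The resulting list could then be passed as comparison_columns=specs.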
case_sensitive:
Optional Argument.
Specifies whether string comparison is case-sensitive. The default
value is False. You can specify either one value for all pairs or
one value for each pair. If you specify one value for each pair, then
the ith value applies to the ith pair.
Types: bool OR list of bools
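The pairing rule for case_sensitive can be sketched as follows. The helper
expand_case_sensitive is hypothetical and only illustrates how a single value
or a per-pair list lines up with the column pairs. It is an assumption that a
length mismatch is rejected; this text does not say how the function itself
reports one.

```python
def expand_case_sensitive(case_sensitive, n_pairs):
    """Return one flag per column pair (hypothetical helper)."""
    if isinstance(case_sensitive, bool):
        # A single value applies to every pair.
        return [case_sensitive] * n_pairs
    flags = list(case_sensitive)
    if len(flags) != n_pairs:
        # Assumed behavior: refuse a list that does not cover every pair.
        raise ValueError(
            f"expected {n_pairs} case_sensitive values, got {len(flags)}")
    # The ith flag applies to the ith pair.
    return flags
```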
accumulate:
Optional Argument.
Specifies the names of input teradataml DataFrame columns to be
copied to the output table.
Types: str OR list of Strings (str)
data_sequence_column:
Optional Argument.
Specifies the list of column(s) that uniquely identifies each row of
the input argument "data". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: str OR list of Strings (str)
RETURNS:
Instance of StringSimilarity.
Output teradataml DataFrames can be accessed using attribute
references, such as StringSimilarityObj.<attribute_name>.
Output teradataml DataFrame attribute name is:
result
RAISES:
TeradataMlException
EXAMPLES:
# Load example data.
load_example_data("stringsimilarity", "strsimilarity_input")
# Create teradataml DataFrame objects.
strsimilarity_input = DataFrame.from_table("strsimilarity_input")
# Example 1 - Using multiple comparison types on "src_text1" with custom output columns.
stringsimilarity_out1 = StringSimilarity(data=strsimilarity_input,
                                         comparison_columns=['jaro (src_text1, tar_text) AS jaro1_sim',
                                                             'LD (src_text1, tar_text, 2) AS ld1_sim',
                                                             'n_gram (src_text1, tar_text, 2) AS ngram1_sim',
                                                             'jaro_winkler (src_text1, tar_text, 0.25) AS jw1_sim'],
                                         case_sensitive=True,
                                         accumulate=["id", "src_text1", "tar_text"],
                                         data_sequence_column='id')
# Print result dataframe.
print(stringsimilarity_out1.result)
# Example 2 - Using multiple comparison types on "src_text2" with custom output columns.
stringsimilarity_out2 = StringSimilarity(data=strsimilarity_input,
                                         comparison_columns=['jaro (src_text2, tar_text) AS jaro2_sim',
                                                             'LD (src_text2, tar_text, 2) AS ld2_sim',
                                                             'n_gram (src_text2, tar_text, 2) AS ngram2_sim',
                                                             'jaro_winkler (src_text2, tar_text, 0.25) AS jw2_sim'],
                                         case_sensitive=True,
                                         accumulate=["id", "src_text2", "tar_text"],
                                         data_sequence_column='id')
# Print result dataframe.
print(stringsimilarity_out2.result)
# Example 3 - Using a list of case_sensitive values, one per column pair.
# Note: The length of the case_sensitive list must match the length of the
# comparison_columns list.
stringsimilarity_out3 = StringSimilarity(data=strsimilarity_input,
                                         comparison_columns=["jaro (src_text2, tar_text) AS jaro2_case_sim",
                                                             "jaro (src_text2, tar_text) AS jaro2_nocase_sim"],
                                         case_sensitive=[True, False],
                                         accumulate=["id", "src_text2", "tar_text"])
# Print result dataframe.
print(stringsimilarity_out3.result)
- __repr__(self)
- Returns the string representation for a StringSimilarity class instance.
- get_build_time(self)
- Returns the build time of the algorithm in seconds.
When the model object is created using retrieve_model(), the returned value is
the one saved in the Model Catalog.
- get_prediction_type(self)
- Returns the prediction type of the algorithm.
When the model object is created using retrieve_model(), the returned value is
the one saved in the Model Catalog.
- get_target_column(self)
- Returns the target column of the algorithm.
When the model object is created using retrieve_model(), the returned value is
the one saved in the Model Catalog.
- show_query(self)
- Returns the underlying SQL query.
When the model object is created using retrieve_model(), None is returned.
|