Teradata Python Package Function Reference - StringSimilarity

teradataml.analytics.mle.StringSimilarity = class StringSimilarity(builtins.object)

Methods defined here:

__init__(self, data=None, comparison_columns=None, case_sensitive=None, accumulate=None, data_sequence_column=None, data_order_column=None)
    DESCRIPTION:
        The StringSimilarity function calculates the similarity between two
        strings, using the Jaro, Jaro-Winkler, N-gram, or Levenshtein
        distance. The similarity is a value in the range [0, 1].

    PARAMETERS:
        data:
            Required Argument.
            Specifies the teradataml DataFrame that contains the string
            pairs to be compared.

        data_order_column:
            Optional Argument.
            Specifies Order By columns for data.
            Values to this argument can be provided as a list, if multiple
            columns are used for ordering.
            Types: str OR list of Strings (str)

        comparison_columns:
            Required Argument.
            Specifies pairs of input teradataml DataFrame columns that
            contain strings to be compared (column1 and column2), how to
            compare them (comparison_type), and (optionally) a constant and
            the name of the output column for their similarity
            (output_column). As the examples below show, each entry is a
            string of the form:
                comparison_type (column1, column2[, constant]) AS output_column
            The similarity is a value in the range [0, 1].
            For comparison_type, use one of these values:
                • "jaro": Jaro distance.
                • "jaro_winkler": Jaro-Winkler distance (1 for an exact
                  match, 0 for no similarity).
                • "n_gram": N-gram similarity. If you specify this
                  comparison type, you can specify the value of N with
                  constant.
                • "LD": Levenshtein distance (the number of edits needed
                  to transform one string into the other, where edits
                  include insertions, deletions, or substitutions of
                  individual characters).
            You can specify a different comparison_type for every pair of
            columns.
            The default output_column is "sim_i", where i is the sequence
            number of the column pair.
            Types: str OR list of Strings (str)

        case_sensitive:
            Optional Argument.
            Specifies whether string comparison is case-sensitive.
            The default value is False.
            You can specify either one value for all pairs or one value for
            each pair. If you specify one value for each pair, then the ith
            value applies to the ith pair.
            Types: bool OR list of bools

        accumulate:
            Optional Argument.
            Specifies the names of input teradataml DataFrame columns to be
            copied to the output table.
            Types: str OR list of Strings (str)

        data_sequence_column:
            Optional Argument.
            Specifies the list of column(s) that uniquely identifies each
            row of the input argument "data". The argument is used to
            ensure deterministic results for functions which produce
            results that vary from run to run.
            Types: str OR list of Strings (str)

    RETURNS:
        Instance of StringSimilarity.
        Output teradataml DataFrames can be accessed using attribute
        references, such as StringSimilarityObj.<attribute_name>.
        Output teradataml DataFrame attribute name is:
            result

    RAISES:
        TeradataMlException

    EXAMPLES:
        # Load example data.
        load_example_data("stringsimilarity", "strsimilarity_input")

        # Create teradataml DataFrame objects.
        strsimilarity_input = DataFrame.from_table("strsimilarity_input")

        # Example 1 - Compare src_text1 with tar_text using several
        # comparison types, each with a custom output column.
        stringsimilarity_out1 = StringSimilarity(data=strsimilarity_input,
                                                 comparison_columns=['jaro (src_text1, tar_text) AS jaro1_sim',
                                                                     'LD (src_text1, tar_text, 2) AS ld1_sim',
                                                                     'n_gram (src_text1, tar_text, 2) AS ngram1_sim',
                                                                     'jaro_winkler (src_text1, tar_text, 2) AS jw1_sim'],
                                                 case_sensitive=True,
                                                 accumulate=["id", "src_text1", "tar_text"],
                                                 data_sequence_column='id')

        # Print result dataframe.
        print(stringsimilarity_out1.result)

        # Example 2 - Compare src_text2 with tar_text using several
        # comparison types, each with a custom output column.
        stringsimilarity_out2 = StringSimilarity(data=strsimilarity_input,
                                                 comparison_columns=['jaro (src_text2, tar_text) AS jaro2_sim',
                                                                     'LD (src_text2, tar_text, 2) AS ld2_sim',
                                                                     'n_gram (src_text2, tar_text, 2) AS ngram2_sim',
                                                                     'jaro_winkler (src_text2, tar_text, 2) AS jw2_sim'],
                                                 case_sensitive=True,
                                                 accumulate=["id", "src_text2", "tar_text"],
                                                 data_sequence_column='id')

        # Print result dataframe.
        print(stringsimilarity_out2.result)

        # Example 3 - Using a list of case_sensitive values, one per comparison.
        # Note: The length of the case_sensitive list must match the
        # length of the comparison_columns list.
        stringsimilarity_out3 = StringSimilarity(data=strsimilarity_input,
                                                 comparison_columns=["jaro (src_text2, tar_text) AS jaro2_case_sim",
                                                                     "jaro (src_text2, tar_text) AS jaro2_nocase_sim"],
                                                 case_sensitive=[True, False],
                                                 accumulate=["id", "src_text2", "tar_text"])

        # Print result dataframe.
        print(stringsimilarity_out3.result)
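To make the "LD" comparison type and the case_sensitive flag more concrete, the following is a minimal pure-Python sketch of Levenshtein distance as described above (insertions, deletions, and substitutions of individual characters). It does not call teradataml and is not the function's implementation. The names levenshtein and normalized_similarity are hypothetical, and the mapping of the edit count into [0, 1] shown here (1 - distance / length of the longer string) is one common normalization, assumed for illustration only; the reference above does not state which formula the function uses.

```python
def levenshtein(a: str, b: str, case_sensitive: bool = True) -> int:
    """Count the single-character edits needed to turn a into b."""
    if not case_sensitive:
        # Mirror a case-insensitive comparison by folding both strings.
        a, b = a.lower(), b.lower()

    # prev[j] is the distance between the already-processed prefix of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_similarity(a: str, b: str, case_sensitive: bool = True) -> float:
    """Map the edit count into [0, 1]. ASSUMPTION: 1 - distance / longest."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b, case_sensitive) / longest


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(levenshtein("ABC", "abc"), levenshtein("ABC", "abc", case_sensitive=False))
    print(normalized_similarity("kitten", "sitting"))
```

Under this assumed normalization, identical strings score 1 and strings that share no characters in position score 0, which matches the [0, 1] range the reference describes for the output column.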

__repr__(self): Returns the string representation for a StringSimilarity class instance.