| |
- editdistance(string_column_expression1, string_column_expression2, ci, cd, cs, ct)
- DESCRIPTION:
Function returns the minimum number of edit operations (insertions, deletions,
substitutions and transpositions) required to transform string1 (string_column_expression1)
into string2 (string_column_expression2).
EDITDISTANCE measures the similarity between two strings: the fewer insertions,
deletions, substitutions, and transpositions required, the more similar the strings.
The operations are based on the Damerau-Levenshtein distance algorithm, modified
so that each type of operation can be assigned its own cost.
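For intuition, the costed distance can be sketched in plain Python. This is a minimal
sketch, assuming the optimal string alignment variant of Damerau-Levenshtein, in which
only adjacent characters can be transposed. The database's own modifications are not
specified here, so results may differ in edge cases. The helper name edit_distance is
hypothetical and not part of teradataml; its ci, cd, cs, and ct parameters mirror the
cost arguments of the function.

```python
def edit_distance(s1, s2, ci=1, cd=1, cs=1, ct=1):
    """Illustrative costed edit distance (optimal string alignment variant).

    Hypothetical helper for explanation only; not the database implementation.
    ci, cd, cs, ct are the costs of insert, delete, substitute, and transpose.
    """
    n, m = len(s1), len(s2)
    # d[i][j] holds the cheapest cost of turning s1[:i] into s2[:j].
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * cd  # delete every character of s1[:i]
    for j in range(1, m + 1):
        d[0][j] = j * ci  # insert every character of s2[:j]

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub_cost = 0 if s1[i - 1] == s2[j - 1] else cs
            best = min(
                d[i - 1][j] + cd,            # delete s1[i-1]
                d[i][j - 1] + ci,            # insert s2[j-1]
                d[i - 1][j - 1] + sub_cost,  # match or substitute
            )
            # Transposition of two adjacent characters.
            if (i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                best = min(best, d[i - 2][j - 2] + ct)
            d[i][j] = best
    return d[n][m]


print(edit_distance("Novice", "Beginner"))              # 6
print(edit_distance("Novice", "Beginner", 2, 1, 1, 2))  # 8
print(edit_distance("ab", "ba"))                        # 1 (one transposition)
```

Raising ci from 1 to 2 increases the "Novice" to "Beginner" distance from 6 to 8 because
the target string is two characters longer, so at least two insertions are unavoidable.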
PARAMETERS:
string_column_expression1:
Required Argument.
Specifies a ColumnExpression of a string column or a string literal.
Format of a ColumnExpression of a string column: '<dataframe>.<dataframe_column>.expression'.
Supported column types are: CHARACTER, VARCHAR, or CLOB.
string_column_expression2:
Required Argument.
Specifies a ColumnExpression of a string column or a string literal.
Format of a ColumnExpression of a string column: '<dataframe>.<dataframe_column>.expression'.
Supported column types are: CHARACTER, VARCHAR, or CLOB.
ci:
Optional Argument.
Specifies the relative cost of an insert operation.
The value specified must be a non-negative integer.
If not specified, a default value of 1 is used.
cd:
Optional Argument.
Specifies the relative cost of a delete operation.
The value specified must be a non-negative integer.
If not specified, a default value of 1 is used.
cs:
Optional Argument.
Specifies the relative cost of a substitute operation.
The value specified must be a non-negative integer.
If not specified, a default value of 1 is used.
ct:
Optional Argument.
Specifies the relative cost of a transpose operation.
The value specified must be a non-negative integer.
If not specified, a default value of 1 is used.
NOTE:
Function accepts positional arguments only.
EXAMPLES:
# Load the data to run the example.
>>> load_example_data("dataframe", "admissions_train")
>>>
# Create a DataFrame on 'admissions_train' table.
>>> admissions_train = DataFrame("admissions_train")
>>> admissions_train
masters gpa stats programming admitted
id
22 yes 3.46 Novice Beginner 0
36 no 3.00 Advanced Novice 0
15 yes 4.00 Advanced Advanced 1
38 yes 2.65 Advanced Beginner 1
5 no 3.44 Novice Novice 0
17 no 3.83 Advanced Advanced 1
34 yes 3.85 Advanced Beginner 0
13 no 4.00 Advanced Novice 1
26 yes 3.57 Advanced Advanced 1
19 yes 1.98 Advanced Advanced 0
>>>
# Import func from sqlalchemy to execute editdistance function.
>>> from sqlalchemy import func
# Example 1: Calculate the EDITDISTANCE between values in "stats" and "programming" columns.
# Create a sqlalchemy Function object.
>>> editdistance_func_ = func.editdistance(admissions_train.stats.expression, admissions_train.programming.expression)
>>>
# Pass the Function object as input to DataFrame.assign().
>>> df = admissions_train.assign(editdistance_gpa_=editdistance_func_)
>>> print(df)
masters gpa stats programming admitted editdistance_gpa_
id
15 yes 4.00 Advanced Advanced 1 0
7 yes 2.33 Novice Novice 1 0
22 yes 3.46 Novice Beginner 0 6
17 no 3.83 Advanced Advanced 1 0
13 no 4.00 Advanced Novice 1 5
38 yes 2.65 Advanced Beginner 1 6
26 yes 3.57 Advanced Advanced 1 0
5 no 3.44 Novice Novice 0 0
34 yes 3.85 Advanced Beginner 0 6
40 yes 3.95 Novice Beginner 0 6
>>>
# Example 2: Calculate the EDITDISTANCE between values in "stats" and "programming" columns,
# passing a cost for each edit operation (ci=2, cd=1, cs=1, ct=2).
# Create a sqlalchemy Function object.
# Note: We are using 'EDITDISTANCE' as function name. Function name is case-insensitive.
>>> editdistance_func_ = func.EDITDISTANCE(admissions_train.stats.expression, admissions_train.programming.expression, 2, 1, 1, 2)
>>>
# Pass the Function object as input to DataFrame.assign().
>>> df = admissions_train.assign(editdistance_gpa_=editdistance_func_)
>>> print(df)
masters gpa stats programming admitted editdistance_gpa_
id
22 yes 3.46 Novice Beginner 0 8
36 no 3.00 Advanced Novice 0 5
15 yes 4.00 Advanced Advanced 1 0
38 yes 2.65 Advanced Beginner 1 6
5 no 3.44 Novice Novice 0 0
17 no 3.83 Advanced Advanced 1 0
34 yes 3.85 Advanced Beginner 0 6
13 no 4.00 Advanced Novice 1 5
26 yes 3.57 Advanced Advanced 1 0
19 yes 1.98 Advanced Advanced 0 0
>>>
|