StringSimilarity Function | Teradata Vantage - StringSimilarity - Analytics Database

Database Analytic Functions

Deployment
VantageCloud
VantageCore
Edition
VMware
Enterprise
IntelliFlex
Product
Analytics Database
Release Number
17.20
Published
June 2022
Last Edition
2025-11-06
Product Category
Teradata Vantage™

String similarity functions are used in data cleaning to identify duplicates and near-duplicates in datasets so that they can be merged or removed. These functions compare strings and quantify how similar they are to each other.

In data cleaning, string similarity functions help identify and merge duplicate records, even when the duplicates contain slight variations or formatting errors. This matters most for large datasets, where identifying duplicates manually is time-consuming and error-prone.

A common use case for string similarity functions is merging records from different sources. For example, if two data sources contain records for the same individual, but one source uses a nickname or misspells the individual's name, string similarity functions can identify the records as likely matches so that they can be merged.

String similarity functions can also be used to find records that contain typos or other errors. For example, if a dataset contains records for "John Doe" and "Jon Doe," a string similarity function can flag these records as likely duplicates so that one of them can be removed.
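The "John Doe" and "Jon Doe" case can be sketched outside the database. The Python example below is only an illustration: it uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure, which is not one of the comparison methods of the StringSimilarity function, and the helper names and the 0.85 threshold are assumptions chosen for the example. It flags candidate pairs and leaves the decision to merge or remove to a later step.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


def near_duplicates(names, threshold=0.85):
    """Return (name1, name2, score) for every pair at or above the threshold.

    The threshold is a hypothetical value; tune it for real data.
    """
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            score = similarity(names[i], names[j])
            if score >= threshold:
                pairs.append((names[i], names[j], score))
    return pairs


names = ["John Doe", "Jon Doe", "Jane Roe"]
candidates = near_duplicates(names)

for first, second, score in candidates:
    print(f"{first!r} and {second!r} look like duplicates (score {score:.2f})")
```

Comparing every pair is quadratic in the number of records, so this approach suits small samples only.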

By surfacing duplicate and inconsistent entries, string similarity functions help improve the accuracy and consistency of datasets.

The StringSimilarity function calculates the similarity between two strings, using a specified comparison method. The similarity is a value in the range [0, 1].
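To make the idea of a similarity value in the range [0, 1] concrete, the following Python sketch derives one from the Levenshtein edit distance. The normalization used here (one minus the distance divided by the longer string's length) is a common convention and an assumption for this example; it is not claimed to be the exact formula StringSimilarity applies. The function names and the `case_sensitive` parameter are likewise hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, and substitutions
    needed to turn a into b, using a two-row dynamic programming table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_similarity(a: str, b: str, case_sensitive: bool = True) -> float:
    """Map the edit distance onto [0, 1], where 1.0 means identical strings."""
    if not case_sensitive:
        a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


score = normalized_similarity("John Doe", "Jon Doe")
print(score)
```

Dividing by the longer length keeps the result within [0, 1], because the edit distance can never exceed the length of the longer string.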

  • This function requires the UTF8 client character set for UNICODE data.
  • This function does not support Pass Through Characters (PTCs).

    For information about PTCs, see International Character Set Support, B035-1125.

  • When comparing strings, the function assumes that they are in the same Unicode script in Normalization Form C (NFC).
  • When used with this function, the ORDER BY clause supports only ASCII collation.
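The NFC assumption in the list above matters because the same visible text can be encoded with different code point sequences. As a hedged illustration, the Python sketch below normalizes strings on the client side with the standard `unicodedata` module before they are loaded or compared; the `to_nfc` helper is a hypothetical name, and whether this preprocessing step is needed depends on how the data was produced.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (composed form)."""
    return unicodedata.normalize("NFC", text)


composed = "caf\u00e9"     # "é" stored as a single code point
decomposed = "cafe\u0301"  # "e" followed by a combining acute accent

# The two strings render the same but differ code point by code point.
same_before = composed == decomposed

# After normalization to NFC they compare equal.
same_after = to_nfc(composed) == to_nfc(decomposed)

print(same_before, same_after)
```

Without normalization, a character-based comparison would treat the two spellings as different strings even though they display identically.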