The NGramSplitter function tokenizes (splits) an input stream of text and outputs n multigrams (called n-grams) based on the specified Reset, Punctuation, and Delimiter syntax elements. NGramSplitter first splits sentences, next removes punctuation characters from them, and finally splits the words into n-grams.
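The three steps above can be sketched in Python. This is only an illustration of the described order of operations, not the function's implementation: the name `split_ngrams`, its parameters, and the character sets used as defaults are assumptions chosen for the example, and the sketch assumes that an n-gram does not span a sentence boundary.

```python
import re


def split_ngrams(text, n=2, reset=".,?!", punctuation="`~#^&*()-", delimiter=" "):
    """Hypothetical sketch: split sentences, strip punctuation, build n-grams."""
    ngrams = []
    # Step 1: split the text into sentences at any reset character.
    sentences = re.split("[" + re.escape(reset) + "]", text)
    for sentence in sentences:
        # Step 2: remove punctuation characters from the sentence.
        cleaned = sentence.translate(str.maketrans("", "", punctuation))
        # Step 3: split into words on the delimiter, then slide a window
        # of n words over them (assumed not to cross sentence boundaries).
        words = [w for w in cleaned.split(delimiter) if w]
        for i in range(len(words) - n + 1):
            ngrams.append(" ".join(words[i:i + n]))
    return ngrams


bigrams = split_ngrams("Machine learning is fun. It works!", n=2)
```

With these assumed settings, the second sentence contributes only "It works"; no n-gram pairs "fun" with "It" because the period ends the first sentence.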
NGramSplitter provides more flexibility than standard tokenization when performing text analysis. Many two-word phrases carry important meaning (for example, "machine learning") that single-word tokens do not capture. N-grams, combined with additional analytical techniques, can be useful for sentiment analysis, topic identification, and document classification.
NGramSplitter considers each input row to be one document, and returns a row for each unique n-gram in each document. NGramSplitter also returns, for each document, the counts of each n-gram and the total number of n-grams.
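The shape of that output can be illustrated with a short Python sketch. It assumes the n-grams of each document have already been extracted; the function name `ngram_rows` and the tuple layout (document id, n-gram, count, total) are hypothetical and do not reflect the function's actual output column names.

```python
from collections import Counter


def ngram_rows(doc_ngrams):
    """Hypothetical sketch: one row per unique n-gram per document."""
    rows = []
    for doc_id, ngrams in doc_ngrams.items():
        counts = Counter(ngrams)   # occurrences of each n-gram in this document
        total = len(ngrams)        # total number of n-grams in this document
        for ngram, frequency in counts.items():
            rows.append((doc_id, ngram, frequency, total))
    return rows


rows = ngram_rows({
    1: ["machine learning", "learning is", "machine learning"],
    2: ["it works"],
})
```

Here document 1 yields two rows, because "machine learning" occurs twice but is reported once with a count of 2, and both rows carry the document total of 3.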
- This function requires the UTF8 client character set.
- This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Advanced SQL Engine International Character Set Support, B035-1125.