16.20 - NGramSplitter - Teradata Database - Teradata Vantage NewSQL Engine

Teradata Vantage™ - NewSQL Engine Analytic Functions

Product: Teradata Database, Teradata Vantage NewSQL Engine
Release Number: 16.20
Release Date: July 2019
Content Type: Programming Reference
Publication ID: B035-1206-162K
Language: English (United States)

The NGramSplitter function tokenizes (splits) an input stream of text and outputs n multigrams (called n-grams) based on the specified Reset, Punctuation, and Delimiter syntax elements. NGramSplitter first splits the text into sentences, then removes punctuation characters from them, and finally splits the words into n-grams.
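The three stages described above can be sketched in Python. This is only an illustration of the idea, not the function's implementation: the helper name `split_ngrams`, its parameters, and the character sets passed as defaults are hypothetical values chosen for the example, and the sketch assumes that an n-gram never spans a reset character.

```python
import re

def split_ngrams(text, n, reset=".,?!", punctuation="()", delimiter=" "):
    """Return the n-grams of text (hypothetical helper, illustrative only)."""
    grams = []
    # Stage 1: split the text into sentences at any reset character.
    for sentence in re.split("[" + re.escape(reset) + "]", text):
        # Stage 2: remove punctuation characters from the sentence.
        cleaned = "".join(ch for ch in sentence if ch not in punctuation)
        # Stage 3: split into words on the delimiter, then slide a
        # window of n words across the sentence.
        words = [w for w in cleaned.split(delimiter) if w]
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams

bigrams = split_ngrams("Machine learning is fun. It works (mostly)!", 2)
```

In this sketch, "fun It" is not produced because the period ends the first sentence before any bigram is formed.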

NGramSplitter provides more flexibility than standard tokenization when performing text analysis. Many two-word phrases carry important meaning (for example, "machine learning") that single-word tokens do not capture. N-grams, combined with additional analytical techniques, can be useful for performing sentiment analysis, topic identification, and document classification.

NGramSplitter considers each input row to be one document, and returns a row for each unique n-gram in each document. NGramSplitter also returns, for each document, the count of each n-gram and the total number of n-grams.
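The shape of this per-document output can be sketched in Python as follows. The helper name `ngram_rows` and the tuple layout (document ID, n-gram, frequency, total) are assumptions made for illustration and are not the function's actual output columns. The sketch also assumes that the total counts every n-gram occurrence in the document, repeats included, and it uses plain whitespace splitting in place of the Reset, Punctuation, and Delimiter handling.

```python
from collections import Counter

def ngram_rows(documents, n):
    """Return one (doc_id, ngram, frequency, total) tuple per unique
    n-gram in each document (hypothetical layout, illustrative only)."""
    rows = []
    for doc_id, text in documents.items():
        words = text.split()
        grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
        counts = Counter(grams)   # frequency of each unique n-gram
        total = len(grams)        # all n-grams in the document, repeats included
        for gram, freq in counts.items():
            rows.append((doc_id, gram, freq, total))
    return rows

rows = ngram_rows({1: "to be or not to be", 2: "machine learning"}, 2)
```

Here document 1 contains five bigrams, four of them unique, so it contributes four rows, and the row for "to be" carries a frequency of 2.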

  • This function requires the UTF8 client character set.
  • This function does not support Pass Through Characters (PTCs).

    For information about PTCs, see Teradata Vantage™ NewSQL Engine International Character Set Support, B035-1125.