NGrams - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product
Teradata Vantage
Release Number
8.00
1.0
Published
May 2019
Language
English (United States)
Last Update
2019-11-22
dita:id
B700-4003
lifecycle
previous
Product Category
Teradata Vantage™

The NGrams function tokenizes (splits) an input stream of text and outputs multigrams of n tokens each (called n-grams), based on the specified delimiter and reset parameters. NGrams provides more flexibility than standard tokenization when performing text analysis. Many two-word phrases carry important meaning (for example, "machine learning") that unigrams (single-word tokens) do not capture. N-grams, combined with additional analytical techniques, can be useful for performing sentiment analysis, topic identification, and document classification.

NGrams considers each input row to be one document, and returns a row for each unique n-gram in each document. NGrams also returns, for each document, the counts of each n-gram and the total number of n-grams.
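
The behavior described above can be illustrated with a minimal Python sketch. This is not the NGrams implementation or its SQL syntax; the function name `ngrams_per_document`, its parameters, the default reset characters, and the lowercasing step are all assumptions made for illustration. It treats one string as one document, splits it into tokens on a delimiter, starts over at each reset character so that no n-gram spans a sentence boundary, and returns the count of each unique n-gram along with the total number of n-grams in the document.

```python
import re
from collections import Counter


def ngrams_per_document(text, n=2, delimiter=" ", reset=".,?!"):
    """Hypothetical helper: count the n-grams in a single document.

    Returns (counts, total), where counts maps each unique n-gram to its
    frequency in the document and total is the number of n-grams found.
    """
    # Split at reset characters so that n-grams never cross a boundary.
    segments = re.split("[" + re.escape(reset) + "]", text)
    counts = Counter()
    for segment in segments:
        # Lowercasing is an assumption of this sketch, not a documented default.
        tokens = [t.lower() for t in segment.split(delimiter) if t]
        # Slide a window of n tokens across the segment.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts, sum(counts.values())


doc = "Machine learning is fun. Machine learning is useful!"
counts, total = ngrams_per_document(doc, n=2)
for gram, freq in sorted(counts.items()):
    # One output row per unique n-gram, with its count and the document total.
    print(gram, freq, total)
```

In this sketch, "fun machine" is never produced because the period acts as a reset point between the two sentences.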

For general information about tokenization, see http://en.wikipedia.org/wiki/Lexical_analysis#Tokenizer.