The NGramSplitter_MLE function tokenizes (splits) an input stream of text and outputs n-grams (sequences of n consecutive tokens) based on the specified delimiter and reset parameters. NGramSplitter_MLE provides more flexibility than standard tokenization when performing text analysis. Many two-word phrases carry important meaning (for example, "machine learning") that unigrams (single-word tokens) do not capture. Combined with additional analytical techniques, n-grams can be useful for sentiment analysis, topic identification, and document classification.
NGramSplitter_MLE considers each input row to be one document, and returns a row for each unique n-gram in each document. For each document, NGramSplitter_MLE also returns the count of each n-gram and the total number of n-grams in that document.
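
The per-document output described above can be illustrated with a minimal Python sketch. This is not the function's implementation. The helper name `ngram_counts`, the default reset characters `.,?!`, the lowercasing of tokens, and the rule that an n-gram never spans a reset character are all assumptions made for illustration.

```python
import re
from collections import Counter


def ngram_counts(document, n=2, delimiter=" ", reset=".,?!"):
    """Hypothetical helper: return (n-gram counts, total n-grams) for one document."""
    counts = Counter()
    # Assumption: split on reset characters first so that no n-gram
    # spans a sentence boundary.
    segments = re.split("[" + re.escape(reset) + "]", document)
    for segment in segments:
        # Assumption: tokens are lowercased; empty tokens are dropped.
        tokens = [t.lower() for t in segment.split(delimiter) if t]
        for i in range(len(tokens) - n + 1):
            counts[delimiter.join(tokens[i:i + n])] += 1
    return counts, sum(counts.values())


# Each input row is treated as one document, keyed here by a made-up id.
documents = {
    1: "Machine learning is fun. Machine learning is useful.",
}

# One output row per unique n-gram per document:
# (document id, n-gram, count of that n-gram, total n-grams in the document)
rows = []
for doc_id, text in documents.items():
    counts, total = ngram_counts(text, n=2)
    for ngram, frequency in sorted(counts.items()):
        rows.append((doc_id, ngram, frequency, total))

for row in rows:
    print(row)
```

In this sketch the bigram "machine learning" appears twice in document 1, while "fun machine" never appears because the period acts as a reset character between the two sentences.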