1.0 - 8.00 - Text Analysis Function Families - Teradata Vantage

Teradata® Vantage Machine Learning Engine Analytic Function Reference

Product: Teradata Vantage
Release Number: 1.0, 8.00
Release Date: May 2019
Content Type: Programming Reference
Publication ID: B700-4003-098K
Language: English (United States)
| Function Family | Description | Usage Notes |
|---|---|---|
| Latent Dirichlet Allocation (LDA) | Functions for training a topic model and applying it to a set of documents. | |
| LevenshteinDistance | Computes the Levenshtein distance between two strings. The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to convert one string into the other. | Useful for matching strings when small errors are common, as with user-entered text. |
| NaiveBayesTextClassifier | Set of supervised learning functions for training and applying a document classifier based on a naïve Bayes classification algorithm that has been optimized for text applications. | Input is typically created by the TextTokenizer or TextParser function. |
| Named Entity Recognition (NER) | Finds specified entities in text. Typical named entities are phone numbers, email addresses, dates, personal names, and geographical locations. The NER functions let you specify how to identify named entities. | ML Engine provides two sets of NER functions: one based on a Conditional Random Fields (CRF) algorithm and one based on a maximum entropy algorithm. The CRF model implementation supports English, Simplified Chinese, and Traditional Chinese text. The maximum entropy model implementation supports only English text. |
| NGrams | Tokenizes English input text into n-grams. An n-gram consists of n consecutive words in the input text. | Compare to TextTokenizer, which tokenizes English input text into single words. |
| POSTagger | Generates part-of-speech tags for words in input text, based on predefined models. | Supports English and Simplified Chinese text. Each row of input must contain a sentence. To preprocess English text into sentences, use the SentenceExtractor function. |
| SentenceExtractor | Extracts sentences from English input text. | You can use this function to preprocess input for the POSTagger or TextChunker function. |
| SentimentExtraction | Set of functions for training, applying, and evaluating a model for identifying the user sentiments expressed in text. | You can either train a model based on labeled input text or use a predefined dictionary model. Supports English, Simplified Chinese, and Traditional Chinese text. |
| TextClassifier | Set of supervised learning functions for training, applying, and evaluating a model for classifying text into predefined categories. | Supports English, Simplified Chinese, and Traditional Chinese text. |
| TextChunker | Divides input text into phrases. | The input text must be single sentences, each with a unique identifier. Input is typically created by the POSTagger function. If the input contains paragraphs, use the SentenceExtractor function to divide them into sentences. Compare to TextTokenizer, which tokenizes English input text into single words. |
| TextMorph | Converts each input word to standard form, using a lemmatization algorithm based on the WordNet 3.0 dictionary. | Input can be created by the POSTagger function. |
| TextParser | Extracts tokens from English text. | Offers more options than TextTokenizer does for English text, such as stemming and stop-word removal. |
| TextTagger | Tags (classifies) text, based on user-defined rules. | |
| TextTokenizer | Extracts tokens from Chinese, Japanese, or English text. | |
| TFIDF | Scores documents within a document set, based on the frequency of terms within each document relative to their frequency across the set of documents. | Converts text into numeric features for document classification and clustering. |
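
Three of the concepts in the table (Levenshtein distance, n-grams, and TF-IDF scoring) have compact textbook definitions, sketched below in Python for illustration only. This is not the ML Engine implementation or its SQL syntax. The function names `levenshtein`, `ngrams`, and `tf_idf` are hypothetical, tokenization is assumed to be lowercase whitespace splitting, and the TF-IDF weighting shown (relative term frequency times natural-log inverse document frequency) is one common variant that may differ from the formula the TFIDF function uses.

```python
import math
from collections import Counter
from typing import Dict, List


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or match
            ))
        prev = curr
    return prev[-1]


def ngrams(text: str, n: int) -> List[str]:
    """All runs of n consecutive words (assumed whitespace tokenization)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def tf_idf(docs: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Score each term in each document with tf * ln(N / df) (assumed variant)."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    doc_freq: Counter = Counter()
    for words in tokenized.values():
        doc_freq.update(set(words))  # count each term once per document
    n_docs = len(tokenized)
    scores: Dict[str, Dict[str, float]] = {}
    for doc_id, words in tokenized.items():
        counts = Counter(words)
        scores[doc_id] = {
            term: (count / len(words)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores
```

For example, `levenshtein("kitten", "sitting")` returns 3 (two substitutions and one insertion), and `ngrams("the quick brown fox", 2)` returns `["the quick", "quick brown", "brown fox"]`. Under this weighting, a term that appears in every document scores 0, which is why TF-IDF features emphasize terms that distinguish one document from the rest of the set.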