7.00.02 - Summary of Text Analysis Function Families - Aster Analytics

Teradata Aster® Analytics Foundation User Guide Update 2

Product: Aster Analytics
Release Number: 7.00.02
Published: September 2017
Content Type: Programming Reference, User Guide
Publication ID: B700-1022-700K
Language: English (United States)
Last Update: 2018-04-17
| Function Family | Description | Usage Notes |
| --- | --- | --- |
| Latent Dirichlet Allocation (LDA) | Functions for training a topic model and applying it to a set of documents. | |
| LevenshteinDistance (LDist) | Computes the Levenshtein distance between two strings. The Levenshtein distance is the number of edits (single character changes) needed to convert one string into another. | Useful for matching strings when small errors are common, as with user-entered text. |
| NaiveBayesTextClassifier | Set of supervised learning functions for training and applying a document classifier based on a naïve Bayes classification algorithm that has been optimized for text applications. | Input is typically generated by the TextTokenizer or Text_Parser function. |
| Named Entity Recognition (NER) | Named Entity Recognition finds specified entities in text. Typical named entities are phone numbers, email addresses, dates, personal names, and geographical locations. | The NER functions let you specify how to identify named entities. Aster Analytics provides two sets of NER functions: one based on a Conditional Random Fields (CRF) algorithm and one based on a maximum entropy algorithm. The CRF model implementation supports English, Simplified Chinese, and Traditional Chinese text. The maximum entropy model implementation supports only English text. |
| nGram | Tokenizes English input text into n-grams. An n-gram consists of n consecutive words in the input text. | Compare to TextTokenizer, which tokenizes English input text into single words. |
| POSTagger | Generates parts-of-speech for words in input text, based on predefined models. | Supports English and Simplified Chinese text. Each row of input must contain a sentence. To preprocess English text into sentences, use the Sentenizer function. |
| Sentenizer | Extracts sentences from English input text. | You can use this function to preprocess input for the POSTagger or TextChunker function. |
| SentimentExtraction | Set of functions for training, applying, and evaluating a model for identifying the user sentiments expressed in text. | You can either train a model based on labeled input text or use a predefined dictionary model. Supports English, Simplified Chinese, and Traditional Chinese text. |
| TextClassifier | Set of supervised learning functions for training, applying, and evaluating a model for classifying text into predefined categories. | Supports English, Simplified Chinese, and Traditional Chinese text. |
| TextChunker | Divides input text into phrases. | The input text must be single sentences, each with a unique identifier. Input is typically generated by the POSTagger function. If the input contains paragraphs, use the Sentenizer function to divide them into sentences. Compare to TextTokenizer, which tokenizes English input text into single words. |
| TextMorph | Converts each input word to standard form, using a lemmatization algorithm based on the WordNet 3.0 dictionary. | Input can be generated by the POSTagger function. |
| Text_Parser | Extracts tokens from English text. | Offers more options than TextTokenizer does for English text, such as stemming and stop-word removal. |
| TextTagging | Tags (classifies) text, based on user-defined rules. | |
| TextTokenizer | Extracts tokens from Chinese, Japanese, or English text. | |
| TF_IDF | Scores documents within a document set, based on frequency of terms within the document relative to their frequency within the set of documents. | Converts text into numeric features for document classification and clustering. |
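
To make the LevenshteinDistance description concrete, the following is a minimal Python sketch of the standard dynamic-programming calculation. It is not the Aster LDist function. The name `levenshtein` is hypothetical, and the sketch assumes the usual definition in which an edit is a single-character insertion, deletion, or substitution.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b.

    Assumption: an edit is an insertion, deletion, or substitution.
    """
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

A small distance relative to the string length suggests a likely typo, which is the kind of user-entered-text matching the table mentions.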
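
The nGram row defines an n-gram as n consecutive words. A rough Python illustration of that definition follows. The helper name `word_ngrams` is hypothetical, and splitting on whitespace is a simplifying assumption; the actual nGram function may treat punctuation, case, and sentence boundaries differently.

```python
def word_ngrams(text: str, n: int) -> list[str]:
    """Return every run of n consecutive words in text, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Assumption: words are separated by whitespace only.
    words = text.split()
    # A text with fewer than n words yields an empty list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


print(word_ngrams("the quick brown fox", 2))
# ['the quick', 'quick brown', 'brown fox']
```

With `n=1` the result is the list of single words, which corresponds to the single-word tokenization that the table attributes to TextTokenizer.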
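
The TF_IDF row describes scoring a term by its frequency in one document relative to its frequency across the document set. The Python sketch below uses one common textbook weighting: term count divided by document length, multiplied by the natural log of (number of documents / number of documents containing the term). This section does not give the exact formula that the Aster TF_IDF function applies, so both the formula and the name `tf_idf` are assumptions for illustration only.

```python
import math
from collections import Counter


def tf_idf(docs: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Score each term of each tokenized document.

    Assumed weighting: tf = count / document length,
    idf = ln(number of documents / documents containing the term).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[doc_id] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


corpus = {
    "d1": ["text", "analysis"],
    "d2": ["text", "mining"],
}
scores = tf_idf(corpus)
```

Under this weighting, a term that appears in every document, such as `text` above, scores zero, while terms unique to one document score highest. The resulting per-document score dictionaries are the kind of numeric features that classification and clustering steps consume.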