Function Family | Description | Usage Notes |
---|---|---|
Latent Dirichlet Allocation (LDA) | Functions for training a topic model and applying it to a set of documents. | |
LevenshteinDistance (LDist) | Computes the Levenshtein distance between two strings. The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to convert one string into another. | Useful for matching strings when small errors are common, as with user-entered text. |
NaiveBayesTextClassifier | Set of supervised learning functions for training and applying a document classifier based on a naïve Bayes classification algorithm that has been optimized for text applications. | Input is typically generated by the TextTokenizer or Text_Parser function. |
Named Entity Recognition (NER) | Finds specified entities in text. Typical named entities are phone numbers, email addresses, dates, personal names, and geographical locations. The NER functions let you specify how to identify named entities. Aster Analytics provides two sets of NER functions: one based on a Conditional Random Fields (CRF) algorithm and one based on a maximum entropy algorithm. | The CRF model implementation supports English, Simplified Chinese, and Traditional Chinese text. The maximum entropy model implementation supports only English text. |
nGram | Tokenizes English input text into n-grams. An n-gram consists of n consecutive words in the input text. | Compare to TextTokenizer, which tokenizes English input text into single words. |
POSTagger | Generates part-of-speech tags for words in input text, based on predefined models. | Supports English and Simplified Chinese text. Each row of input must contain a sentence. To preprocess English text into sentences, use the Sentenizer function. |
Sentenizer | Extracts sentences from English input text. | You can use this function to preprocess input for the POSTagger or TextChunker function. |
SentimentExtraction | Set of functions for training, applying, and evaluating a model for identifying the user sentiments expressed in text. | You can either train a model based on labeled input text or use a predefined dictionary model. Supports English, Simplified Chinese, and Traditional Chinese text. |
TextClassifier | Set of supervised learning functions for training, applying, and evaluating a model for classifying text into predefined categories. | Supports English, Simplified Chinese, and Traditional Chinese text. |
TextChunker | Divides input text into phrases. The input text must be single sentences, each with a unique identifier. | Input is typically generated by the POSTagger function. If the input contains paragraphs, use the Sentenizer function to divide them into sentences. Compare to TextTokenizer, which tokenizes English input text into single words. |
TextMorph | Converts each input word to standard form, using a lemmatization algorithm based on the WordNet 3.0 dictionary. | Input can be generated by the POSTagger function. |
Text_Parser | Extracts tokens from English text. | Offers more options than TextTokenizer does for English text, such as stemming and stop-word removal. |
TextTagging | Tags (classifies) text, based on user-defined rules. | |
TextTokenizer | Extracts tokens from Chinese, Japanese, or English text. | |
TF_IDF | Scores the terms in each document of a document set, based on each term's frequency within the document relative to its frequency across the set of documents. | Converts text into numeric features for document classification and clustering. |
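
To make three of the concepts in the table concrete (Levenshtein distance, n-grams, and TF-IDF scoring), the following is a minimal Python sketch. It is not the Aster Analytics implementation and does not use the Aster function syntax. The names `levenshtein`, `ngrams`, and `tf_idf` are hypothetical helpers, whitespace splitting stands in for real tokenization, and the TF-IDF weighting shown (relative term frequency times `log(N / df)`) is one common variant, assumed here for illustration only.

```python
from __future__ import annotations

import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def ngrams(text: str, n: int) -> list[str]:
    """All runs of n consecutive whitespace-separated words in text."""
    words = text.split()  # assumption: whitespace tokenization
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def tf_idf(docs: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Score each term in each tokenized document.

    Assumed variant: (term count / document length) * log(N / df),
    where df is the number of documents that contain the term.
    """
    n_docs = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    scores: dict[str, dict[str, float]] = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        scores[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return scores


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # 3
    print(ngrams("the quick brown fox", 2))
    print(tf_idf({"d1": ["a", "b"], "d2": ["a", "c"]}))
```

In this sketch a term that appears in every document scores 0, because `log(N / df)` is 0 when `df` equals `N`. This is the sense in which TF-IDF weights a term's frequency in a document against its frequency across the whole set.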