1.0 - 8.00 - Latent Dirichlet Allocation (LDA) Functions - Teradata Vantage

Teradata® Vantage Machine Learning Engine Analytic Function Reference

Teradata Vantage
Release Number
Release Date: May 2019
Content Type: Programming Reference
Publication ID
English (United States)
Function          Description
LDA               Uses training data and parameters to build topic model.
LDAInference      Uses topic model to estimate topic distribution in document set.
LDATopicSummary   Displays readable information from topic model.

Topic modeling, which is useful in text analysis, assumes that a document consists of multiple abstract topics, each with a corresponding probability. Each topic emits words according to its own probability distribution. That is, each word in a document is generated by a topic with a probability determined by that topic, and the probability of each topic is determined by the document.

In the model, a document can contain many topics. For example, a document might contain the words "rainy" and "sunny," which are related to weather, and "basketball" and "football," which are related to sports. If 20% of the document is about weather and the remainder is about sports, it probably contains about four times as many sports-related words as weather-related words. Topic modeling uses a statistical framework to recover these latent factors (the topics) from the observed words.
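The weather and sports example can be simulated with a minimal Python sketch. This is an illustration of the generative assumption only, not the Teradata implementation: the `TOPICS` dictionary, the uniform word probabilities, and the helper names `generate_document` and `topic_word_counts` are all hypothetical.

```python
import random
from collections import Counter

# Hypothetical toy topics: each topic is a probability distribution over words.
TOPICS = {
    "weather": {"rainy": 0.5, "sunny": 0.5},
    "sports": {"basketball": 0.5, "football": 0.5},
}

def generate_document(topic_mix, n_words, rng):
    """Draw n_words by picking a topic for each word, then a word from that topic."""
    topic_names = list(topic_mix)
    topic_weights = [topic_mix[t] for t in topic_names]
    words = []
    for _ in range(n_words):
        topic = rng.choices(topic_names, weights=topic_weights)[0]
        vocab = list(TOPICS[topic])
        word_weights = [TOPICS[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=word_weights)[0])
    return words

def topic_word_counts(words):
    """Count how many generated words belong to each topic's vocabulary."""
    counts = Counter()
    for word in words:
        for topic, dist in TOPICS.items():
            if word in dist:
                counts[topic] += 1
    return counts

rng = random.Random(0)  # fixed seed so the run is repeatable
# A document that is 20% weather and 80% sports.
doc = generate_document({"weather": 0.2, "sports": 0.8}, 10_000, rng)
counts = topic_word_counts(doc)
ratio = counts["sports"] / counts["weather"]
print(counts, round(ratio, 2))
```

With a long enough document, the printed ratio should land close to 4, matching the 80/20 split. A topic model works in the opposite direction: it starts from the words and estimates the topic mix.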

Latent Dirichlet Allocation (LDA) is a well-known generative model that was introduced in the article Latent Dirichlet Allocation (http://www.cs.columbia.edu/~blei/papers/BleiNgJordan2003.pdf). In an LDA model, the term-topic probabilities and the topic-document probabilities are modeled with Dirichlet distributions.
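To make the role of the Dirichlet distribution concrete, the following Python sketch draws one set of topic-document probabilities and one set of term-topic probabilities. It assumes symmetric concentration values named `alpha` and `beta`; those names, the sizes, and the helper `sample_dirichlet` are illustrative choices and are not the arguments of the Teradata functions.

```python
import random

def sample_dirichlet(concentration, rng):
    """Draw one probability vector from a Dirichlet distribution.

    Uses the standard construction: normalize independent Gamma(a_i, 1) draws.
    """
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(42)       # fixed seed so the run is repeatable
n_topics, vocab_size = 3, 5   # illustrative sizes
alpha, beta = 0.5, 0.5        # illustrative symmetric concentration values

# Topic-document probabilities for one document.
theta = sample_dirichlet([alpha] * n_topics, rng)
# Term-topic probabilities: one distribution over the vocabulary per topic.
phi = [sample_dirichlet([beta] * vocab_size, rng) for _ in range(n_topics)]

print([round(p, 3) for p in theta])
```

Each sampled vector is non-negative and sums to 1, so it can serve directly as a probability distribution over topics (`theta`) or over words (each row of `phi`).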

Because LDA is a Bayesian method, its main advantage is that it is less susceptible to overfitting and works well for smaller data sets. LDA has been successfully used in text modeling, content-based image retrieval, and bioinformatics.