Latent Dirichlet Allocation (LDA) Functions (ML Engine) - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product: Teradata Vantage
Release Number: 9.02, 9.01, 2.0, 1.3
Published: February 2022
Language: English (United States)
Last Update: 2022-02-10
Product Category: Teradata Vantage™
Function                     Description
LDA (ML Engine)              Uses training data and parameters to build a topic model.
LDAInference (ML Engine)     Uses a topic model to estimate the topic distribution in a document set.
LDATopicSummary (ML Engine)  Displays readable information from a topic model.

Topic modeling, which is useful in text analysis, assumes that a document is a mixture of abstract topics, each with a corresponding probability. Each topic, in turn, emits words with topic-specific probabilities. That is, each word in a document is generated by a topic with a probability determined by that topic, and the probability of the topic is determined by the document.

In the model, a document can contain many topics. For example, the words "rainy" and "sunny" are related to weather, and the words "basketball" and "football" are related to sports. If 20% of a document is about weather and the remainder is about sports, the document probably contains about four times as many sports-related words as weather-related words. Topic modeling uses a statistical framework to recover these latent factors.
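The weather and sports example can be worked through numerically. The following is a minimal sketch, not ML Engine code: the two topics, their uniform word probabilities, and the helper name word_probability are assumptions chosen for illustration. It computes the probability of each word in the document as the sum, over topics, of the word's probability in the topic times the topic's probability in the document.

```python
# Hypothetical toy model: two topics, each a probability distribution over words.
# The uniform 0.5 values are illustrative assumptions, not output of any function.
topic_word = {
    "weather": {"rainy": 0.5, "sunny": 0.5},
    "sports": {"basketball": 0.5, "football": 0.5},
}

# Topic mixture for one document: 20% weather, 80% sports.
doc_topic = {"weather": 0.2, "sports": 0.8}


def word_probability(word, doc_topic, topic_word):
    """P(word | document) = sum over topics of P(word | topic) * P(topic | document)."""
    return sum(
        weight * topic_word[topic].get(word, 0.0)
        for topic, weight in doc_topic.items()
    )


vocabulary = sorted({w for words in topic_word.values() for w in words})
word_probs = {w: word_probability(w, doc_topic, topic_word) for w in vocabulary}

# Total probability mass of the words belonging to each topic.
weather_mass = word_probs["rainy"] + word_probs["sunny"]
sports_mass = word_probs["basketball"] + word_probs["football"]
ratio = sports_mass / weather_mass

print(word_probs)
print("sports-to-weather word ratio:", ratio)
```

Under these assumptions, the sports-related words carry about four times the probability mass of the weather-related words, which matches the 80/20 split described above.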

Latent Dirichlet Allocation (LDA) is a well-known generative model that was introduced in the article Latent Dirichlet Allocation (http://www.cs.columbia.edu/~blei/papers/BleiNgJordan2003.pdf). In an LDA model, the term-topic probabilities and the topic-document probabilities are each modeled with a Dirichlet distribution.
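The generative view can be illustrated with a short simulation. This is a sketch of the general LDA generative process using only the Python standard library, not the ML Engine implementation. The function names, the symmetric concentration parameters alpha and beta, and the toy vocabulary are all assumptions made for the example. A Dirichlet sample is obtained by normalizing independent Gamma draws.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocabulary, num_topics, num_docs, doc_length, alpha, beta, seed=0):
    """Generate documents from an LDA-style model with symmetric Dirichlet priors."""
    rng = random.Random(seed)
    # Term-topic probabilities: one distribution over the vocabulary per topic.
    term_topic = [
        sample_dirichlet([beta] * len(vocabulary), rng) for _ in range(num_topics)
    ]
    topic_doc = []
    corpus = []
    for _ in range(num_docs):
        # Topic-document probabilities: one distribution over topics per document.
        mixture = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_length):
            # Pick a topic for this word position, then a word from that topic.
            topic = rng.choices(range(num_topics), weights=mixture)[0]
            words.append(rng.choices(vocabulary, weights=term_topic[topic])[0])
        topic_doc.append(mixture)
        corpus.append(words)
    return term_topic, topic_doc, corpus


toy_vocabulary = ["rainy", "sunny", "basketball", "football"]
term_topic, topic_doc, corpus = generate_corpus(
    toy_vocabulary, num_topics=2, num_docs=3, doc_length=10, alpha=0.5, beta=0.5
)
for mixture, words in zip(topic_doc, corpus):
    print([round(p, 2) for p in mixture], " ".join(words))
```

Fitting an LDA model runs this process in reverse: given only the documents, it estimates the term-topic and topic-document probabilities that most plausibly produced them.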

Because LDA is a Bayesian method, its main advantage is that it is less susceptible to overfitting and works well on smaller data sets. LDA has been successfully used in text modeling, content-based image retrieval, and bioinformatics.