Latent Dirichlet Allocation (LDA) Functions (ML Engine)

Machine Learning Engine Analytic Function Reference

Product: Teradata Vantage
Release Number: 8.10, 1.1
Published: October 2019
Language: English (United States)
Last Update: 2019-12-31

dita:id

B700-4003

Function	Description
LDA (ML Engine)	Uses training data and parameters to build a topic model.
LDAInference (ML Engine)	Uses a topic model to estimate the topic distribution in a document set.
LDATopicSummary (ML Engine)	Displays readable information from a topic model.

Topic modeling, which is useful in text analysis, assumes that a document consists of multiple abstract topics, each with a corresponding probability. Each topic emits words with specific probabilities. That is, each word in a document is generated by a topic with a probability determined by that topic, and the probability of the topic is determined by the document.
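The relationship between the two kinds of probability can be illustrated with a small calculation. The sketch below is not part of the ML Engine functions; the topic names, words, and numbers are hypothetical values chosen only to show that the probability of a word in a document is the sum, over topics, of P(word | topic) times P(topic | document).

```python
# Hypothetical topic probabilities for one document: P(topic | document).
topic_given_doc = {"topic_0": 0.3, "topic_1": 0.7}

# Hypothetical word probabilities for each topic: P(word | topic).
# Each inner dictionary sums to 1.
word_given_topic = {
    "topic_0": {"w1": 0.6, "w2": 0.3, "w3": 0.1},
    "topic_1": {"w1": 0.1, "w2": 0.2, "w3": 0.7},
}

vocabulary = ["w1", "w2", "w3"]


def word_probability(word):
    """Return P(word | document) as a mixture over the document's topics."""
    return sum(
        word_given_topic[topic][word] * p_topic
        for topic, p_topic in topic_given_doc.items()
    )


# Probability of each vocabulary word appearing in this document.
word_probs = {word: word_probability(word) for word in vocabulary}
print(word_probs)
```

Because the topic probabilities and each topic's word probabilities sum to 1, the resulting word probabilities for the document also sum to 1.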

In the model, a document can contain many topics. For example, a document might contain the words "rainy" and "sunny," which are related to weather, and "basketball" and "football," which are related to sports. If 20% of the document is about weather and the remainder is about sports, it probably contains about 4 times as many sports-related words as weather-related words. Topic modeling uses a statistical framework to recover these latent factors from the observed words.
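The 20% weather, 80% sports example can be checked with a short simulation. This is a minimal sketch, not the ML Engine implementation: it assumes each topic emits its two words with equal probability, and the seed and document length are arbitrary choices.

```python
import random
from collections import Counter

rng = random.Random(42)  # fixed seed so the run is repeatable

# Words from the example, grouped by topic.
topic_words = {
    "weather": ["rainy", "sunny"],
    "sports": ["basketball", "football"],
}
topic_names = ["weather", "sports"]
topic_weights = [0.2, 0.8]  # 20% weather, 80% sports

num_words = 10_000  # arbitrary document length for the simulation
word_counts = Counter()
for _ in range(num_words):
    # Pick a topic according to the document's topic probabilities,
    topic = rng.choices(topic_names, weights=topic_weights)[0]
    # then pick a word from that topic (uniformly, which is an assumption).
    word_counts[rng.choice(topic_words[topic])] += 1

weather_count = sum(word_counts[w] for w in topic_words["weather"])
sports_count = sum(word_counts[w] for w in topic_words["sports"])
ratio = sports_count / weather_count
print(weather_count, sports_count, round(ratio, 2))
```

With this many sampled words, the ratio of sports-related to weather-related words should land close to 4, though the exact value varies with the seed.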

Latent Dirichlet Allocation (LDA) is a well-known generative model that was introduced in the article "Latent Dirichlet Allocation" (http://www.cs.columbia.edu/~blei/papers/BleiNgJordan2003.pdf). In an LDA model, the term-topic probabilities and the topic-document probabilities are each modeled with a Dirichlet distribution.
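To make the role of the Dirichlet distribution concrete, the following sketch draws one set of topic probabilities for a document and one set of term probabilities for each topic. It is an illustration only: the concentration values, topic count, and vocabulary size are hypothetical, and the sampler uses the standard technique of normalizing independent gamma draws rather than any ML Engine function.

```python
import random


def sample_dirichlet(concentration, rng):
    """Draw one probability vector from a Dirichlet distribution.

    Normalizing independent Gamma(a_i, 1) draws yields a
    Dirichlet(a_1, ..., a_n) sample.
    """
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)  # fixed seed so the run is repeatable

num_topics = 3        # hypothetical number of topics
vocabulary_size = 5   # hypothetical number of distinct terms
doc_concentration = 0.5   # hypothetical prior strength for topics per document
term_concentration = 0.5  # hypothetical prior strength for terms per topic

# Topic probabilities for one document: a vector of length num_topics.
doc_topics = sample_dirichlet([doc_concentration] * num_topics, rng)

# Term probabilities for each topic: one vector of length vocabulary_size per topic.
topic_terms = [
    sample_dirichlet([term_concentration] * vocabulary_size, rng)
    for _ in range(num_topics)
]

print(doc_topics)
for row in topic_terms:
    print(row)
```

Every sampled vector is a valid probability distribution: its entries are non-negative and sum to 1. Smaller concentration values tend to produce sparser vectors, in which a few topics or terms carry most of the probability.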

Because LDA is a Bayesian method, its main advantage is that it is less susceptible to overfitting and works well on smaller data sets. LDA has been used successfully in text modeling, content-based image retrieval, and bioinformatics.