| Function | Description |
|---|---|
| LDA | Uses training data and parameters to build a topic model. |
| LDAInference | Uses the topic model to estimate the topic distribution of each document in a document set. |
| LDATopicSummary | Displays human-readable information from the topic model. |
Topic modeling, which is useful in text analysis, assumes that a document is a mixture of multiple abstract topics, each with a corresponding probability. Each topic emits words with topic-specific probabilities. That is, each word in a document is generated by a topic with a probability determined by that topic, and the probability of the topic itself is determined by the document.
In the model, a document can contain many topics. For example, the words "rainy" and "sunny" are related to weather, and "basketball" and "football" are related to sports. If 20% of a document is about weather and the remainder is about sports, the document probably contains about four times as many sports-related words as weather-related words. Topic modeling uses a statistical framework to recover these latent factors.
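The weather and sports example can be simulated with a minimal sketch of the generative assumption: for each word position, a topic is drawn from the document's topic mixture, and a word is then drawn from that topic. The two toy topics, their uniform word probabilities, the document length, and all function names here are illustrative assumptions, not part of the LDA functions listed above.

```python
import random
from collections import Counter

# Hypothetical toy topics: each maps words to emission probabilities.
TOPICS = {
    "weather": {"rainy": 0.5, "sunny": 0.5},
    "sports": {"basketball": 0.5, "football": 0.5},
}

# Assumed topic mixture for one document: 20% weather, 80% sports.
DOC_TOPIC_PROBS = {"weather": 0.2, "sports": 0.8}


def generate_document(num_words, rng):
    """Draw a topic for each word position, then draw a word from that topic."""
    topic_names = list(DOC_TOPIC_PROBS)
    topic_weights = list(DOC_TOPIC_PROBS.values())
    words = []
    for _ in range(num_words):
        topic = rng.choices(topic_names, weights=topic_weights)[0]
        vocab = list(TOPICS[topic])
        word_weights = list(TOPICS[topic].values())
        words.append(rng.choices(vocab, weights=word_weights)[0])
    return words


def topic_word_counts(words):
    """Count how many generated words belong to each toy topic."""
    counts = Counter()
    for word in words:
        for topic, vocab in TOPICS.items():
            if word in vocab:
                counts[topic] += 1
    return counts


rng = random.Random(0)  # fixed seed so runs are repeatable
document = generate_document(10_000, rng)
counts = topic_word_counts(document)
ratio = counts["sports"] / counts["weather"]
print(counts, round(ratio, 2))
```

With a long enough document, the printed ratio of sports words to weather words should land near 4, matching the 80% to 20% split. A real topic model works in the opposite direction: it observes only the words and infers the topics and mixtures.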
Latent Dirichlet Allocation (LDA) is a well-known generative model that was introduced in the article Latent Dirichlet Allocation (http://www.cs.columbia.edu/~blei/papers/BleiNgJordan2003.pdf). In an LDA model, the term-topic probabilities and the topic-document probabilities are modeled with Dirichlet distributions.
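For reference, the generative process is commonly written as follows. This is a sketch of the widely used smoothed formulation, not necessarily the exact parameterization used by the LDA function. The symbols are assumptions introduced here: $K$ topics, $D$ documents, $N_d$ words in document $d$, $\theta_d$ for the topic-document probabilities, $\phi_k$ for the term-topic probabilities, and $\alpha$ and $\beta$ for the Dirichlet hyperparameters.

```latex
\begin{align*}
\phi_k   &\sim \mathrm{Dirichlet}(\beta),  & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d), & n &= 1, \dots, N_d \\
w_{d,n} \mid z_{d,n}, \phi &\sim \mathrm{Categorical}(\phi_{z_{d,n}}), & n &= 1, \dots, N_d
\end{align*}
```

Here $z_{d,n}$ is the latent topic assigned to the $n$-th word of document $d$, and $w_{d,n}$ is the observed word.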
Because LDA is a Bayesian method, its main advantage is that it is less susceptible to overfitting and works well on smaller data sets. LDA has been used successfully in text modeling, content-based image retrieval, and bioinformatics.