Background - Aster Analytics

Teradata Aster® Analytics Foundation User Guide Update 2

Product: Aster Analytics
Release Number: 7.00.02
Published: September 2017
Language: English (United States)
Last Update: 2018-04-17

Product Category: Software

Topic modeling, which is useful in text analysis, assumes that a document is a mixture of multiple abstract topics, each present with some probability. Each topic, in turn, emits words from a vocabulary according to its own probability distribution. That is, each word in a document is generated by first choosing a topic according to the document's topic probabilities, and then choosing a word according to that topic's word probabilities.
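The two-step generation described above can be written as a single mixture formula. This is a sketch in notation that does not appear in the original text: w is a word, d is a document, z is the latent topic, and K is the (assumed) number of topics.

```latex
% Probability of word w in document d, summed over the K latent topics.
% p(w | z = k): probability that topic k emits word w
% p(z = k | d): probability of topic k in document d
p(w \mid d) \;=\; \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d)
```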

In the model, a document can contain many topics. For example, the words "rainy" and "sunny" are related to weather, and the words "basketball" and "football" are related to sports. If 20% of a document is about weather and the remainder is about sports, the document probably contains about four times as many sports-related words as weather-related words. Topic modeling uses a statistical framework to recover these latent factors from the observed words.
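The weather and sports example can be simulated directly. The following is a minimal sketch, not part of Aster Analytics: the topic names, the uniform word probabilities within each topic, and the helper functions generate_document and topic_counts are all illustrative assumptions.

```python
import random
from collections import Counter

# Hypothetical topics: each topic maps its words to emission probabilities.
TOPICS = {
    "weather": {"rainy": 0.5, "sunny": 0.5},
    "sports": {"basketball": 0.5, "football": 0.5},
}


def generate_document(topic_mix, n_words, seed=0):
    """Generate words by picking a topic per word, then a word from that topic."""
    rng = random.Random(seed)
    topic_names = list(topic_mix)
    topic_weights = [topic_mix[name] for name in topic_names]
    words = []
    for _ in range(n_words):
        # Step 1: choose a topic according to the document's topic probabilities.
        topic = rng.choices(topic_names, weights=topic_weights)[0]
        # Step 2: choose a word according to that topic's word probabilities.
        vocab = TOPICS[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()))[0]
        words.append(word)
    return words


def topic_counts(words):
    """Count how many generated words belong to each topic."""
    word_to_topic = {w: t for t, vocab in TOPICS.items() for w in vocab}
    return Counter(word_to_topic[w] for w in words)


# A document that is 20% weather and 80% sports.
doc = generate_document({"weather": 0.2, "sports": 0.8}, n_words=10000)
counts = topic_counts(doc)
print(counts["sports"] / counts["weather"])  # ratio should be close to 4
```

Topic modeling works in the opposite direction: it starts from the words alone and estimates the topic mix and the per-topic word probabilities that most plausibly produced them.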

Latent Dirichlet Allocation (LDA) is a well-known generative topic model that was introduced in the article Latent Dirichlet Allocation by Blei, Ng, and Jordan (2003) (http://www.cs.columbia.edu/~blei/papers/BleiNgJordan2003.pdf). In an LDA model, both the term-topic probabilities and the topic-document probabilities are given Dirichlet prior distributions.
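To make the role of the Dirichlet distribution concrete, the sketch below draws one set of topic probabilities for a document and one set of term probabilities for each topic. It uses the standard construction of a Dirichlet sample as normalized Gamma draws. The function name sample_dirichlet, the hyperparameter value 0.5, and the sizes (3 topics, 5 terms) are assumptions chosen for illustration, not values taken from the Aster implementation.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(42)
n_topics, vocab_size = 3, 5  # hypothetical sizes

# Topic-document probabilities: how much of one document each topic accounts for.
doc_topic = sample_dirichlet([0.5] * n_topics, rng)

# Term-topic probabilities: one distribution over the vocabulary per topic.
topic_term = [sample_dirichlet([0.5] * vocab_size, rng) for _ in range(n_topics)]

print(doc_topic)
for k, dist in enumerate(topic_term):
    print(k, dist)
```

Each sampled vector is non-negative and sums to 1, so it can be used directly as a probability distribution. Hyperparameter values below 1 tend to concentrate the mass on a few entries, which matches the intuition that a document covers only a few topics.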

Because LDA is a Bayesian method, its main advantage is that it is less susceptible to overfitting and works well on smaller datasets. LDA has been used successfully in text modeling, content-based image retrieval, and bioinformatics.