• Corpus ID: 6527691

Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis

@inproceedings{Tang2014UnderstandingTL,
  title={Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis},
  author={Jian Tang and Zhaoshi Meng and XuanLong Nguyen and Qiaozhu Mei and Ming Zhang},
  booktitle={ICML},
  year={2014}
}
Topic models such as the latent Dirichlet allocation (LDA) have become a standard staple in the modeling toolbox of machine learning. They have been applied to a vast variety of data sets, contexts, and tasks to varying degrees of success. However, to date there is almost no formal theory explicating the LDA's behavior, and despite its familiarity there is very little systematic analysis of and guidance on the properties of the data that affect the inferential performance of the model. This… 
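
For readers unfamiliar with the model, LDA assumes a simple generative process whose parameters the posterior is meant to recover; the short numpy sketch below simulates that process (vocabulary size, corpus size, and hyperparameters are illustrative choices, not settings from the paper):

    import numpy as np

    rng = np.random.default_rng(0)
    V, K, D, N = 1000, 10, 200, 50      # vocabulary, topics, documents, words per document (illustrative)
    alpha, eta = 0.1, 0.01              # Dirichlet hyperparameters (assumed values)

    beta = rng.dirichlet(np.full(V, eta), size=K)            # one word distribution per topic
    docs = []
    for _ in range(D):
        theta = rng.dirichlet(np.full(K, alpha))             # per-document topic proportions
        z = rng.choice(K, size=N, p=theta)                   # latent topic assignment for each word
        w = np.array([rng.choice(V, p=beta[k]) for k in z])  # observed words
        docs.append(w)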

Citations

Model Selection for Topic Models via Spectral Decomposition
TLDR
This work derives upper and lower bounds on the number of topics given a text collection of finite size under mild conditions, and shows that its methodology can be easily generalized to model selection analysis for other latent models.
Guaranteed inference in topic models
TLDR
This paper introduces a provably fast algorithm, namely Online Maximum a Posteriori Estimation (OPE), for posterior inference in topic models, and employs OPE to design three methods for learning Latent Dirichlet Allocation from text streams or large corpora.
Examining the Coherence of the Top Ranked Tweet Topics
TLDR
Evidence is found that Twitter LDA outperforms both standard LDA and the tweet-pooling method because the top-ranked topics it generates are more coherent; it is demonstrated that a larger number of topics helps to generate more coherent topics; and coherence at n is shown to be more effective than the average coherence score when evaluating the coherence of a topic model.
Inference for the Number of Topics in the Latent Dirichlet Allocation Model via Bayesian Mixture Modeling
  • Zhe Chen, Hani Doss
  • Computer Science
    Journal of Computational and Graphical Statistics
  • 2019
TLDR
A variant of the Metropolis–Hastings algorithm is presented that can be used to estimate the posterior distribution of the number of topics; it is evaluated on synthetic data and compared with procedures currently used in the machine learning literature.
Most large topic models are approximately separable
TLDR
It is proved that when the columns of the topic matrix are independently sampled from a Dirichlet distribution, the resulting topic matrix will be approximately separable with probability tending to one as the number of rows (vocabulary size) scales to infinity sufficiently faster than the number of columns (topics).
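
The separability claim is easy to probe empirically; the following sketch (not from the paper; the sizes, the Dirichlet concentration, and the 0.95 threshold are assumptions for illustration) samples a topic matrix with Dirichlet columns and counts topics that have an approximate anchor word:

    import numpy as np

    rng = np.random.default_rng(0)
    V, K, a = 20000, 20, 0.1            # vocabulary size, topics, Dirichlet concentration (illustrative)

    # Each column (topic) is an independent draw from Dirichlet(a, ..., a) over the vocabulary.
    topics = rng.dirichlet(np.full(V, a), size=K).T      # shape (V, K)

    # Topic k has an approximate anchor word if some word places (say) >95% of its
    # cross-topic probability mass on topic k.
    row_share = topics / topics.sum(axis=1, keepdims=True)
    has_anchor = row_share.max(axis=0) > 0.95
    print(f"{has_anchor.sum()} of {K} topics have an approximate anchor word")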
Dual online inference for latent Dirichlet allocation
TLDR
It is shown that OFW converges to a local optimum, but under certain conditions it can converge to the global optimum, and it can be readily employed to accelerate MAP estimation in a wide class of probabilistic models.
Anchored Correlation Explanation: Topic Modeling with Minimal Domain Knowledge
TLDR
Correlation Explanation is introduced, an alternative approach to topic modeling that does not assume an underlying generative model, and instead learns maximally informative topics through an information-theoretic framework that generalizes to hierarchical and semi-supervised extensions with no additional modeling assumptions.
Topic modeling in marketing: recent advances and research opportunities
TLDR
This work characterizes extant contributions employing topic models in marketing along the dimensions of data structures and retrieval of input data, implementation and extensions of basic topic models, and model performance evaluation, and confirms that considerable progress has been made in various marketing sub-areas.
Optimisation towards Latent Dirichlet Allocation: Its Topic Number and Collapsed Gibbs Sampling Inference Process
TLDR
The results show that the maximum likelihood and MDL approaches yield the same optimal number of topics, and the highest average accuracy is 61% with alpha 0.1 and beta 0.001.
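
For context, the collapsed Gibbs sampler referenced here resamples each word's topic from its conditional given all other assignments (Griffiths & Steyvers, 2004); a minimal sketch of one sweep, with illustrative variable names, is:

    import numpy as np

    def gibbs_sweep(docs, z, ndk, nkw, nk, alpha, beta, rng):
        """One collapsed Gibbs sweep for LDA.

        docs: list of word-id lists; z: parallel topic assignments;
        ndk: (D, K) document-topic counts; nkw: (K, V) topic-word counts; nk: (K,) topic totals.
        """
        K, V = nkw.shape
        for d, words in enumerate(docs):
            for i, w in enumerate(words):
                k = z[d][i]
                # Remove the current assignment from the count tables.
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # p(z = k | rest) is proportional to (ndk + alpha) * (nkw + beta) / (nk + V * beta).
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                # Record the new assignment and restore the counts.
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1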
A Model of Text for Experimentation in the Social Sciences
TLDR
A hierarchical mixed-membership model for analyzing the topical content of documents, in which mixing weights are parameterized by observed covariates, is posited, enabling researchers to introduce elements of the experimental design that informed document collection into the model, within a generally applicable framework.

References

Showing 1-10 of 25 references
Learning Topic Models -- Going beyond SVD
TLDR
This paper formally justifies Nonnegative Matrix Factorization (NMF), an analog of SVD in which all vectors are nonnegative, as a main tool in this context, and gives the first polynomial-time algorithm for learning topic models without the two limitations of earlier SVD-based approaches.
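
The paper's algorithm relies on anchor words rather than a generic NMF solver, but for a sense of what an NMF-based topic decomposition looks like in practice, here is a minimal scikit-learn sketch (a generic solver on a toy corpus, not the provable algorithm from the paper):

    from sklearn.decomposition import NMF
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat on the mat", "dogs and cats are pets", "stocks fell on weak earnings"]  # toy corpus
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(docs)                 # document-term matrix

    nmf = NMF(n_components=2, init="nndsvda", random_state=0)
    W = nmf.fit_transform(X)                    # document-topic weights
    H = nmf.components_                         # topic-word weights, X is approximately W @ H
    vocab = vec.get_feature_names_out()
    for k, row in enumerate(H):
        top = vocab[row.argsort()[::-1][:3]]
        print(f"topic {k}: {', '.join(top)}")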
Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation
TLDR
It is proved that the difference in the tightness of the bound on the likelihood of a document decreases as O(k - 1) + √log m/m, where k is the number of topics in the model and m is the number of words in a document, and the advantage of CVB over VB is lost for long documents but increases with the number of topics.
Finding scientific topics
  • T. Griffiths, M. Steyvers
  • Computer Science
    Proceedings of the National Academy of Sciences of the United States of America
  • 2004
TLDR
A generative model for documents, introduced by Blei, Ng, and Jordan, is described, and a Markov chain Monte Carlo algorithm is presented for inference in this model; the algorithm is used to analyze abstracts from PNAS, applying Bayesian model selection to establish the number of topics.
Rethinking LDA: Why Priors Matter
TLDR
The prior structure advocated substantially increases the robustness of topic models to variations in the number of topics and to the highly skewed word frequency distributions common in natural language.
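
In most current toolkits this recommendation amounts to a one-line change; for example, a minimal gensim sketch (toy corpus and parameter values are illustrative) using an asymmetric document-topic prior is:

    from gensim import corpora
    from gensim.models import LdaModel

    texts = [["topic", "models", "need", "priors"], ["priors", "matter", "for", "robustness"]]  # toy corpus
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]

    # Asymmetric prior over document-topic proportions, as advocated in the paper;
    # the topic-word prior is left at gensim's symmetric default.
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
                   alpha="asymmetric", passes=10, random_state=0)
    print(lda.print_topics())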
Improving Topic Coherence with Regularized Topic Models
TLDR
This work proposes two methods to regularize the learning of topic models by creating a structured prior over words that reflect broad patterns in the external data that make topic models more useful across a broader range of text data.
Latent Dirichlet Allocation
Topic-link LDA: joint models of topic and author community
TLDR
A Bayesian hierarchical approach is developed that performs topic modeling and author community discovery in one unified framework; it is demonstrated on two blog data sets in different domains and on one research paper citation data set from CiteSeer.
Evaluating topic models for digital libraries
TLDR
This large-scale user study includes over 70 human subjects evaluating and scoring almost 500 topics learned from collections spanning a wide range of genres and domains, and shows how a scoring model based on pointwise mutual information of word pairs, using Wikipedia, Google, and MEDLINE as external data sources, performs well at predicting human scores.
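
As a sketch of the kind of scoring model described (the helper and argument names are hypothetical; the reference corpus statistics would come from Wikipedia, Google, or MEDLINE as in the paper), a topic's coherence can be taken as the average pointwise mutual information over its top word pairs:

    import itertools
    import math

    def pmi_coherence(top_words, doc_freq, co_freq, n_docs, eps=1e-12):
        """Average PMI over all pairs of a topic's top words.

        doc_freq[w]: number of reference documents containing word w
        co_freq[(w1, w2)]: number of reference documents containing both words
        """
        scores = []
        for w1, w2 in itertools.combinations(top_words, 2):
            p1 = doc_freq.get(w1, 0) / n_docs
            p2 = doc_freq.get(w2, 0) / n_docs
            p12 = co_freq.get((w1, w2), 0) / n_docs
            scores.append(math.log((p12 + eps) / (p1 * p2 + eps)))
        return sum(scores) / len(scores)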
Empirical study of topic modeling in Twitter
TLDR
It is shown that by training a topic model on aggregated messages one can obtain a higher-quality learned model, which results in significantly better performance on two real-world classification problems.
Organizing the OCA: learning faceted subjects from a library of digital books
TLDR
DCM-LDA, a topic model based on Dirichlet Compound Multinomial distributions, is presented, which is simultaneously better able to represent observed properties of text and more scalable to extremely large text collections.