Learning Topic Models -- Going beyond SVD

Sanjeev Arora, Rong Ge, Ankur Moitra
2012 IEEE 53rd Annual Symposium on Foundations of Computer Science

Topic Modeling is an approach used for automatic comprehension and classification of data in a variety of settings, and perhaps the canonical application is in uncovering thematic structure in a corpus of documents. A number of foundational works both in machine learning and in theory have suggested a probabilistic model for documents, whereby documents arise as a convex combination of (i.e. distribution on) a small number of topic vectors, each topic vector being a distribution on words (i.e… 
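
The probabilistic model in the abstract can be illustrated with a minimal sketch. It assumes a toy vocabulary of four words and two topics; the function name `sample_document` and all numeric values are hypothetical and chosen only for illustration, not taken from the paper.

```python
import random

def sample_document(topics, weights, n_words, rng):
    """Sample word indices for one document from a mixture of topics.

    topics  : list of topic vectors, each a distribution over the vocabulary
    weights : distribution over topics (the document's convex combination)
    """
    vocab_size = len(topics[0])
    # The document's word distribution is the convex combination of topic vectors.
    word_dist = [
        sum(w * topic[j] for w, topic in zip(weights, topics))
        for j in range(vocab_size)
    ]
    # Draw each word independently from that distribution.
    words = rng.choices(range(vocab_size), weights=word_dist, k=n_words)
    return word_dist, words

# Hypothetical toy setup: 2 topics over a 4-word vocabulary.
topics = [
    [0.5, 0.5, 0.0, 0.0],  # topic 0 puts its mass on words 0 and 1
    [0.0, 0.0, 0.5, 0.5],  # topic 1 puts its mass on words 2 and 3
]
weights = [0.75, 0.25]     # this document is mostly about topic 0

rng = random.Random(0)
word_dist, words = sample_document(topics, weights, n_words=20, rng=rng)
print(word_dist)  # the mixture is itself a distribution on words
print(words)
```

Learning a topic model is the inverse task: recovering the topic vectors from many documents sampled this way.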

An Embedding-Based Topic Model for Document Classification
This article presents a two-stage algorithm for topic modelling that leverages word embeddings and word co-occurrence and demonstrates the remarkable comparative effectiveness of the proposed algorithm in a task of document classification.
A Spectral Algorithm for Latent Dirichlet Allocation
This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of multi-view models and topic models, including latent Dirichlet allocation (LDA).
Generalized Topic Modeling
This work aims to learn a predictor that given a new document, accurately predicts its topic mixture, without learning the distributions explicitly, and can be viewed as a generalization of the multi-view or co-training setting in machine learning.
Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis
This paper presents theorems elucidating the posterior contraction rates of the topics as the amount of data increases, and a thorough supporting empirical study using synthetic and real data sets, including news and web-based articles and tweet messages.
Assigning Topics to Documents by Successive Projections
One of the conclusions is that the error of the algorithm grows at most logarithmically with the size of the dictionary, in contrast to what one observes for Latent Dirichlet Allocation.
How Many Topics? Stability Analysis for Topic Models
Using a topic modeling approach based on matrix factorization, evaluations performed on a range of corpora show that this strategy can successfully guide the model selection process.
Performance of LDA and DCT models
The Doubly Correlated Topic Model uses the highest-ranked co-occurring words as initial topics in its posterior inference, rather than obtaining them from Dirichlet priors, suggesting some improved performance on entropy and topical coherence over different datasets.
Stability of topic modeling via matrix factorization


Two SVDs Suffice: Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation
A simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of mixture models, including the popular latent Dirichlet allocation (LDA) model, based on a spectral decomposition of low order moments via two singular value decompositions (SVDs).
A correlated topic model of Science
The correlated topic model (CTM) is developed, where the topic proportions exhibit correlation via the logistic normal distribution, and its use as an exploratory tool for large document collections is demonstrated.
Introduction to Probabilistic Topic Models
The main ideas of this field are reviewed, the current state of the art and the growing body of research that extends and applies topic models in interesting ways are surveyed, and some promising future directions are described.
Probabilistic topic models
D. Blei. Commun. ACM, 2010.
Surveying a suite of algorithms that offer a solution to managing large document archives suggests they are well-suited to handle large amounts of data.
Latent Dirichlet Allocation
Pachinko allocation: DAG-structured mixture models of topic correlations
Improved performance of PAM is shown in document classification, likelihood of held-out data, the ability to support finer-grained topics, and topical keyword coherence.
Complexity of Inference in Latent Dirichlet Allocation
This work studies the problem of finding the maximum a posteriori (MAP) assignment of topics to words, where the document's topic distribution is integrated out, and shows that exact inference takes polynomial time when the effective number of topics per document is small, but that the problem is NP-hard in general.
Dynamic topic models
A family of probabilistic time series models is developed to analyze the time evolution of topics in large document collections, and dynamic topic models provide a qualitative window into the contents of a large document collection.
Probabilistic Latent Semantic Analysis
This work proposes a widely applicable generalization of maximum likelihood model fitting by tempered EM, based on a mixture decomposition derived from a latent class model, which yields a more principled approach with a solid foundation in statistics.
Spectral analysis of data
A model for framing data mining tasks and a unified approach to solving the resulting data mining problems using spectral analysis are presented, which give strong justification to the use of spectral techniques for latent semantic indexing, collaborative filtering, and web site ranking.