Reducing the sampling complexity of topic models

@article{Li2014ReducingTS,
  title={Reducing the sampling complexity of topic models},
  author={Aaron Q. Li and Amr Ahmed and Sujith Ravi and Alex Smola},
  journal={Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining},
  year={2014}
}
  • Aaron Q. Li, Amr Ahmed, Sujith Ravi, Alex Smola
  • Published 24 August 2014
  • Computer Science
  • Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining
Inference in topic models typically involves a sampling step to associate latent variables with observations. Unfortunately the generative model loses sparsity as the amount of data increases, requiring O(k) operations per word for k topics. In this paper we propose an algorithm which scales linearly with the number of actually instantiated topics k_d in the document. For large document collections and in structured hierarchical models k_d ≪ k. This yields an order of magnitude speedup. Our…
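The O(k_d) sampler described in the abstract splits the per-word sampling distribution into a sparse document-topic term and a dense word-topic term, drawing from the dense term via a precomputed alias table and correcting for its staleness with Metropolis–Hastings. As a grounded building block only, the sketch below shows Walker's alias method (O(k) construction, O(1) draws) in Python; the class name, structure, and usage weights are illustrative and not taken from the paper or any released implementation.

```python
import random

class AliasTable:
    """Walker's alias method: O(k) setup, O(1) sampling from a fixed
    discrete distribution. Alias-style samplers use this to draw cheaply
    from a (possibly stale) dense proposal distribution."""

    def __init__(self, weights):
        k = len(weights)
        total = sum(weights)
        scaled = [w * k / total for w in weights]  # rescale so the mean weight is 1
        self.k = k
        self.accept = [1.0] * k          # probability of keeping the bucket's own index
        self.alias = list(range(k))      # fallback index for each bucket

        small = [i for i, p in enumerate(scaled) if p < 1.0]
        large = [i for i, p in enumerate(scaled) if p >= 1.0]
        while small and large:
            s, l = small.pop(), large.pop()
            self.accept[s] = scaled[s]
            self.alias[s] = l
            scaled[l] -= 1.0 - scaled[s]  # the large bucket donates the missing mass
            (small if scaled[l] < 1.0 else large).append(l)
        # any leftover buckets have weight ~1 up to rounding; accept stays 1.0

    def draw(self):
        i = random.randrange(self.k)
        return i if random.random() < self.accept[i] else self.alias[i]

# Hypothetical usage: O(1) draws from a dense word-topic proposal.
word_topic_weights = [0.05, 0.7, 0.05, 0.15, 0.05]
table = AliasTable(word_topic_weights)
samples = [table.draw() for _ in range(10)]
```

In the full sampler, each draw from such a table would be accepted or rejected with a Metropolis–Hastings step, so that the occasionally rebuilt (hence stale) table still yields samples from the exact conditional; amortized over many draws, the per-word cost stays close to the number of instantiated topics k_d.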


Scalable Collapsed Inference for High-Dimensional Topic Models
TLDR
This paper develops an online inference algorithm for topic models which leverages stochasticity to scale well in the number of documents, sparsity to scale well in the number of topics, and which operates in the collapsed representation of the topic model for improved accuracy and run-time performance.
Linear Time Samplers for Supervised Topic Models using Compositional Proposals
TLDR
This work extends the recent sampling advances for unsupervised LDA models to supervised tasks and focuses on the Gibbs MedLDA model that is able to simultaneously discover latent structures and make accurate predictions, and is believed to be the first linear time sampling algorithm for supervised topic models.
Efficient Correlated Topic Modeling with Topic Embedding
TLDR
A new model which learns compact topic embeddings and captures topic correlations through the closeness between the topic vectors is proposed, enabling efficient inference in the low-dimensional embedding space.
Efficient Methods for Inferring Large Sparse Topic Hierarchies
TLDR
This paper introduces efficient methods for inferring large topic hierarchies using the Sparse Backoff Tree (SBT), a new prior for latent topic distributions that organizes the latent topics as leaves in a tree and introduces a collapsed sampler for the model that exploits sparsity and the tree structure to make inference efficient.
Sparse Parallel Training for Hierarchical Dirichlet Process Topic Models
TLDR
This work proposes a doubly sparse data-parallel sampler for the HDP topic model that addresses issues of scalability and sparsity in nonparametric extensions of topic models such as Latent Dirichlet Allocation.
Efficient Topic Modeling on Phrases via Sparsity
TLDR
A novel topic model, SparseTP, is proposed, which models words and phrases by linking them in a Markov random field when necessary, provides a well-formed lower bound of the model for Gibbs sampling, and exploits the sparse distribution of words and phrases over topics to speed up inference.
Multi-label classification using stacked hierarchical Dirichlet processes with reduced sampling complexity
TLDR
This work proposes a different proposal distribution for the Metropolis–Hastings (MH) step, based on the observation that distributions at the upper hierarchy level change more slowly than the document-specific distributions at the lower level; this approximation reduces the sampling complexity, making it linear in the number of topics per document.
Scaling up Dynamic Topic Models
TLDR
This paper presents a fast and parallelizable inference algorithm using Gibbs Sampling with Stochastic Gradient Langevin Dynamics that does not make any unwarranted assumptions and is able to learn the largest Dynamic Topic Model to the authors' knowledge.
A Scalable Asynchronous Distributed Algorithm for Topic Modeling
TLDR
It is shown that F+Nomad LDA significantly outperforms recent state-of-the-art topic modeling approaches on massive problems which involve millions of documents, billions of words, and thousands of topics.
Efficient Methods for Incorporating Knowledge into Topic Models
TLDR
This work proposes a factor graph framework, Sparse Constrained LDA (SC-LDA), for efficiently incorporating prior knowledge into LDA, and evaluates its ability to incorporate word correlation knowledge and document label knowledge on three benchmark datasets.

References

SHOWING 1-10 OF 25 REFERENCES
Efficient methods for topic model inference on streaming document collections
TLDR
Empirical results indicate that SparseLDA can be approximately 20 times faster than traditional LDA and provide twice the speedup of previously published fast sampling methods, while also using substantially less memory.
Online Variational Inference for the Hierarchical Dirichlet Process
TLDR
This work proposes an online variational inference algorithm for the HDP, an algorithm that is easily applicable to massive and streaming data, and lets us analyze much larger data sets.
Latent Dirichlet Allocation
Nonparametric Bayes Pachinko Allocation
TLDR
A nonparametric Bayesian prior for PAM is proposed based on a variant of the hierarchical Dirichlet process (HDP), and it is shown that nonparametric PAM achieves performance matching the best of PAM without manually tuning the number of topics.
An architecture for parallel topic models
TLDR
This paper describes a high performance sampling architecture for inference of latent topic models on a cluster of workstations and shows that this architecture is entirely general and that it can be extended easily to more sophisticated latent variable models such as n-grams and hierarchies.
Differential Topic Models
TLDR
A differential topic model that captures both topic differences and similarities is presented, and the model is shown to outperform the state-of-the-art for document classification/ideology prediction on a number of text collections.
Finding scientific topics
  • T. Griffiths, M. Steyvers
  • Computer Science
    Proceedings of the National Academy of Sciences of the United States of America
  • 2004
TLDR
A generative model for documents is described, introduced by Blei, Ng, and Jordan, and a Markov chain Monte Carlo algorithm is presented for inference in this model, which is used to analyze abstracts from PNAS by using Bayesian model selection to establish the number of topics.
Topic models with power-law using Pitman-Yor process
TLDR
A novel topic model using the Pitman-Yor (PY) process is proposed, called the PY topic model, which captures two properties of a document: a power-law word distribution and the presence of multiple topics.
The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies
TLDR
An application to information retrieval is presented in which documents are modeled as paths down a random tree, and the preferential attachment dynamics of the nCRP lead to clustering of documents according to sharing of topics at multiple levels of abstraction.
Word Features for Latent Dirichlet Allocation
We extend Latent Dirichlet Allocation (LDA) by explicitly allowing for the encoding of side information in the distribution over words. This results in a variety of new capabilities, such as improved…