Reducing the sampling complexity of topic models
@inproceedings{Li2014ReducingTS, title={Reducing the sampling complexity of topic models}, author={Aaron Q. Li and Amr Ahmed and Sujith Ravi and Alex Smola}, booktitle={Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining}, year={2014} }
Inference in topic models typically involves a sampling step to associate latent variables with observations. Unfortunately, the generative model loses sparsity as the amount of data increases, requiring O(k) operations per word for k topics. In this paper we propose an algorithm which scales linearly with the number of actually instantiated topics k_d in the document. For large document collections and in structured hierarchical models k_d ≪ k. This yields an order of magnitude speedup. Our…
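The abstract is cut off before the mechanism, but the O(k_d) claim rests on splitting the sampling distribution into a sparse document-specific part and a dense, slowly-changing remainder that is drawn from cached alias tables and corrected with a Metropolis-Hastings step. Below is a minimal sketch of Walker's alias method, the O(1) discrete sampler at the core of that scheme; the function names and list-based bookkeeping are illustrative, not the paper's implementation.

```python
import random

def build_alias_table(probs):
    """Preprocess a normalized discrete distribution in O(k) so draws cost O(1)."""
    k = len(probs)
    scaled = [p * k for p in probs]          # rescale so the average bin mass is 1
    prob, alias = [0.0] * k, [0] * k
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]                  # bin s keeps its own mass ...
        alias[s] = l                         # ... and borrows the rest from bin l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                  # leftovers are (numerically) full bins
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias):
    """O(1) draw: pick a bin uniformly, then keep it or jump to its alias."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]
```

Building the tables still costs O(k), but that cost is amortized over O(k) draws from the same (slightly stale) distribution, and the Metropolis-Hastings correction absorbs the staleness, so per-word work is dominated by the O(k_d) sparse part.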
198 Citations
Scalable Collapsed Inference for High-Dimensional Topic Models
- Computer Science, NAACL
- 2019
This paper develops an online inference algorithm for topic models which leverages stochasticity to scale well in the number of documents, sparsity to scale well in the number of topics, and which operates in the collapsed representation of the topic model for improved accuracy and run-time performance.
Linear Time Samplers for Supervised Topic Models using Compositional Proposals
- Computer Science, KDD
- 2015
This work extends recent sampling advances for unsupervised LDA models to supervised tasks, focusing on the Gibbs MedLDA model, which is able to simultaneously discover latent structures and make accurate predictions; it is believed to be the first linear-time sampling algorithm for supervised topic models.
Efficient Correlated Topic Modeling with Topic Embedding
- Computer Science, KDD
- 2017
A new model which learns compact topic embeddings and captures topic correlations through the closeness between the topic vectors is proposed, enabling efficient inference in the low-dimensional embedding space.
Efficient Methods for Inferring Large Sparse Topic Hierarchies
- Computer Science, ACL
- 2015
This paper introduces efficient methods for inferring large topic hierarchies using the Sparse Backoff Tree (SBT), a new prior for latent topic distributions that organizes the latent topics as leaves in a tree, and presents a collapsed sampler for the model that exploits sparsity and the tree structure to make inference efficient.
Sparse Parallel Training for Hierarchical Dirichlet Process Topic Models
- Computer Science, EMNLP
- 2020
This work proposes a doubly sparse data-parallel sampler for the HDP topic model that addresses issues of scalability and sparsity in nonparametric extensions of topic models such as Latent Dirichlet Allocation.
Efficient Topic Modeling on Phrases via Sparsity
- Computer Science, 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI)
- 2017
A novel topic model, SparseTP, is proposed, which models words and phrases by linking them in a Markov random field when necessary, provides a well-formed lower bound of the model for Gibbs sampling, and exploits the sparse distribution of words and phrases over topics to speed up inference.
Multi-label classification using stacked hierarchical Dirichlet processes with reduced sampling complexity
- Computer Science, Knowledge and Information Systems
- 2018
This work reduces the sampling complexity, making it linear in the number of topics per document, via an approximation based on Metropolis-Hastings sampling: it proposes a different proposal distribution for the MH step, based on the observation that distributions at the upper hierarchy level change more slowly than the document-specific distributions at the lower level.
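The stale-proposal-plus-correction pattern recurs across these samplers; below is a generic sketch of the independence-proposal Metropolis-Hastings step it relies on. Here p, q, and propose are placeholders for the true conditional, the cheap (possibly stale) proposal, and a sampler for it; none of this is the paper's actual API.

```python
import random

def mh_step(t_cur, p, q, propose):
    """One MH step: draw t_new ~ q and accept with the standard ratio,
    so a stale proposal costs acceptance rate, never correctness."""
    t_new = propose()
    accept = (p(t_new) * q(t_cur)) / (p(t_cur) * q(t_new))
    return t_new if random.random() < accept else t_cur
```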
Scaling up Dynamic Topic Models
- Computer Science, WWW
- 2016
This paper presents a fast and parallelizable inference algorithm using Gibbs sampling with Stochastic Gradient Langevin Dynamics that does not make any unwarranted assumptions and is able to learn, to the authors' knowledge, the largest Dynamic Topic Model to date.
A Scalable Asynchronous Distributed Algorithm for Topic Modeling
- Computer Science, WWW
- 2015
It is shown that F+Nomad LDA significantly outperforms recent state-of-the-art topic modeling approaches on massive problems which involve millions of documents, billions of words, and thousands of topics.
Efficient Methods for Incorporating Knowledge into Topic Models
- Computer Science, EMNLP
- 2015
This work proposes a factor graph framework, Sparse Constrained LDA (SC-LDA), for efficiently incorporating prior knowledge into LDA, and evaluates its ability to incorporate word correlation knowledge and document label knowledge on three benchmark datasets.
References
Showing 1-10 of 25 references
Efficient methods for topic model inference on streaming document collections
- Computer Science, KDD
- 2009
Empirical results indicate that SparseLDA can be approximately 20 times faster than traditional LDA and provide twice the speedup of previously published fast sampling methods, while also using substantially less memory.
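The summary above reports only the speedup; the mechanism behind SparseLDA is an exact algebraic split of the collapsed-Gibbs mass (alpha + n_dt)(beta + n_wt)/(beta*V + n_t) into a cacheable smoothing bucket plus two buckets supported only on the nonzero document-topic and word-topic counts. A minimal sketch, assuming symmetric priors and omitting the incremental cache updates that give the real speedup; all names are illustrative:

```python
import random

def sparse_lda_draw(k, alpha, beta, V, n_t, doc_topics, word_topics):
    """One collapsed-Gibbs topic draw. doc_topics / word_topics are dicts
    {topic: count} holding only the nonzero counts for this doc / word."""
    denom = [beta * V + n_t[t] for t in range(k)]
    s = alpha * beta * sum(1.0 / d for d in denom)               # smoothing bucket
    r = {t: c * beta / denom[t] for t, c in doc_topics.items()}  # document bucket
    q = {t: (alpha + doc_topics.get(t, 0)) * c / denom[t]        # topic-word bucket
         for t, c in word_topics.items()}
    u = random.uniform(0.0, s + sum(r.values()) + sum(q.values()))
    for bucket in (q, r):              # usually resolved in the sparse buckets
        for t, mass in bucket.items():
            if u < mass:
                return t
            u -= mass
    for t in range(k):                 # rare O(k) fallback into the smoothing bucket
        mass = alpha * beta / denom[t]
        if u < mass:
            return t
        u -= mass
    return k - 1                       # guard against floating-point underflow
```

Since the topic-word bucket typically carries most of the mass and is checked first, a draw usually touches only the handful of topics active for the current word and document.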
Online Variational Inference for the Hierarchical Dirichlet Process
- Computer Science, AISTATS
- 2011
This work proposes an online variational inference algorithm for the HDP, an algorithm that is easily applicable to massive and streaming data, and lets us analyze much larger data sets.
Nonparametric Bayes Pachinko Allocation
- Computer Science, UAI
- 2007
A nonparametric Bayesian prior for PAM is proposed based on a variant of the hierarchical Dirichlet process (HDP), and it is shown that nonparametric PAM achieves performance matching the best of PAM without manually tuning the number of topics.
An architecture for parallel topic models
- Computer Science, Proc. VLDB Endow.
- 2010
This paper describes a high performance sampling architecture for inference of latent topic models on a cluster of workstations and shows that this architecture is entirely general and that it can be extended easily to more sophisticated latent variable models such as n-grams and hierarchies.
Differential Topic Models
- Computer Science, IEEE Transactions on Pattern Analysis and Machine Intelligence
- 2015
A differential topic model that models both topic differences and similarities is presented, and the model is shown to outperform the state of the art for document classification/ideology prediction on a number of text collections.
Finding scientific topics
- Computer Science, Proceedings of the National Academy of Sciences of the United States of America
- 2004
A generative model for documents, introduced by Blei, Ng, and Jordan, is described, and a Markov chain Monte Carlo algorithm is presented for inference in this model; the approach is used to analyze abstracts from PNAS, with Bayesian model selection establishing the number of topics.
Topic models with power-law using Pitman-Yor process
- Computer Science, KDD
- 2010
A novel topic model using the Pitman-Yor (PY) process, called the PY topic model, is proposed; it captures two properties of a document: a power-law word distribution and the presence of multiple topics.
The nested chinese restaurant process and bayesian nonparametric inference of topic hierarchies
- Computer Science, JACM
- 2010
An application to information retrieval is presented in which documents are modeled as paths down a random tree, and the preferential attachment dynamics of the nCRP lead to clustering of documents according to sharing of topics at multiple levels of abstraction.
Word Features for Latent Dirichlet Allocation
- Computer Science, NIPS
- 2010
We extend Latent Dirichlet Allocation (LDA) by explicitly allowing for the encoding of side information in the distribution over words. This results in a variety of new capabilities, such as improved…