Corpus ID: 15107331

Generalized Topic Modeling

@article{Blum2016GeneralizedTM,
  title={Generalized Topic Modeling},
  author={Avrim Blum and Nika Haghtalab},
  journal={ArXiv},
  year={2016},
  volume={abs/1611.01259}
}
Recently there has been significant activity in developing algorithms with provable guarantees for topic modeling. In standard topic models, a topic (such as sports, business, or politics) is viewed as a probability distribution $\vec{a}_i$ over words, and a document is generated by first selecting a mixture $\vec{w}$ over topics, and then generating words i.i.d. from the associated mixture $A\vec{w}$. Given a large collection of such documents, the goal is to recover the topic vectors and then…
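To make the generative process above concrete, here is a minimal sketch in Python; the number of topics, vocabulary size, Dirichlet priors, and document length below are illustrative assumptions, not values from the paper.

# Hedged sketch of the standard topic-model generative process:
# draw a topic mixture w, then draw words i.i.d. from A w.
import numpy as np

rng = np.random.default_rng(0)
k, V, doc_len = 3, 1000, 50                  # topics, vocabulary size, words per doc (assumed)
A = rng.dirichlet(np.ones(V), size=k).T      # V x k matrix; column a_i is topic i's word distribution

def generate_document():
    w = rng.dirichlet(np.ones(k))            # mixture over topics for this document
    p = A @ w                                # the word distribution A w
    return rng.choice(V, size=doc_len, p=p)  # words drawn i.i.d. from A w

docs = [generate_document() for _ in range(100)]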


Algorithms for Generalized Topic Modeling
TLDR
This work considers a broad generalization of the traditional topic modeling framework in which words are no longer assumed to be drawn i.i.d.; instead, a topic is viewed as a complex distribution over sequences of paragraphs. The goal is to learn a predictor that, given a new document, accurately predicts its topic mixture, without learning the distributions explicitly.

References

Showing 1-10 of 26 references.
A provable SVD-based algorithm for learning topics in dominant admixture corpus
TLDR
Under a more realistic assumption, a singular value decomposition (SVD)-based algorithm with a crucial pre-processing step of thresholding can provably recover the topics from a collection of documents drawn from dominant admixtures.
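A hedged sketch of the thresholding-plus-SVD idea summarized above: zero out small entries of the word-document frequency matrix, then take a rank-k truncated SVD. The threshold rule here is an illustrative assumption, not the paper's exact TSVD procedure.

import numpy as np

def threshold_then_svd(M, k, tau):
    # M: words x documents matrix of word frequencies (assumed input format);
    # k: number of topics; tau: assumed threshold below which entries are zeroed.
    M_thr = np.where(M >= tau, M, 0.0)             # pre-processing: thresholding
    U, S, Vt = np.linalg.svd(M_thr, full_matrices=False)
    return U[:, :k], S[:k], Vt[:k, :]              # rank-k spectral summary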
A Spectral Algorithm for Latent Dirichlet Allocation
TLDR
This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of multi-view models and topic models, including latent Dirichlet allocation (LDA).
Latent Dirichlet Allocation
Combining labeled and unlabeled data with co-training
TLDR
A PAC-style analysis is provided for a problem setting motivated by the task of learning to classify web pages, in which the description of each example can be partitioned into two distinct views, allowing inexpensive unlabeled data to augment a much smaller set of labeled examples.
Co-Training and Expansion: Towards Bridging Theory and Practice
TLDR
A much weaker "expansion" assumption on the underlying data distribution is proposed and proved to be sufficient for iterative co-training to succeed given appropriately strong PAC-learning algorithms on each feature set; the assumption is shown to be necessary to some extent as well.
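For intuition, here is a minimal sketch of the iterative co-training loop analyzed in the two references above: one classifier per view, with the most confidently predicted unlabeled examples repeatedly moved into the shared labeled pool. The classifier choice, confidence rule, and batch size are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X1_l, X2_l, y_l, X1_u, X2_u, rounds=10, per_round=5):
    for _ in range(rounds):
        # Train one classifier per view on the current labeled pool.
        h1 = LogisticRegression(max_iter=1000).fit(X1_l, y_l)
        h2 = LogisticRegression(max_iter=1000).fit(X2_l, y_l)
        if len(X1_u) == 0:
            break
        # Pick the unlabeled examples either view is most confident about;
        # label each one with the more confident view's prediction.
        p1, p2 = h1.predict_proba(X1_u), h2.predict_proba(X2_u)
        conf = np.maximum(p1.max(axis=1), p2.max(axis=1))
        pick = np.argsort(conf)[-per_round:]
        use1 = p1[pick].max(axis=1) >= p2[pick].max(axis=1)
        labels = np.where(use1,
                          h1.classes_[p1[pick].argmax(axis=1)],
                          h2.classes_[p2[pick].argmax(axis=1)])
        X1_l = np.vstack([X1_l, X1_u[pick]])
        X2_l = np.vstack([X2_l, X2_u[pick]])
        y_l = np.concatenate([y_l, labels])
        keep = np.setdiff1d(np.arange(len(X1_u)), pick)
        X1_u, X2_u = X1_u[keep], X2_u[keep]
    return h1, h2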
A Practical Algorithm for Topic Modeling with Provable Guarantees
TLDR
This paper presents an algorithm for topic model inference that is both provable and practical, producing results comparable to the best MCMC implementations while running orders of magnitude faster.
Tensor decompositions for learning latent variable models
TLDR
A detailed analysis of a robust tensor power method is provided, establishing an analogue of Wedin's perturbation theorem for the singular vectors of matrices; this implies a robust and computationally tractable estimation approach for several popular latent variable models.
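For reference, a minimal sketch of one run of the tensor power iteration analyzed there, for a symmetric third-order moment tensor T given as a d x d x d numpy array; random restarts, deflation, and the robustness machinery are omitted.

import numpy as np

def tensor_power_iteration(T, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.normal(size=T.shape[0])              # random starting vector
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = np.einsum('ijk,j,k->i', T, v, v)     # the power map v -> T(I, v, v)
        v /= np.linalg.norm(v)
    lam = np.einsum('ijk,i,j,k->', T, v, v, v)   # eigenvalue estimate T(v, v, v)
    return lam, v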
Probabilistic Latent Semantic Analysis
TLDR
This work proposes a widely applicable generalization of maximum likelihood model fitting by tempered EM, based on a mixture decomposition derived from a latent class model; the result is a more principled approach with a solid foundation in statistics.
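As a rough illustration, the sketch below runs tempered EM for a PLSA-style latent class model, raising the E-step posterior to a power beta < 1; the count-matrix format, initialization, and fixed tempering parameter are assumptions made for illustration.

import numpy as np

def plsa_tempered_em(n, k, beta=0.8, n_iter=50, seed=0):
    # n: documents x words matrix of word counts (assumed format);
    # k: number of latent classes; beta: tempering parameter.
    rng = np.random.default_rng(seed)
    D, W = n.shape
    p_z_d = rng.dirichlet(np.ones(k), size=D)    # P(z|d), shape (D, k)
    p_w_z = rng.dirichlet(np.ones(W), size=k)    # P(w|z), shape (k, W)
    for _ in range(n_iter):
        # Tempered E-step: topic posterior raised to the power beta.
        post = (p_z_d[:, :, None] * p_w_z[None, :, :]) ** beta   # (D, k, W)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: re-estimate the multinomials from expected counts.
        nz = n[:, None, :] * post                # expected counts, (D, k, W)
        p_z_d = nz.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
        p_w_z = nz.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z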
Settling the Polynomial Learnability of Mixtures of Gaussians
  • Ankur Moitra, G. Valiant
  • Computer Science
  • 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, 2010
TLDR
This paper gives the first polynomial-time algorithm for proper density estimation for mixtures of k Gaussians that needs no assumptions on the mixture, and proves that the algorithm's exponential dependence on the number of Gaussians is necessary.
Disentangling Gaussians
TLDR
The conclusion is that the statistical and computational complexity of this general problem are polynomial in every parameter except the number of Gaussians, on which the dependence is necessarily exponential.
...