Optimizing Semantic Coherence in Topic Models


Latent variable models have the potential to add value to large document collections by discovering interpretable, low-dimensional subspaces. In order for people to use such models, however, they must trust them. Unfortunately, typical dimensionality reduction methods for text, such as latent Dirichlet allocation, often produce low-dimensional subspaces (topics) that are obviously flawed to human domain experts. The contributions of this paper are threefold: (1) An analysis of the ways in which topics can be flawed; (2) an automated evaluation metric for identifying such topics that does not rely on human annotators or reference collections outside the training data; (3) a novel statistical topic model based on this metric that significantly improves topic quality in a large-scale document collection from the National Institutes of Health (NIH).
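The evaluation metric referred to in contribution (2) scores a topic by how often its top words actually co-occur in the training documents, with no external reference corpus. A minimal sketch of such a co-document-frequency coherence score is below; the function name, the `epsilon` smoothing constant, and the list-of-token-lists document representation are illustrative assumptions, not the paper's exact implementation.

```python
from math import log

def topic_coherence(topic_words, documents, epsilon=1.0):
    """Score a topic's top words by summing log ratios of
    co-document frequency to document frequency over word pairs.
    Uses only the training documents themselves (a sketch, not
    the authors' exact code)."""
    doc_sets = [set(d) for d in documents]

    def doc_freq(w):
        # Number of documents containing word w.
        return sum(1 for d in doc_sets if w in d)

    def co_doc_freq(w1, w2):
        # Number of documents containing both words.
        return sum(1 for d in doc_sets if w1 in d and w2 in d)

    score = 0.0
    for m in range(1, len(topic_words)):
        for l in range(m):
            # epsilon smooths the count so the log is defined
            # even for pairs that never co-occur.
            score += log(
                (co_doc_freq(topic_words[m], topic_words[l]) + epsilon)
                / doc_freq(topic_words[l])
            )
    return score
```

Topics whose top words rarely appear together in the same documents receive strongly negative scores, flagging them as candidates for the kinds of flaws analyzed in contribution (1).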


Cite this paper

@inproceedings{Mimno2011OptimizingSC,
  title     = {Optimizing Semantic Coherence in Topic Models},
  author    = {David M. Mimno and Hanna M. Wallach and Edmund M. Talley and Miriam Leenders and Andrew McCallum},
  booktitle = {EMNLP},
  year      = {2011}
}