Optimizing Semantic Coherence in Topic Models


Latent variable models have the potential to add value to large document collections by discovering interpretable, low-dimensional subspaces. For people to use such models, however, they must trust them. Unfortunately, typical dimensionality reduction methods for text, such as latent Dirichlet allocation, often produce low-dimensional subspaces (topics) that are obviously flawed to human domain experts. The contributions of this paper are threefold: (1) an analysis of the ways in which topics can be flawed; (2) an automated evaluation metric for identifying such topics that does not rely on human annotators or reference collections outside the training data; (3) a novel statistical topic model based on this metric that significantly improves topic quality in a large-scale document collection from the National Institutes of Health (NIH).
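The abstract does not state the metric's formula, so the following is only a minimal sketch of one way such a score could be computed from the training documents alone: for each pair of a topic's top words, take the log of the smoothed co-document frequency divided by the document frequency of the higher-ranked word, and sum over pairs. The function name `topic_coherence`, the add-one smoothing, and the toy corpus are assumptions made for illustration, not details taken from the text above.

```python
import math


def topic_coherence(top_words, documents):
    """Sum of log((D(w_m, w_l) + 1) / D(w_l)) over ordered pairs of top words.

    top_words: a topic's most probable words, most probable first.
    documents: the training corpus, each document a list of tokens.
    D(.) counts the documents that contain all of the given words.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(top_words[l])
            if d_l == 0:
                # Assumption: skip words that never occur in the corpus.
                continue
            co = doc_freq(top_words[m], top_words[l])
            score += math.log((co + 1) / d_l)
    return score


# Hypothetical toy corpus, for illustration only.
docs = [
    ["gene", "protein", "cell"],
    ["gene", "protein"],
    ["gene", "cell"],
    ["market", "stock"],
]
coherent = topic_coherence(["gene", "protein", "cell"], docs)
mixed = topic_coherence(["gene", "market", "stock"], docs)
```

Under these assumptions, a topic whose top words tend to appear in the same documents scores higher than one that mixes words from unrelated documents, and no human annotation or external reference corpus is needed.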
