Reading Tea Leaves: How Humans Interpret Topic Models
@inproceedings{Chang2009ReadingTL,
  title={Reading Tea Leaves: How Humans Interpret Topic Models},
  author={Jonathan Chang and Jordan L. Boyd-Graber and Sean Gerrish and Chong Wang and David M. Blei},
  booktitle={NIPS},
  year={2009}
}
Probabilistic topic models are a popular tool for the unsupervised analysis of text, providing both a predictive model of future text and a latent topic representation of the corpus. Practitioners typically assume that the latent space is semantically meaningful. It is used to check models, summarize the corpus, and guide exploration of its contents. However, whether the latent space is interpretable is in need of quantitative evaluation. In this paper, we present new quantitative methods for…
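For context on the kind of evaluation the paper proposes: its word-intrusion task shows subjects the top words of a topic plus one out-of-place "intruder" word drawn from another topic, and measures how reliably subjects spot the intruder. Below is a minimal sketch of how such an instance could be constructed and scored; the topic word lists and responses are hypothetical stand-ins, not output from a fitted model or the paper's actual data.

```python
import random

# Hypothetical topic-word lists; in practice these would be the
# highest-probability words from a fitted topic model (e.g., LDA).
topics = {
    "transportation": ["car", "train", "road", "bus", "traffic"],
    "finance": ["bank", "loan", "money", "credit", "interest"],
}

def word_intrusion_instance(topic, other_topic, rng=random):
    """Build one word-intrusion question: the top words from `topic`
    plus one high-probability word from `other_topic` as the intruder."""
    intruder = rng.choice(topics[other_topic])
    candidates = topics[topic] + [intruder]
    rng.shuffle(candidates)
    return candidates, intruder

def model_precision(responses, intruders):
    """Fraction of responses that correctly identify the intruder."""
    correct = sum(r == i for r, i in zip(responses, intruders))
    return correct / len(intruders)

# Example: one instance and a few simulated subject responses.
choices, intruder = word_intrusion_instance("transportation", "finance")
print(choices, "intruder:", intruder)
print(model_precision(["bank", "bank", "road"], ["bank", "bank", "bank"]))
```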
1,940 Citations
Not-So-Latent Dirichlet Allocation: Collapsed Gibbs Sampling Using Human Judgments
- Computer Science, Mturk@HLT-NAACL
- 2010
A novel task, tag-and-cluster, is presented, which asks subjects to simultaneously annotate documents and cluster those annotations, and it is demonstrated that the resulting topic models have features that distinguish them from traditional topic models.
Incorporating Lexical Priors into Topic Models
- Computer Science, EACL
- 2012
This work proposes a simple and effective way to guide topic models to learn topics of specific interest to a user by providing sets of seed words that a user believes are representative of the underlying topics in a corpus.
Automatic Labelling of Topics via Analysis of User Summaries
- Computer Science, ADC
- 2016
A novel approach is proposed that extracts words and meaningful phrases from external user-generated summaries as candidate labels and then ranks them via a Kullback-Leibler semantic distance metric.
Keyword Assisted Embedded Topic Model
- Computer Science, WSDM
- 2022
The Keyword Assisted Embedded Topic Model (KeyETM) is proposed, which equips ETM with the ability to incorporate user knowledge in the form of informative topic-level priors over the vocabulary.
Precision-Recall Balanced Topic Modelling
- Computer Science, NeurIPS
- 2019
This work formulates topic modelling as an information retrieval task in which the goal is to capture relevant term co-occurrence patterns based on the latent topic representation, and provides a statistical model that allows the user to balance the contributions of the different error types.
A Bayesian Topic Model for Human-Evaluated Interpretability
- Computer Science, LREC
- 2022
This paper aims to improve interpretability in topic modeling by combining nonparametric and weakly-supervised topic models into a complete, self-contained topic model that outperforms prior models on human-evaluated interpretability.
Improving and Evaluating Topic Models and Other Models of Text
- Computer Science
- 2016
It is shown that words that are both frequent and exclusive to a theme are more effective at characterizing topical content, and a regularization scheme is proposed that leads to better estimates of these quantities.
Optimizing Semantic Coherence in Topic Models
- Computer Science, EMNLP
- 2011
An automated evaluation metric for topic coherence is introduced, together with a novel statistical topic model based on this metric that significantly improves topic quality in a large-scale document collection from the National Institutes of Health (NIH).
Interpretability of API Call Topic Models: An Exploratory Study
- Computer Science, HICSS
- 2020
The objective is to explore the coherence of topics and their ability to represent the themes of API calls from the perspective of malware analysts; the analysts' judgments agree with automated topic coherence measures on which topics are most interpretable.
References
Automatic labeling of multinomial topic models
- Computer Science, KDD '07
- 2007
Probabilistic approaches to automatically labeling multinomial topic models in an objective way are proposed and can be applied to labeling topics learned through all kinds of topic models such as PLSA, LDA, and their variations.
Probabilistic Latent Semantic Analysis
- Computer Science, UAI
- 1999
This work proposes a widely applicable generalization of maximum likelihood model fitting by tempered EM, based on a mixture decomposition derived from a latent class model, resulting in a more principled approach with a solid foundation in statistics.
Evaluation methods for topic models
- Computer Science, ICML '09
- 2009
It is demonstrated experimentally that commonly-used methods are unlikely to accurately estimate the probability of held-out documents, and two alternative methods that are both accurate and efficient are proposed.
Correlated Topic Models
- Computer Science, NIPS
- 2005
The correlated topic model (CTM) is developed, in which the topic proportions exhibit correlation via the logistic normal distribution, and a mean-field variational inference algorithm is derived for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial.
Organizing the OCA: learning faceted subjects from a library of digital books
- Computer Science, JCDL '07
- 2007
DCM-LDA, a topic model based on Dirichlet Compound Multinomial distributions, is presented, which is simultaneously better able to represent observed properties of text and more scalable to extremely large text collections.
Studying the History of Ideas Using Topic Models
- Computer Science, EMNLP
- 2008
Unsupervised topic modeling is applied to the ACL Anthology to analyze historical trends in the field of Computational Linguistics from 1978 to 2006, finding trends including the rise of probabilistic methods starting in 1988, a steady increase in applications, and a sharp decline of research in semantics and understanding between 1978 and 2001.
LDA-based document models for ad-hoc retrieval
- Computer Science, SIGIR
- 2006
This paper proposes an LDA-based document model within the language modeling framework, evaluates it on several TREC collections, and shows that improvements over retrieval using cluster-based models can be obtained with reasonable efficiency.
A Joint Model of Text and Aspect Ratings for Sentiment Summarization
- Computer Science, ACL
- 2008
A statistical model is proposed that discovers corresponding topics in text and extracts textual evidence from reviews supporting each aspect rating, addressing a fundamental problem in aspect-based sentiment summarization.
Cheap and Fast – But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks
- Computer Science, EMNLP
- 2008
This work explores the use of Amazon's Mechanical Turk system, a significantly cheaper and faster method for collecting annotations from a broad base of paid non-expert contributors over the Web, and proposes a technique for bias correction that significantly improves annotation quality on two tasks.