Corpus ID: 215812433

Reading Tea Leaves: How Humans Interpret Topic Models

By Jonathan Chang, Jordan L. Boyd-Graber, Sean Gerrish, Chong Wang, and David M. Blei
Probabilistic topic models are a popular tool for the unsupervised analysis of text, providing both a predictive model of future text and a latent topic representation of the corpus. Practitioners typically assume that the latent space is semantically meaningful. It is used to check models, summarize the corpus, and guide exploration of its contents. However, whether the latent space is interpretable is in need of quantitative evaluation. In this paper, we present new quantitative methods for… 
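
One of the paper's proposed methods is the word-intrusion task: show a subject a topic's top words plus one "intruder" word from another topic, and measure whether the intruder is reliably spotted. A minimal sketch of constructing one such test item, using hypothetical topics and helper names:

```python
import random

def make_intrusion_item(topic_top_words, other_topic_words, rng):
    """Build one word-intrusion item: a topic's top words plus one
    'intruder' drawn from a different topic (and absent from this one),
    shuffled so the intruder's position carries no signal."""
    candidates = [w for w in other_topic_words if w not in topic_top_words]
    intruder = rng.choice(candidates)
    item = topic_top_words + [intruder]
    rng.shuffle(item)
    return item, intruder

rng = random.Random(0)
topic = ["bank", "loan", "credit", "money", "interest"]
other = ["goal", "team", "match", "season", "coach"]
item, intruder = make_intrusion_item(topic, other, rng)
```

If subjects cannot pick out the intruder above chance, the topic's top words do not form a coherent concept.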

Not-So-Latent Dirichlet Allocation: Collapsed Gibbs Sampling Using Human Judgments

Presents a novel task, tag-and-cluster, which asks subjects to simultaneously annotate documents and cluster those annotations, and demonstrates that the resulting topic models have features that distinguish them from traditional topic models.

Incorporating Lexical Priors into Topic Models

This work proposes a simple and effective way to guide topic models to learn topics of specific interest to a user by providing sets of seed words that a user believes are representative of the underlying topics in a corpus.
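
One common way to realize such seed-word guidance is to boost the Dirichlet prior over the topic-word distribution for a topic's seed words. A minimal sketch under that assumption (the function name and parameter values are illustrative, not from the paper):

```python
def seeded_priors(vocab, seed_sets, base=0.01, boost=1.0):
    """Per-topic Dirichlet priors over the vocabulary: seed words for
    topic k receive base + boost, all other words receive base, so the
    sampler is nudged toward placing seed words in their topic."""
    priors = []
    for seeds in seed_sets:
        priors.append([base + (boost if w in seeds else 0.0) for w in vocab])
    return priors

vocab = ["bank", "loan", "goal", "team"]
priors = seeded_priors(vocab, [{"bank", "loan"}, {"goal", "team"}])
```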

Automatic Labelling of Topics via Analysis of User Summaries

Proposes a novel approach that extracts words and meaningful phrases from external, user-generated summaries as candidate labels and then ranks them via a Kullback-Leibler semantic distance metric.
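
The ranking step can be sketched as scoring each candidate label's word distribution by its KL divergence from the topic's word distribution and sorting ascending (smaller divergence means a closer label). The distributions and names below are hypothetical:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support;
    eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def rank_labels(topic_dist, label_dists):
    """Rank candidate labels by ascending KL divergence from the topic."""
    scored = sorted(label_dists.items(),
                    key=lambda kv: kl_divergence(topic_dist, kv[1]))
    return [label for label, _ in scored]

topic = [0.5, 0.3, 0.2]
labels = {"finance": [0.45, 0.35, 0.20], "sports": [0.10, 0.10, 0.80]}
best = rank_labels(topic, labels)[0]
```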

Keyword Assisted Embedded Topic Model

The Keyword Assisted Embedded Topic Model (KeyETM) is proposed, which equips ETM with the ability to incorporate user knowledge in the form of informative topic-level priors over the vocabulary.

Precision-Recall Balanced Topic Modelling

This work formulates topic modelling as an information retrieval task, where the goal is to capture relevant term co-occurrence patterns based on the latent topic representation, and provides a statistical model that allows the user to balance the contributions of the different error types.

A Bayesian Topic Model for Human-Evaluated Interpretability

This paper aims to improve interpretability in topic modeling by combining nonparametric and weakly-supervised topic models into a complete, self-contained model that outperforms prior approaches on human-evaluated interpretability.

Improving and Evaluating Topic Models and Other Models of Text

It is shown that words that are both frequent and exclusive to a theme are more effective at characterizing topical content, and a regularization scheme is proposed that leads to better estimates of these quantities.
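
The frequent-and-exclusive idea can be sketched as a single score: a weighted harmonic mean of a word's within-topic probability (frequency) and the share of its total probability mass held by that topic (exclusivity). This is a simplified, assumed form of the idea (the published score operates on ranks); all names and numbers below are illustrative:

```python
def frex(topics, k, w, weight=0.5):
    """Simplified FREX-style score for word w in topic k: weighted
    harmonic mean of within-topic frequency and exclusivity."""
    freq = topics[k].get(w, 0.0)
    excl = freq / sum(t.get(w, 0.0) for t in topics)
    return 1.0 / (weight / excl + (1 - weight) / freq)

topics = [
    {"bank": 0.40, "the": 0.40, "loan": 0.20},
    {"bank": 0.05, "the": 0.40, "goal": 0.55},
]
```

"bank" and "the" are equally frequent in topic 0, but "bank" is nearly exclusive to it, so "bank" characterizes the topic better.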

Optimizing Semantic Coherence in Topic Models

Proposes an automated evaluation metric for topic coherence and a novel statistical topic model based on this metric that significantly improves topic quality in a large-scale document collection from the National Institutes of Health (NIH).
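
The coherence metric scores a topic's top words by summing, over ordered word pairs, the log of the smoothed co-document frequency divided by the document frequency of the earlier word. A minimal sketch over toy documents:

```python
import math

def umass_coherence(top_words, docs):
    """Document co-occurrence coherence: for each pair of top words,
    add log((co-document-frequency + 1) / document-frequency of the
    earlier word). Higher is more coherent."""
    doc_sets = [set(d) for d in docs]
    def df(w):
        return sum(1 for d in doc_sets if w in d)
    def codf(w1, w2):
        return sum(1 for d in doc_sets if w1 in d and w2 in d)
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            score += math.log((codf(top_words[m], top_words[l]) + 1)
                              / df(top_words[l]))
    return score

docs = [["bank", "loan"], ["bank", "loan", "credit"], ["goal", "team"]]
coherent = umass_coherence(["bank", "loan"], docs)
mixed = umass_coherence(["bank", "goal"], docs)
```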

Interpretability of API Call Topic Models: An Exploratory Study

The objective is to explore the coherence of topics and their ability to represent the themes of API calls from the perspective of malware analysts; the analysts' judgments agree with automatic topic coherence measures on which topics are most interpretable.

Automatic labeling of multinomial topic models

Probabilistic approaches to automatically labeling multinomial topic models in an objective way are proposed and can be applied to labeling topics learned through all kinds of topic models such as PLSA, LDA, and their variations.

Latent Dirichlet Allocation

Probabilistic Latent Semantic Analysis

This work proposes a widely applicable generalization of maximum likelihood model fitting by tempered EM, based on a mixture decomposition derived from a latent class model; the result is a more principled approach with a solid foundation in statistics.

Evaluation methods for topic models

It is demonstrated experimentally that commonly-used methods are unlikely to accurately estimate the probability of held-out documents, and two alternative methods that are both accurate and efficient are proposed.
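
The hard part of held-out evaluation is marginalizing over the unknown topic proportions of a new document; that is what the commonly-used estimators get wrong. The easy inner computation, sketched below with the topic proportions theta assumed known, shows what is being estimated (all names here are illustrative):

```python
import math

def heldout_log_likelihood(doc, theta, phi):
    """Log probability of held-out tokens under fixed topic proportions
    theta and topic-word distributions phi: p(w) = sum_k theta[k]*phi[k][w].
    In practice theta must be integrated out, which is the hard part."""
    return sum(
        math.log(sum(t * p[w] for t, p in zip(theta, phi))) for w in doc
    )

phi = [[0.9, 0.1], [0.1, 0.9]]   # two topics over a two-word vocabulary
theta = [0.5, 0.5]
ll = heldout_log_likelihood([0, 1], theta, phi)
```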

Correlated Topic Models

The correlated topic model (CTM) is developed, where the topic proportions exhibit correlation via the logistic normal distribution, and a mean-field variational inference algorithm is derived for approximate posterior inference, which is complicated by the fact that the logistic normal is not conjugate to the multinomial.
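
The logistic normal draws a Gaussian vector and maps it through the softmax to get topic proportions; correlation between topics comes from the Gaussian's covariance. A sketch of the sampling step, simplified to a diagonal covariance (the full CTM uses a full covariance matrix to capture topic correlations):

```python
import math
import random

def logistic_normal_sample(mu, sigma, rng):
    """Draw eta ~ N(mu, diag(sigma^2)) and map through the softmax to
    obtain topic proportions on the simplex."""
    eta = [m + s * rng.gauss(0, 1) for m, s in zip(mu, sigma)]
    mx = max(eta)                          # subtract max for stability
    exps = [math.exp(e - mx) for e in eta]
    z = sum(exps)
    return [e / z for e in exps]

rng = random.Random(0)
theta = logistic_normal_sample([0.0, 1.0, -1.0], [0.5, 0.5, 0.5], rng)
```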

Organizing the OCA: learning faceted subjects from a library of digital books

DCM-LDA, a topic model based on Dirichlet Compound Multinomial distributions, is presented, which is simultaneously better able to represent observed properties of text and more scalable to extremely large text collections.

Studying the History of Ideas Using Topic Models

Unsupervised topic modeling is applied to the ACL Anthology to analyze historical trends in the field of Computational Linguistics from 1978 to 2006, finding trends including the rise of probabilistic methods starting in 1988, a steady increase in applications, and a sharp decline of research in semantics and understanding between 1978 and 2001.

LDA-based document models for ad-hoc retrieval

This paper proposes an LDA-based document model within the language modeling framework, and evaluates it on several TREC collections, and shows that improvements over retrieval using cluster-based models can be obtained with reasonable efficiency.
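
A common form of such an LDA-based document model interpolates a Dirichlet-smoothed document language model with an LDA-derived word probability when scoring a query. A sketch under that assumption (the mixing weight `lam` and all data below are hypothetical):

```python
import math

def lda_lm_score(query, doc_counts, doc_len, coll_prob, lda_prob,
                 mu=1000.0, lam=0.7):
    """Query log-likelihood: linearly interpolate a Dirichlet-smoothed
    document model with an LDA word probability for each query term."""
    score = 0.0
    for w in query:
        p_lm = (doc_counts.get(w, 0) + mu * coll_prob[w]) / (doc_len + mu)
        score += math.log(lam * p_lm + (1 - lam) * lda_prob[w])
    return score

coll_prob = {"bank": 0.01}
lda_prob = {"bank": 0.02}
hit = lda_lm_score(["bank"], {"bank": 5}, 10, coll_prob, lda_prob)
miss = lda_lm_score(["bank"], {}, 10, coll_prob, lda_prob)
```

A document containing the query term scores higher, while the LDA component keeps topically related documents from scoring zero.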

A Joint Model of Text and Aspect Ratings for Sentiment Summarization

A statistical model is proposed that discovers topics in text corresponding to aspect ratings and extracts textual evidence from reviews supporting each rating, a fundamental problem in aspect-based sentiment summarization.

Cheap and Fast – But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks

This work explores the use of Amazon's Mechanical Turk system, a significantly cheaper and faster method for collecting annotations from a broad base of paid non-expert contributors over the Web, and proposes a technique for bias correction that significantly improves annotation quality on two tasks.
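
One simple form of such bias correction is to weight each annotator's vote by an accuracy estimate calibrated on gold-standard items, rather than taking a plain majority. A minimal stand-in sketch (the paper's actual scheme recalibrates label posteriors; names and numbers here are illustrative):

```python
from collections import defaultdict

def weighted_vote(labels_by_worker, worker_accuracy):
    """Aggregate noisy labels, weighting each worker's vote by an
    accuracy estimate from gold-standard questions; unknown workers
    default to chance weight 0.5."""
    scores = defaultdict(float)
    for worker, label in labels_by_worker.items():
        scores[label] += worker_accuracy.get(worker, 0.5)
    return max(scores, key=scores.get)

votes = {"w1": "neg", "w2": "neg", "w3": "pos"}
accuracy = {"w1": 0.4, "w2": 0.4, "w3": 0.95}
label = weighted_vote(votes, accuracy)
```

Here one reliable annotator outweighs two unreliable ones, which a plain majority vote would get wrong.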