• Corpus ID: 8853630

Incorporating Lexical Priors into Topic Models

Jagadeesh Jagarlamudi, Hal Daumé, Raghavendra Udupa
Topic models have great potential for helping users understand document corpora. […] We achieve this by providing sets of seed words that a user believes are representative of the underlying topics in a corpus. Our model uses these seeds to improve both topic-word distributions (by biasing topics to produce appropriate seed words) and document-topic distributions (by biasing documents to select topics related to the seed words they contain). Extrinsic evaluation on a document clustering…
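The seed-word biasing described in the abstract can be sketched as an asymmetric Dirichlet prior over each topic's word distribution, where a topic's seed words receive extra pseudo-counts. This is a minimal illustrative sketch, not the paper's actual parameterization; the function name, `base_beta`, and `seed_boost` values are all assumptions.

```python
import numpy as np

def seeded_topic_word_prior(vocab, seed_sets, base_beta=0.01, seed_boost=1.0):
    """Build an asymmetric Dirichlet prior over words for each topic.

    vocab      : list of word types
    seed_sets  : dict mapping topic id -> set of seed words for that topic
    base_beta  : symmetric pseudo-count given to every word
    seed_boost : extra pseudo-count added to a topic's own seed words
    (All names and values here are illustrative, not from the paper.)
    """
    word_index = {w: i for i, w in enumerate(vocab)}
    num_topics = len(seed_sets)
    prior = np.full((num_topics, len(vocab)), base_beta)
    for k, seeds in seed_sets.items():
        for w in seeds:
            if w in word_index:  # ignore out-of-vocabulary seeds
                prior[k, word_index[w]] += seed_boost
    return prior

vocab = ["game", "team", "vote", "election", "score"]
seeds = {0: {"game", "team"}, 1: {"vote", "election"}}
prior = seeded_topic_word_prior(vocab, seeds)
# Each topic's seed words get a larger pseudo-count in that topic's row,
# biasing the topic-word distribution toward producing its seed words.
```

Used as the Dirichlet hyperparameter for the topic-word distributions, this boosts the posterior probability of seed words in their designated topics while leaving other words exchangeable.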


A Guided Topic-Noise Model for Short Texts
The Guided Topic-Noise Model (GTM), a semi-supervised topic model designed with large, domain-specific social media data sets in mind, is presented; it uses a novel initialization and a new sampling algorithm, Generalized Polya Urn seed word sampling, to produce a topic set that includes expanded seed topics as well as new unsupervised topics.
Keyword Assisted Embedded Topic Model
The Keyword Assisted Embedded Topic Model (KeyETM) is proposed, which equips ETM with the ability to incorporate user knowledge in the form of informative topic-level priors over the vocabulary.
A novel topic model for documents by incorporating semantic relations between words
This paper develops a novel topic model, Mixed Word Correlation Knowledge-based Latent Dirichlet Allocation, to infer latent topics from a text corpus; it mines two forms of lexical semantic knowledge based on recent progress in word embedding, which can represent the semantic information of words in a continuous vector space.
Keyword Assisted Topic Models
It is empirically demonstrated that providing topic models with a small number of keywords can substantially improve their performance; the proposed keyword-assisted topic model (keyATM) provides more interpretable results, achieves better document classification performance, and is less sensitive to the number of topics than standard topic models.
Assessing topic model relevance: Evaluation and informative priors
This work proposes an additional topic quality metric that targets the stopword problem, and shows that it, unlike the standard measures, correctly correlates with human judgments of quality as defined by concentration of information‐rich words.
Source-LDA: Enhancing Probabilistic Topic Models Using Prior Knowledge Sources
A semi-supervised Latent Dirichlet Allocation (LDA) model, Source-LDA, is introduced, which incorporates prior knowledge to guide the topic modeling process and improve the quality of both the resulting topics and the topic labeling.
Tackling topic general words in topic modeling
Incorporating Word Correlation Knowledge into Topic Modeling
A Markov Random Field regularized Latent Dirichlet Allocation model is proposed, which defines an MRF on the latent topic layer of LDA to encourage words labeled as similar to share the same topic label, and can accommodate the subtlety that whether two words are similar depends on which topic they appear in.
Promoting Domain-Specific Terms in Topic Models with Informative Priors
This work proposes a simple strategy for automatically promoting terms with domain relevance and demoting domain-specific stop words; it increases the amount of domain-relevant content and reduces the appearance of canonical and human-evaluated stopwords in three very different domains.
Seed-Guided Topic Discovery with Out-of-Vocabulary Seeds
This paper proposes a novel framework, named SeeTopic, wherein the general knowledge of pre-trained language models (PLMs) and the local semantics learned from the input corpus can mutually benefit each other.


Reading Tea Leaves: How Humans Interpret Topic Models
New quantitative methods for measuring semantic meaning in inferred topics are presented, showing that they capture aspects of the model that are undetected by previous measures of model quality based on held-out likelihood.
Topic modeling: beyond bag-of-words
A hierarchical generative probabilistic model that incorporates both n-gram statistics and latent topic variables by extending a unigram topic model to include properties of a hierarchical Dirichlet bigram language model is explored.
Interactive Topic Modeling
This work presents a new way of picking words to represent a topic, along with a novel method for interactive topic modeling that lets the user give live feedback on the topics, which the inference algorithm uses to guide the LDA parameter search.
Finding scientific topics
  • T. Griffiths, M. Steyvers
  • Proceedings of the National Academy of Sciences of the United States of America
  • 2004
A generative model for documents, introduced by Blei, Ng, and Jordan, is described, and a Markov chain Monte Carlo algorithm is presented for inference in this model; the algorithm is used to analyze abstracts from PNAS, with Bayesian model selection used to establish the number of topics.
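The MCMC algorithm referenced above is the well-known collapsed Gibbs sampler for LDA, which resamples each token's topic from its conditional distribution given all other assignments. The following is a minimal sketch of one sweep over a single document under that scheme; the variable names and hyperparameter values are illustrative, and a real implementation would loop over all documents for many iterations.

```python
import numpy as np

def gibbs_step(doc_words, z, n_wk, n_k, n_dk, alpha=0.1, beta=0.01):
    """One collapsed Gibbs sweep over a single document (illustrative sketch).

    doc_words : word ids of the document's tokens
    z         : current topic assignment per token (updated in place)
    n_wk      : word-topic counts, shape (V, K)
    n_k       : per-topic token totals, shape (K,)
    n_dk      : this document's topic counts, shape (K,)
    """
    V, K = n_wk.shape
    for i, w in enumerate(doc_words):
        k_old = z[i]
        # Remove the token's current assignment from all counts.
        n_wk[w, k_old] -= 1; n_k[k_old] -= 1; n_dk[k_old] -= 1
        # Conditional: p(z_i = k) ∝ (n_wk + β)/(n_k + Vβ) * (n_dk + α).
        p = (n_wk[w] + beta) / (n_k + V * beta) * (n_dk + alpha)
        k_new = np.random.choice(K, p=p / p.sum())
        # Add the token back under its newly sampled topic.
        n_wk[w, k_new] += 1; n_k[k_new] += 1; n_dk[k_new] += 1
        z[i] = k_new
    return z

# Tiny example: 3-word vocabulary, 2 topics, one 3-token document.
V, K = 3, 2
doc = np.array([0, 1, 2])
z = np.array([0, 1, 0])
n_wk = np.zeros((V, K)); n_k = np.zeros(K); n_dk = np.zeros(K)
for i, w in enumerate(doc):  # initialize counts from the assignments
    n_wk[w, z[i]] += 1; n_k[z[i]] += 1; n_dk[z[i]] += 1
gibbs_step(doc, z, n_wk, n_k, n_dk)
```

The count arrays are the sampler's only state; removing a token before resampling it is what makes the sampler "collapsed" with respect to the topic-word and document-topic distributions.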
A Two-Dimensional Topic-Aspect Model for Discovering Multi-Faceted Topics
The Topic-Aspect Model is presented, a Bayesian mixture model which jointly discovers topics and aspects and can generate token assignments in both of these dimensions, rather than assuming words come from only one of two orthogonal models.
Supervised Topic Models
The supervised latent Dirichlet allocation (sLDA) model, a statistical model of labelled documents, is introduced, which derives a maximum-likelihood procedure for parameter estimation, which relies on variational approximations to handle intractable posterior expectations.
Latent Dirichlet Allocation
Incorporating domain knowledge into topic modeling via Dirichlet Forest priors
This work incorporates domain knowledge about the composition of words that should have high or low probability in various topics, using a novel Dirichlet Forest prior in a Latent Dirichlet Allocation framework.
A Topic Model for Word Sense Disambiguation
A probabilistic posterior inference algorithm for simultaneously disambiguating a corpus and learning the domains in which to consider each word is developed.
Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora
Labeled LDA is introduced, a topic model that constrains Latent Dirichlet Allocation by defining a one-to-one correspondence between LDA's latent topics and user tags, allowing Labeled LDA to directly learn word-tag correspondences.