Topic Grouper: An Agglomerative Clustering Approach to Topic Modeling

Daniel Pfeifer and Jochen L. Leidner
We introduce Topic Grouper as a complementary approach in the field of probabilistic topic modeling. [...] The algorithm starts with one-word topics and joins two topics at every step. It therefore generates a solution for every desired number of topics ranging between the size of the training vocabulary and one. The process represents an agglomerative clustering that corresponds to a binary tree of topics. A resulting tree may act as a containment hierarchy, typically with more general topics…
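The merging procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the criterion used to choose which two topics to join, so the `score` argument and the `cooccurrence` function below are hypothetical stand-ins (a toy document co-occurrence count), and the names `agglomerate`, `levels`, and `merges` are assumptions made for illustration only.

```python
from itertools import combinations


def agglomerate(docs, score):
    """Greedy agglomerative merging of word sets ("topics").

    docs  : list of token lists
    score : hypothetical function (topic_a, topic_b, docs) -> float, where a
            higher value means a better merge. It is a stand-in and not the
            criterion used in the paper.
    Returns (levels, merges): levels[k] is the partition with k topics and
    merges lists the joined pairs in order, which encodes a binary tree.
    """
    vocab = sorted({w for d in docs for w in d})
    topics = [frozenset([w]) for w in vocab]  # start with one-word topics
    levels = {len(topics): list(topics)}
    merges = []
    while len(topics) > 1:
        # Join the best-scoring pair; ties go to the first pair encountered.
        a, b = max(combinations(topics, 2),
                   key=lambda pair: score(pair[0], pair[1], docs))
        topics = [t for t in topics if t not in (a, b)] + [a | b]
        merges.append((a, b))
        levels[len(topics)] = list(topics)
    return levels, merges


def cooccurrence(a, b, docs):
    # Assumed toy score: number of documents with words from both topics.
    return sum(1 for d in docs if a & set(d) and b & set(d))
```

Calling `agglomerate(docs, cooccurrence)` on a small tokenized corpus yields one partition for every topic count from the vocabulary size down to one, matching the range of solutions the abstract describes. Evaluating every pair at every step is quadratic per merge, so this form is only suitable for toy vocabularies.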
4 Citations
Effective interrelation of Bayesian nonparametric document clustering and embedded-topic modeling
The devised approach exploits Bayesian generative modeling and posterior inference, to seamlessly unify and jointly carry out the two tasks, respectively, and formulates an unprecedented interrelationship of word-embedding topics with a Dirichlet process mixture of cluster components.
Novel semantic tagging detection algorithms based non-negative matrix factorization
A novel learning tagging model called semantic non-negative matrix factorization is proposed, which introduces the use of a semantic text representation via a knowledge-based approach to extract the term-topic matrix and the topic-document matrix semantically.
Nobody Said it Would be Easy: A Decade of R&D Projects in Information Access from Thomson over Reuters to Refinitiv
A critical assessment of what academia can and cannot do for industry, and of what industry can do for research in terms of R&D efforts, is attempted in this talk.


Modeling topic hierarchies with the recursive chinese restaurant process
This work introduces the recursive Chinese restaurant process (rCRP) and a nonparametric topic model with rCRP as a prior for discovering a hierarchical topic structure with unbounded depth and width, and suggests two metrics that quantify the characteristics of a topic hierarchy to compare the discovered topic hierarchies of rCRP and nCRP.
Pachinko allocation: DAG-structured mixture models of topic correlations
Improved performance of PAM is shown in document classification, likelihood of held-out data, the ability to support finer-grained topics, and topical keyword coherence.
Hierarchical Latent Tree Analysis for Topic Detection
A new method for topic detection, where a topic is determined by identifying words that appear with high frequency in the topic and low frequency in other topics, is proposed using a hierarchy of discrete latent variables.
Hierarchical Topic Models and the Nested Chinese Restaurant Process
A Bayesian approach is taken to generate an appropriate prior via a distribution on partitions that allows arbitrarily large branching factors and readily accommodates growing data collections.
TopicNets: Visual Analysis of Large Text Corpora with Topic Modeling
A discussion of the design and implementation choices for each visual analysis technique is presented, followed by a discussion of three diverse use cases in which TopicNets enables fast discovery of information that is otherwise hard to find.
Mixtures of hierarchical topics with Pachinko allocation
Hierarchical PAM is presented, an enhancement that explicitly represents a topic hierarchy and can be seen as combining the advantages of hLDA's topical hierarchy representation with PAM's ability to mix multiple leaves of the topic hierarchy.
Hierarchical Dirichlet Processes
We consider problems involving groups of data where each observation within a group is a draw from a mixture model and where it is desirable to share mixture components between groups. We assume that
Topic-weak-correlated Latent Dirichlet allocation
  • Yi-Shiuan Tan, Zhijian Ou
  • Computer Science
    2010 7th International Symposium on Chinese Spoken Language Processing
  • 2010
Experimental results on both synthetic and real-world corpus show the superiority of the TWC-LDA over the basic LDA for semantically meaningful topic discovery and document classification.
Finding scientific topics
  • T. Griffiths, M. Steyvers
  • Computer Science, Medicine
    Proceedings of the National Academy of Sciences of the United States of America
  • 2004
A generative model for documents, introduced by Blei, Ng, and Jordan, is described, and a Markov chain Monte Carlo algorithm is presented for inference in this model, which is used to analyze abstracts from PNAS by using Bayesian model selection to establish the number of topics.
Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes
The hierarchical Dirichlet process (HDP), a nonparametric Bayesian model for clustering problems involving multiple groups of data, is proposed and experimental results are reported showing the effective and superior performance of the HDP over previous models.