Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics Too!

Suzanna Sia, Ayush Dalmia, Sabrina J. Mielke
Topic models are a useful analysis tool to uncover the underlying themes within document collections. Probabilistic models which assume a generative story have been the dominant approach for topic modeling. We propose an alternative approach based on clustering readily available pre-trained word embeddings while incorporating document information for weighted clustering and reranking top words. We provide benchmarks for the combination of different word embeddings and clustering algorithms, and…
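The idea in the abstract (cluster word vectors, weight the clustering with document information, then rerank each cluster's top words) can be illustrated with a small sketch. This is not the authors' implementation: the 2-dimensional `embeddings`, the toy `docs`, the use of raw term frequency as the weight, and the helper names `weighted_kmeans` and `top_words` are all assumptions made for illustration.

```python
import math
import random
from collections import Counter

# Hypothetical 2-d "pretrained" embeddings; real ones would be loaded from a model.
embeddings = {
    "goal": (0.9, 0.1), "match": (0.8, 0.2), "team": (0.85, 0.15), "striker": (1.0, 0.0),
    "vote": (0.1, 0.9), "election": (0.2, 0.8), "senate": (0.15, 0.85), "ballot": (0.0, 1.0),
}
# Toy corpus, used only to derive term-frequency weights (an assumed weighting scheme).
docs = [
    "team match goal goal team".split(),
    "election vote vote senate".split(),
    "match team goal striker".split(),
    "vote election ballot senate vote".split(),
]

def weighted_kmeans(vectors, weights, k, iters=20, seed=0):
    """Lloyd's algorithm in which each point pulls its centroid in proportion to its weight."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assign = [0] * len(vectors)
    dim = len(vectors[0])
    for _ in range(iters):
        # Assignment step: each word goes to its nearest centroid.
        assign = [min(range(k), key=lambda c: math.dist(v, centroids[c])) for v in vectors]
        # Update step: centroid becomes the weighted mean of its members.
        for c in range(k):
            members = [i for i, a in enumerate(assign) if a == c]
            total = sum(weights[i] for i in members)
            if total == 0:
                continue  # keep the previous centroid for an empty cluster
            centroids[c] = [
                sum(weights[i] * vectors[i][d] for i in members) / total for d in range(dim)
            ]
    return assign, centroids

words = list(embeddings)
vectors = [embeddings[w] for w in words]
tf = Counter(tok for doc in docs for tok in doc)
weights = [tf[w] for w in words]

assign, centroids = weighted_kmeans(vectors, weights, k=2)

def top_words(cluster, n_candidates=3, top_n=2):
    """Take the words nearest the centroid, then rerank those candidates by term frequency."""
    members = [i for i, a in enumerate(assign) if a == cluster]
    nearest = sorted(members, key=lambda i: math.dist(vectors[i], centroids[cluster]))
    reranked = sorted(nearest[:n_candidates], key=lambda i: -weights[i])
    return [words[i] for i in reranked[:top_n]]

topics = [top_words(c) for c in range(2)]
print(topics)
```

In this sketch the weights affect both steps: frequent words move the centroids during clustering, and the same counts decide which of the nearest candidates are shown as the topic's top words. Other weighting or reranking schemes could be swapped in by changing how `weights` is computed.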
Topic Modeling with Contextualized Word Representation Clusters
These cluster models are simple, reliable, and can perform as well as, if not better than, LDA topic models, maintaining high topic quality even when the number of topics is large relative to the size of the local collection.
Analysis and Prediction of NLP models via Task Embeddings
  • 2021
Relatedness between tasks, which is key to transfer learning, is often characterized by measuring the influence of tasks on one another during sequential or simultaneous training, with tasks being…
Cluster Analysis of Online Mental Health Discourse using Topic-Infused Deep Contextualized Representations
With mental health as a problem domain in NLP, the bulk of contemporary literature revolves around building better mental illness prediction models. The research focusing on the identification of…
Dynamic Contextualized Word Embeddings
Static word embeddings that represent words by a single vector cannot capture the variability of word meaning in different linguistic and extralinguistic contexts. Building on prior work on…
Event-Driven News Stream Clustering using Entity-Aware Contextual Embeddings
It is shown that the use of a suitable fine-tuning objective and external knowledge in pre-trained transformer models yields significant improvements in the effectiveness of contextual embeddings for clustering.
On the Power of Pre-Trained Text Representations: Models and Applications in Text Mining
Recent years have witnessed the enormous success of text representation learning in a wide range of text mining tasks. Earlier word embedding learning approaches represent words as fixed…
Superbizarre Is Not Superb: Improving BERT's Interpretations of Complex Words with Derivational Morphology
It is shown that PLMs can be interpreted as serial dual-route models, i.e., the meanings of complex words are either stored or else need to be computed from the subwords, which implies that maximally meaningful input tokens should allow for the best generalization on new words.
TAN-NTM: Topic Attention Networks for Neural Topic Modeling
A novel framework, TAN-NTM, is proposed that models a document as a sequence of tokens instead of BoW at the input layer and processes it through an LSTM, whose output is used to perform variational inference followed by BoW decoding; it achieves state-of-the-art performance on topic-aware supervised generation of keyphrases on the StackExchange and Weibo datasets.
Unsupervised Graph-based Topic Modeling from Video Transcriptions
This paper aims at developing a topic extractor on video transcriptions by exploiting neural word embeddings through a graph-based clustering method, and demonstrates the generalisability of this approach on a pure text review data set.


Reading Tea Leaves: How Humans Interpret Topic Models
New quantitative methods for measuring semantic meaning in inferred topics are presented, showing that they capture aspects of the model that are undetected by previous measures of model quality based on held-out likelihood.
Language Modeling by Clustering with Word Embeddings for Text Readability Assessment
It is argued that clustering with word embeddings in the metric space should yield feature representations in a higher semantic space appropriate for text regression, and that by representing features in terms of histograms, this approach can naturally address documents of varying lengths.
Distributed Document and Phrase Co-embeddings for Descriptive Clustering
A descriptive clustering approach is presented that employs a distributed representation model, namely the paragraph vector model, to capture semantic similarities between documents and phrases, achieving superior performance over the existing approach in both identifying clusters and assigning appropriate descriptive labels to them.
Detecting Topics in Documents by Clustering Word Vectors
A Self-Organizing Map (SOM) is used to cluster the word vectors generated by Word2Vec in order to find topics in the texts, with the words mapped to each centroid representing the topic of that cluster.
Latent Dirichlet Allocation
We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and…
GloVe: Global Vectors for Word Representation
A new global log-bilinear regression model is proposed that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
Spherical Text Embedding
This work proposes a spherical generative model based on which unsupervised word and paragraph embeddings are jointly learned in the spherical space and develops an efficient optimization algorithm with convergence guarantee based on Riemannian optimization.
Classification and Clustering of Arguments with Contextualized Word Embeddings
For the first time, it is shown how to leverage the power of contextualized word embeddings to classify and cluster topic-dependent arguments, achieving impressive results on both tasks and across multiple datasets.
Applications of Topic Models
Applications of Topic Models describes the recent academic and industrial applications of topic models and reviews their successful use by researchers to help understand fiction, non-fiction, scientific publications, and political texts.
Efficient Estimation of Word Representations in Vector Space
Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed and it is shown that these vectors provide state-of-the-art performance on the authors' test set for measuring syntactic and semantic word similarities.