Corpus ID: 235458095

Author Clustering and Topic Estimation for Short Texts

@article{Tierney2021AuthorCA,
  title={Author Clustering and Topic Estimation for Short Texts},
  author={Graham Tierney and C. Bail and A. Volfovsky},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.09533}
}
Analysis of short texts, such as social media posts, is extremely difficult because standard topic models rely on observing many document-level word co-occurrence pairs, which short documents rarely provide. Beyond topic distributions, a common downstream task is grouping the authors of these documents for subsequent analyses. Traditional models estimate the document groupings and then identify user clusters with an independent procedure. We propose a novel model that expands on Latent Dirichlet Allocation by modeling strong …
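To make the "independent procedure" concrete, here is a minimal, hypothetical sketch of the two-step baseline the abstract contrasts with: fit a topic model on the short documents, then cluster authors in a separate step. The corpus, author labels, and cluster counts are invented for illustration; this is not the paper's joint model.

```python
# Hypothetical illustration of the two-step baseline: (1) fit a topic model
# on the short documents, then (2) cluster authors in a separate step that
# ignores the topic model's uncertainty. Corpus and author labels are toy data.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

docs = ["rain again today", "election results tonight", "phone battery dies fast",
        "vote early vote often", "sunny weekend forecast", "phone screen cracked"]
authors = ["a1", "a2", "a3", "a2", "a1", "a3"]  # one author per short document

X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)
doc_topics = lda.transform(X)  # document-topic proportions

# Step 2, run independently of step 1: average each author's document-topic
# vectors, then cluster the authors with k-means.
names = sorted(set(authors))
author_topics = np.vstack([doc_topics[np.array(authors) == name].mean(axis=0)
                           for name in names])
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(author_topics)
print(dict(zip(names, clusters)))
```

Because the clustering step only sees point estimates of the author-topic proportions, any uncertainty from the topic model is discarded; a joint model avoids that disconnect.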

References

Showing 1-10 of 43 references
Unsupervised Topic Modeling for Short Texts Using Distributed Representations of Words
An unsupervised topic model for short texts that performs soft clustering over distributed representations of words using Gaussian mixture models, whose components capture the notion of latent topics. The model outperforms LDA on short texts in both subjective and objective evaluations.
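As a rough sketch of that idea (not the paper's implementation), one can fit a Gaussian mixture over word vectors and read the soft component memberships as word-topic assignments. The toy 2-d vectors below stand in for real distributed representations such as word2vec output:

```python
# Rough sketch: topics as Gaussian mixture components over word embeddings.
# The hand-made 2-d vectors are placeholders for learned word vectors.
import numpy as np
from sklearn.mixture import GaussianMixture

vocab = ["rain", "snow", "sunny", "vote", "ballot", "senate"]
vectors = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.0],   # weather-like region
                    [0.1, 0.9], [0.0, 0.8], [0.2, 0.7]])  # politics-like region

gmm = GaussianMixture(n_components=2, random_state=0).fit(vectors)
soft = gmm.predict_proba(vectors)  # soft word-to-topic assignments
for word, probs in zip(vocab, soft):
    print(word, probs.round(2))
```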
A biterm topic model for short texts
The approach can discover more prominent and coherent topics and significantly outperforms baseline methods on several evaluation metrics. The authors also find that BTM can outperform LDA even on normal-length texts, suggesting the generality and wider applicability of the new topic model.
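The core move in BTM is to model unordered word pairs (biterms) pooled across the corpus instead of sparse per-document counts. A hypothetical sketch of the biterm extraction step, with a made-up corpus:

```python
# Sketch of biterm extraction: every unordered word pair within a short
# document becomes one biterm, and biterms are pooled across the corpus.
from itertools import combinations

docs = [["rain", "today", "cold"], ["vote", "today"], ["rain", "cold"]]
biterms = [pair for doc in docs for pair in combinations(sorted(set(doc)), 2)]
print(biterms)
# BTM then assigns a topic to each biterm, so inference draws on corpus-wide
# co-occurrence pairs rather than each short document's meager counts.
```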
Topic Modeling for Short Texts with Auxiliary Word Embeddings
A simple, fast, and effective topic model for short texts, named GPU-DMM, based on the Dirichlet Multinomial Mixture model. It achieves comparable or better topic representations than state-of-the-art models, as measured by topic coherence.
A Dirichlet multinomial mixture model-based approach for short text clustering
Proposes a collapsed Gibbs sampling algorithm (GSDMM) for the Dirichlet Multinomial Mixture model for short text clustering. GSDMM can infer the number of clusters automatically with a good balance between the completeness and homogeneity of the clustering results, and it is fast to converge.
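A heavily simplified sketch of a GSDMM-style sampler is below, assuming one topic per document and no repeated words within a document (the full conditional in the paper handles repeats). The corpus, hyperparameters, and iteration count are invented:

```python
# Simplified GSDMM-style collapsed Gibbs sampler for the Dirichlet
# Multinomial Mixture (one cluster per document). Toy data; assumes no
# word repeats within a document, which the paper's full conditional handles.
import random
from collections import Counter

random.seed(0)
docs = [["rain", "cold"], ["vote", "ballot"], ["rain", "today"], ["vote", "today"]]
K, alpha, beta = 4, 0.1, 0.1            # start with more clusters than needed
V = len({w for d in docs for w in d})

z = [random.randrange(K) for _ in docs]  # initial cluster of each document
m = Counter(z)                           # documents per cluster
n = [Counter() for _ in range(K)]        # word counts per cluster
for d, k in zip(docs, z):
    n[k].update(d)

for _ in range(50):                      # Gibbs sweeps
    for i, d in enumerate(docs):
        k = z[i]; m[k] -= 1; n[k].subtract(d)   # remove doc i from its cluster
        weights = []
        for c in range(K):                      # unnormalized p(z_i = c | rest)
            p = m[c] + alpha
            total = sum(n[c].values())
            for j, w in enumerate(d):
                p *= (n[c][w] + beta) / (total + V * beta + j)
            weights.append(p)
        k = random.choices(range(K), weights)[0]  # resample cluster of doc i
        z[i] = k; m[k] += 1; n[k].update(d)
print(z)  # empty clusters die out, so the number of clusters is inferred
```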
Transferring topical knowledge from auxiliary long texts for short text clustering
Presents a novel approach that clusters short text messages via transfer learning from auxiliary long texts. The proposed topic model, Dual Latent Dirichlet Allocation (DLDA), jointly learns two sets of topics on short and long texts and couples the topic parameters to cope with potential inconsistency between the two data sets.
Short and Sparse Text Topic Modeling via Self-Aggregation
Presents a novel model that integrates topic modeling with short-text aggregation during topic inference. The aggregation is founded on the general topical affinity of texts rather than on particular heuristics, making the model readily applicable to various kinds of short texts.
The structural topic model and applied social science
We develop the Structural Topic Model, which provides a general way to incorporate corpus structure or document metadata into the standard topic model. Document-level covariates enter the model …
Combining IR and LDA Topic Modeling for Filtering Microblogs
Investigates a novel method to improve topics learned from Twitter content without modifying the basic machinery of LDA, based on a pooling process that combines an information retrieval (IR) approach with LDA.
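One simple pooling variant is sketched below: merging tweets that share a hashtag into pseudo-documents before running standard LDA. This is an illustration of the pooling idea in general, not necessarily the exact IR-based retrieval scheme in the paper, and the tweets are made up:

```python
# Sketch of pooling: merge short texts that share a hashtag into one
# pseudo-document so LDA sees longer documents with richer co-occurrences.
from collections import defaultdict

tweets = ["rain all day #weather", "cold front coming #weather",
          "debate tonight #politics", "polls open early #politics"]

pools = defaultdict(list)
for t in tweets:
    tag = next((w for w in t.split() if w.startswith("#")), "#none")
    pools[tag].append(t)

pseudo_docs = [" ".join(ts) for ts in pools.values()]
print(pseudo_docs)  # feed these to any off-the-shelf LDA implementation
```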
Structured Topic Models for Language
This thesis introduces new methods for statistically modelling text using topic models that combine latent topics with information about document structure, ranging from local sentence structure to inter-document relationships.
The Author-Topic Model for Authors and Documents
Introduces the author-topic model, a generative model for documents that extends Latent Dirichlet Allocation to include authorship information, and demonstrates applications to computing similarity between authors and the entropy of author output.
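For readers who want to try the author-topic model, gensim ships an implementation; the minimal sketch below assumes gensim is installed, and the tiny corpus and author map are invented for illustration:

```python
# Minimal sketch using gensim's AuthorTopicModel; corpus and author->doc
# mapping are toy data, and hyperparameters are left at library defaults.
from gensim.corpora import Dictionary
from gensim.models import AuthorTopicModel

texts = [["rain", "cold", "snow"], ["vote", "ballot", "senate"], ["rain", "vote"]]
author2doc = {"alice": [0], "bob": [1], "carol": [2]}  # author -> doc indices

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
model = AuthorTopicModel(corpus=corpus, num_topics=2, id2word=dictionary,
                         author2doc=author2doc)
print(model.get_author_topics("alice"))  # the author's topic distribution
```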