• Corpus ID: 235458095

Author Clustering and Topic Estimation for Short Texts

@article{Tierney2021AuthorCA,
  title={Author Clustering and Topic Estimation for Short Texts},
  author={Graham Tierney and Christopher A. Bail and Alexander Volfovsky},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.09533}
}
Analysis of short text, such as social media posts, is extremely difficult because of its inherent brevity. In addition to classifying the topics of such posts, a common downstream task is grouping the authors of these documents for subsequent analyses. We propose a novel model that expands on Latent Dirichlet Allocation by modeling strong dependence among the words in the same document, with user-level topic distributions. We also simultaneously cluster users, removing the need for post-hoc…
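
The generative idea in the abstract can be illustrated with a minimal sketch. It is not the authors' model: it assumes that "strong dependence among the words in the same document" means every word in a post shares a single topic, and that each user has one Dirichlet-distributed topic distribution. The function names, hyperparameters (`alpha`, `beta`), and the corpus layout are hypothetical, and the user-clustering step is not shown.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw one index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(n_users, posts_per_user, words_per_post,
                    n_topics, vocab_size, alpha=0.5, beta=0.5, seed=0):
    """Simulate short posts under the assumptions stated in the lead-in."""
    rng = random.Random(seed)
    # One word distribution per topic, shared across all users.
    topic_word = [sample_dirichlet([beta] * vocab_size, rng)
                  for _ in range(n_topics)]
    corpus = []
    for user in range(n_users):
        # User-level topic distribution (assumed symmetric Dirichlet prior).
        theta = sample_dirichlet([alpha] * n_topics, rng)
        for _ in range(posts_per_user):
            # Assumption: a single topic per post, so all words in the
            # post are drawn from the same topic's word distribution.
            topic = sample_categorical(theta, rng)
            words = [sample_categorical(topic_word[topic], rng)
                     for _ in range(words_per_post)]
            corpus.append({"user": user, "topic": topic, "words": words})
    return corpus


if __name__ == "__main__":
    for post in generate_corpus(n_users=2, posts_per_user=3,
                                words_per_post=5, n_topics=3, vocab_size=20):
        print(post)
```

Tying the topic to the user rather than to each post lets a user's posts pool evidence about that user's topics, which is one way a model can compensate for how few words each post contains.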

References

The structural topic model and applied social science

TLDR
The Structural Topic Model (STM), a general way to incorporate corpus structure or document metadata into the standard topic model, is developed which accommodates corpus structure through document-level covariates affecting topical prevalence and/or topical content.

Short and Sparse Text Topic Modeling via Self-Aggregation

TLDR
A novel model integrating topic modeling with short text aggregation during topic inference is presented, founded on general topical affinity of texts rather than particular heuristics, making the model readily applicable to various short texts.

Unsupervised Topic Modeling for Short Texts Using Distributed Representations of Words

TLDR
An unsupervised topic model for short texts is presented that performs soft clustering over distributed representations of words using Gaussian mixture models whose components capture the notion of latent topics; it outperforms LDA on short texts in both subjective and objective evaluation.

A biterm topic model for short texts

TLDR
The approach can discover more prominent and coherent topics and significantly outperforms baseline methods on several evaluation metrics; BTM is also found to outperform LDA even on normal texts, showing the potential generality and wider usage of the new topic model.

A Dirichlet multinomial mixture model-based approach for short text clustering

TLDR
This paper proposed a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model for short text clustering and found that GSDMM can infer the number of clusters automatically with a good balance between the completeness and homogeneity of the clustering results, and is fast to converge.

Combining IR and LDA Topic Modeling for Filtering Microblogs

Empirical study of topic modeling in Twitter

TLDR
It is shown that by training a topic model on aggregated messages, the authors can obtain a higher-quality learned model, which results in significantly better performance in two real-world classification problems.

Transferring topical knowledge from auxiliary long texts for short text clustering

TLDR
This article presents a novel approach to cluster short text messages via transfer learning from auxiliary long text data through a novel topic model - Dual Latent Dirichlet Allocation (DLDA) model, which jointly learns two sets of topics on short and long texts and couples the topic parameters to cope with the potential inconsistency between data sets.

Topic Modeling for Short Texts with Auxiliary Word Embeddings

TLDR
A simple, fast, and effective topic model for short texts, named GPU-DMM, based on the Dirichlet Multinomial Mixture model, which achieves comparable or better topic representations than state-of-the-art models, measured by topic coherence.

Structured Topic Models for Language

TLDR
This thesis introduces new methods for statistically modelling text using topic models that combine latent topics with information about document structure, ranging from local sentence structure to inter-document relationships.