Topic Modeling over Short Texts by Incorporating Word Embeddings

Jipeng Qiang, Ping Chen, Tong Wang, and Xindong Wu
Inferring topics from the overwhelming amount of short texts has become a critical but challenging problem for many content analysis tasks, such as content characterization, user interest profiling, and emerging topic detection. Based on recent results in word embeddings, which learn semantic representations of words from a large corpus, this paper introduces a novel method, the Embedding-based Topic Model (ETM), to learn latent topics from short texts.
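The semantic signal that embedding-based models such as ETM rely on can be illustrated with a minimal sketch, assuming toy 3-dimensional vectors in place of embeddings trained on a large corpus: cosine similarity identifies related word pairs that a short-text topic model can encourage to share a topic.

```python
import math

# Toy vectors; stand-ins for embeddings trained on a large corpus.
embeddings = {
    "soccer":   [0.9, 0.1, 0.0],
    "football": [0.8, 0.2, 0.1],
    "election": [0.1, 0.9, 0.3],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def related_pairs(emb, threshold=0.8):
    """Return word pairs whose cosine similarity exceeds the threshold."""
    words = sorted(emb)
    return [(w1, w2) for i, w1 in enumerate(words) for w2 in words[i + 1:]
            if cosine(emb[w1], emb[w2]) > threshold]

print(related_pairs(embeddings))  # [('football', 'soccer')]
```

The threshold and the pairwise scan are illustrative choices; published models differ in how they fold this similarity into inference (e.g., as priors or sampling biases).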

Incorporating Biterm Correlation Knowledge into Topic Modeling for Short Texts

This paper develops a novel topic model—called biterm correlation knowledge-based topic model (BCK-TM)—to infer latent topics from short texts based on recent progress in word embedding, which can represent semantic information of words in a continuous vector space.

Short Text Topic Modeling Techniques, Applications, and Performance: A Survey

This survey conducts a comprehensive review of short text topic modeling techniques proposed in the literature and presents three categories of methods, based on Dirichlet multinomial mixture, global word co-occurrences, and self-aggregation, with examples of representative approaches in each category and an analysis of their performance on various tasks.

Investigating the Efficient Use of Word Embedding with Neural-Topic Models for Interpretable Topics from Short Texts

Though auxiliary word embeddings trained on a large external corpus improve the topic coherence of short texts, an additional fine-tuning stage is needed to generate more corpus-specific topics from short-text data.

A novel topic model for documents by incorporating semantic relations between words

This paper develops a novel topic model—called Mixed Word Correlation Knowledge-based Latent Dirichlet Allocation—to infer latent topics from text corpus that mines two forms of lexical semantic knowledge based on recent progress in word embedding, which can represent semantic information of words in a continuous vector space.

GLTM: A Global and Local Word Embedding-Based Topic Model for Short Texts

A novel global and local word embedding-based topic model (GLTM) for short texts that distills semantic relatedness information between words, which can be further leveraged by the Gibbs sampler in the inference process to strengthen the semantic coherence of topics.

An Embedding-based Joint Sentiment-Topic Model for Short Texts

ELJST is an embedding-enhanced generative joint sentiment-topic model that can discover more coherent and diverse topics from short texts and helps explain users' behaviour at a more granular level.

A Detailed Survey on Topic Modeling for Document and Short Text Data

A detailed survey covering the various topic modeling techniques proposed in the last decade, which focuses on different strategies for extracting topics from social media text, where the goal is to find and aggregate topics within short texts.

A Guided Topic-Noise Model for Short Texts

The Guided Topic-Noise Model (GTM), a semi-supervised topic model designed with large domain-specific social media data sets in mind, uses a novel initialization and a new sampling algorithm, Generalized Polya Urn seed word sampling, to produce a topic set that includes expanded seed topics as well as new unsupervised topics.

Research on Improve Topic Representation over Short Text

Although the LF-DMM model incorporates word embeddings, it performs poorly on short texts, whereas the performance of DMM and BTM improves greatly when integrated with word embeddings.

ASTM: An Attentional Segmentation Based Topic Model for Short Texts

This work proposes a novel model, Attentional Segmentation based Topic Model (ASTM), to integrate both word embeddings as supplementary information and an attention mechanism that segments short text documents into fragments of adjacent words receiving similar attention.



BTM: Topic Modeling over Short Texts

This paper proposes a novel approach to short text topic modeling, referred to as the biterm topic model (BTM), which learns topics by directly modeling the generation of word co-occurrence patterns in the corpus, making the inference effective with the rich corpus-level information.
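BTM's core preprocessing step can be sketched in a few lines: a "biterm" is an unordered pair of distinct words co-occurring in the same short text, and BTM models the corpus-wide collection of these biterms directly rather than per-document word counts. This is illustrative only; the full model also infers topic assignments for biterms via Gibbs sampling.

```python
from itertools import combinations

def extract_biterms(doc_tokens):
    """All unordered word pairs from one short document."""
    return [tuple(sorted(pair)) for pair in combinations(doc_tokens, 2)]

# Two toy short documents; BTM pools biterms across the whole corpus.
corpus = [["apple", "iphone", "sale"], ["iphone", "sale"]]
biterms = [b for doc in corpus for b in extract_biterms(doc)]
print(biterms)
# [('apple', 'iphone'), ('apple', 'sale'), ('iphone', 'sale'), ('iphone', 'sale')]
```

Pooling biterms across documents is what gives the model corpus-level co-occurrence statistics that a single sparse short text cannot provide on its own.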

Incorporating Word Correlation Knowledge into Topic Modeling

A Markov Random Field regularized Latent Dirichlet Allocation model, which defines an MRF on the latent topic layer of LDA to encourage words labeled as similar to share the same topic label, and which can accommodate the subtlety that whether two words are similar depends on which topic they appear in.
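The MRF intuition can be sketched with a toy agreement score, under the assumption that a set of "similar" word-pair edges defines potentials over the latent topic layer: assignments in which linked words share a topic score higher. This is only the energy-style intuition; the paper folds such potentials into the LDA joint distribution.

```python
def mrf_agreement(topic_of, edges):
    """Fraction of similarity edges whose endpoints share a topic label."""
    agree = sum(1 for w1, w2 in edges if topic_of[w1] == topic_of[w2])
    return agree / len(edges)

# "bank" is linked to both "finance" and "river"; under this assignment
# only the finance edge is satisfied, reflecting that similarity is
# topic-dependent.
edges = [("bank", "finance"), ("bank", "river")]
topic_of = {"bank": 0, "finance": 0, "river": 1}
print(mrf_agreement(topic_of, edges))  # 0.5
```

The example shows why a topic-independent similarity constraint would be too rigid: "bank" should agree with "finance" in a finance topic but not with "river" there.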

Topic Discovery from Heterogeneous Texts

An innovative method to discover latent topics from a heterogeneous corpus including both long and short texts is presented and a new topic model based on collapsed Gibbs sampling algorithm is developed for modeling such heterogeneous texts.

Transferring topical knowledge from auxiliary long texts for short text clustering

This article presents a novel approach to cluster short text messages via transfer learning from auxiliary long text data through a novel topic model - Dual Latent Dirichlet Allocation (DLDA) model, which jointly learns two sets of topics on short and long texts and couples the topic parameters to cope with the potential inconsistency between data sets.

Short and Sparse Text Topic Modeling via Self-Aggregation

A novel model integrating topic modeling with short text aggregation during topic inference is presented, founded on general topical affinity of texts rather than particular heuristics, making the model readily applicable to various short texts.

Extended Topic Model for Word Dependency

This paper proposes a new model, Extended Global Topic Random Field (EGTRF), to model non-linear dependencies between words; it parses sentences into dependency trees, represents them as a graph, and assumes the topic assignment of a word is influenced by its adjacent words and distance-2 words.

Improving LDA topic models for microblogs via tweet pooling and automatic labeling

This paper empirically establishes that a novel method of tweet pooling by hashtags leads to a vast improvement in a variety of measures for topic coherence across three diverse Twitter datasets in comparison to an unmodified LDA baseline and a range of pooling schemes.
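The hashtag pooling scheme can be sketched as follows, under the simple assumption that tweets sharing a hashtag are concatenated into one pseudo-document, giving LDA longer and more coherent inputs than individual tweets.

```python
from collections import defaultdict

def pool_by_hashtag(tweets):
    """Group tweet texts into pseudo-documents keyed by hashtag."""
    pools = defaultdict(list)
    for text in tweets:
        for token in text.split():
            if token.startswith("#"):
                pools[token].append(text)
    return {tag: " ".join(texts) for tag, texts in pools.items()}

tweets = ["gold for team GB #olympics", "opening ceremony wow #olympics",
          "new phone day #tech"]
pools = pool_by_hashtag(tweets)
print(sorted(pools))  # ['#olympics', '#tech']
```

A tweet carrying multiple hashtags lands in multiple pools here, and tweets without hashtags are dropped; the paper's automatic labeling step, which assigns hashtag-less tweets to pools, is not reproduced in this sketch.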

TM-LDA: efficient online modeling of latent topic transitions in social media

Temporal-LDA significantly outperforms state-of-the-art static LDA models for estimating the topic distribution of new documents over time and is able to highlight interesting variations of common topic transitions, such as the differences in the work-life rhythm of cities, and factors associated with area-specific problems and complaints.

Improving Topic Coherence with Regularized Topic Models

This work proposes two methods to regularize the learning of topic models by creating a structured prior over words that reflects broad patterns in external data, making topic models more useful across a broader range of text data.

Mining topics in documents: standing on the shoulders of big data

This research proposes to learn as humans do, i.e., by retaining results learned in the past and using them to aid future learning, and it mines two forms of knowledge: must-link and cannot-link.