Topic Discovery in Massive Text Corpora Based on Min-Hashing

Authors: Gibran Fuentes-Pineda and Ivan Vladimir Meza Ruiz
Abstract: Topics have proved to be a valuable source of information for exploring, discovering, searching and representing the contents of text corpora. They have also been useful for different natural language processing tasks such as text classification, text summarization and machine translation. Most existing topic discovery approaches require the number of topics to be provided beforehand. However, an appropriate number of topics for a given corpus depends on its characteristics and is…
Unstructured Text Documents Summarization With Multi-Stage Clustering
The proposed dynamic corpus creation mechanism combines metadata with summarized extracted text and provides the systematic query-relevant corpus processing mechanism, which automatically selects the most relevant sub-corpus through dynamic path selection.
A Framework for Detecting Key Topics in Social Networks
This paper introduces the Word Segment Merging (WSM) method to identify new phrases in short texts and represent a document with the vector space model (VSM), and model the life cycle of topics for clustering and popularity computing.
An Integrated Classification Model for Massive Short Texts with Few Words
An integrated classification model is introduced to train the word vectors of massive short texts with few words to form the feature space, and then the vector representation of each instance in texts is trained based on sentence embedding.
DistSNNMF: Solving Large-Scale Semantic Topic Model Problems on HPC for Streaming Texts
A distributed version of the prior topic modeling algorithm (SNNMF), named DistSNNMF, distributed across multiple worker nodes such that the whole training process is accelerated through the cooperation with the data-parallel platform.
State of the Art Models for Fake News Detection Tasks
This paper presents state-of-the-art methods for addressing three important challenges in automated fake news detection: fake news detection, domain identification, and bot identification in tweets.


A Scalable Asynchronous Distributed Algorithm for Topic Modeling
It is shown that F+Nomad LDA significantly outperforms recent state-of-the-art topic modeling approaches on massive problems which involve millions of documents, billions of words, and thousands of topics.
Sampled Weighted Min-Hashing for Large-Scale Topic Mining
Sampled Weighted Min-Hashing generates multiple random partitions of the corpus vocabulary based on term co-occurrence and agglomerates highly overlapping inter-partition cells to produce the mined topics.
How Many Topics? Stability Analysis for Topic Models
Using a topic modeling approach based on matrix factorization, evaluations performed on a range of corpora show that this strategy can successfully guide the model selection process.
LightLDA: Big Topic Models on Modest Computer Clusters
A new, highly-efficient O(1) Metropolis-Hastings sampling algorithm, whose running cost is (surprisingly) agnostic of model size, and empirically converges nearly an order of magnitude more quickly than current state-of-the-art Gibbs samplers; a model-scheduling scheme to handle the big model challenge, where each worker machine schedules the fetch/use of sub-models as needed; and a differential data-structure for model storage.
Word Counts and Topic Models
It is shown that automated methods have different strengths that provide different opportunities, enriching, but not replacing, the range of manual content analysis methods.
Statistical topic models for multi-label document classification
The experimental results indicate that probabilistic generative models can achieve competitive multi-label classification performance compared to discriminative methods, and have advantages for datasets with many labels and skewed label frequencies.
Latent association rule cluster based model to extract topics for classification and recommendation applications
The LARCM is a non-probabilistic topic model that makes use of association rule clustering to build a document representation with low dimensionality in such a way that each feature is comprised of information concerning relations among the terms.
Polylingual Topic Models
This work introduces a polylingual topic model that discovers topics aligned across multiple languages and demonstrates its usefulness in supporting machine translation and tracking topic trends across languages.
TweetLDA: supervised topic classification and link prediction in Twitter
It is found that L-LDA generally performs as well as SVM, and it clearly outperforms SVM when training data is limited, making it an ideal classification technique for infrequent topics and for (short) profiles of moderately active users.
Reducing the sampling complexity of topic models
An algorithm that scales linearly with the number of actually instantiated topics k_d in the document; for large document collections and in structured hierarchical models k_d ≪ k, which yields an order of magnitude speedup.
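
Several entries above rely on Min-Hashing over term co-occurrence. As a rough illustration only, the sketch below shows plain (unweighted) min-hashing, not the sampled weighted variant and not the authors' implementation. It assumes each term is represented by the set of integer ids of the documents that contain it. The function names, the toy data, and the choice of a linear hash family modulo a prime are all assumptions made for this example.

```python
import random

PRIME = 2_147_483_647  # a large prime; document ids are assumed to be smaller than this


def make_seeds(num_hashes, rng):
    """Draw (a, b) pairs defining hash functions h(x) = (a*x + b) mod PRIME."""
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(doc_ids, seeds):
    """Return one min-hash value per hash function for a non-empty set of ids."""
    return [min((a * x + b) % PRIME for x in doc_ids) for a, b in seeds]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree (a Jaccard estimate)."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


rng = random.Random(0)
seeds = make_seeds(256, rng)

# Hypothetical toy data: term -> ids of the documents in which it occurs.
term_docs = {
    "goal": {1, 2, 3, 4, 5, 6},
    "match": {1, 2, 3, 4, 5, 7},
    "stock": {20, 21, 22, 23},
}
signatures = {t: minhash_signature(d, seeds) for t, d in term_docs.items()}

# "goal" and "match" share most documents, so many signature positions agree.
sim_overlap = estimate_jaccard(signatures["goal"], signatures["match"])
# "goal" and "stock" share no documents, so their signatures never agree here.
sim_disjoint = estimate_jaccard(signatures["goal"], signatures["stock"])
print(sim_overlap, sim_disjoint)
```

Terms whose signatures agree in many positions tend to occur in the same documents, which is the kind of co-occurrence signal a min-hashing approach to topic discovery builds on. The linear hash family is only approximately min-wise independent, so the estimate for overlapping sets is approximate.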