BATS: A Spectral Biclustering Approach to Single Document Topic Modeling and Segmentation

@article{wang2020bats,
  title={BATS: A Spectral Biclustering Approach to Single Document Topic Modeling and Segmentation},
  author={Sirui Wang and Yuwei Tu and Qiong Wu and Adam Hare and Zhenming Liu and Christopher G. Brinton and Yanhua Li},
  journal={ACM Transactions on Intelligent Systems and Technology (TIST)},
  year={2020},
  pages={1--29}
}
  • Published 5 August 2020
Existing topic modeling and text segmentation methodologies generally require large datasets for training, limiting their capabilities when only small collections of text are available. In this work, we reexamine the inter-related problems of “topic identification” and “text segmentation” for sparse document learning, when there is a single new text of interest. In developing a methodology to handle single documents, we face two major challenges. First is sparse information: with access to only… 
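As a point of reference for this single-document setting, segmentation can be sketched without any training data at all: score lexical cohesion between adjacent sentences and cut where the similarity dips. The sketch below is a simplified TextTiling-style baseline under toy assumptions (bag-of-words sentences, an illustrative depth threshold), not the BATS algorithm itself.

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def segment(sentences, depth_threshold=0.1):
    """Cut the document where cohesion between adjacent sentences dips."""
    bags = [Counter(s.lower().split()) for s in sentences]
    sims = [cosine(bags[i], bags[i + 1]) for i in range(len(bags) - 1)]
    # Place a boundary where the similarity is a local minimum that sits
    # at least `depth_threshold` below both neighbouring similarities.
    boundaries = []
    for i in range(1, len(sims) - 1):
        if sims[i] < sims[i - 1] and sims[i] < sims[i + 1]:
            depth = min(sims[i - 1], sims[i + 1]) - sims[i]
            if depth >= depth_threshold:
                boundaries.append(i + 1)  # break before sentence i+1
    return boundaries

doc = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the mouse hid under the mat",
    "stocks fell on market news",
    "the market rallied after earnings news",
]
print(segment(doc))  # -> [3]
```

By contrast, BATS biclusters words and sentences jointly via spectral methods rather than relying on pairwise cohesion alone.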


Equity2Vec: End-to-end Deep Learning Framework for Cross-sectional Asset Pricing
This work proposes an end-to-end deep learning framework to price assets that simultaneously leverages all the available heterogeneous alpha sources including technical indicators, financial news signals, and cross-sectional signals and monetizes the signals effectively.
Representation Learning on Spatial Networks
Spatial networks are networks whose nodes and edges are constrained by geometry and embedded in real space, which has crucial effects on their topological properties.
Rosella: A Self-Driving Distributed Scheduler for Heterogeneous Clusters
Rosella is presented, a new self-driving, distributed approach for task scheduling in heterogeneous clusters that provides high throughput and low latency simultaneously, because it runs in parallel on multiple machines with minimum coordination and only performs simple operations for each scheduling decision.


Text segmentation: A topic modeling perspective
This paper investigates the use of two unsupervised topic models, latent Dirichlet allocation (LDA) and multinomial mixture, to segment a text into semantically coherent parts, and suggests a modification to the dynamic programming (DP) segmentation algorithm that dramatically speeds up the process with no loss in performance.
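The dynamic-programming segmentation this summary refers to can be illustrated in miniature: choose the cut points that minimize total within-segment cost. The sketch below uses a toy squared-deviation cost over a numeric sequence rather than the topic-model likelihood the paper optimizes; the function name and cost are illustrative.

```python
def best_segmentation(values, k):
    """Split `values` into `k` contiguous segments minimizing the sum of
    within-segment squared deviations (an O(k * n^2) dynamic program)."""
    n = len(values)
    # Prefix sums give O(1) per-segment cost queries.
    ps = [0.0] * (n + 1)   # prefix sums
    pss = [0.0] * (n + 1)  # prefix sums of squares
    for i, v in enumerate(values):
        ps[i + 1] = ps[i] + v
        pss[i + 1] = pss[i] + v * v

    def cost(i, j):  # squared deviation of segment values[i:j]
        s, sq, m = ps[j] - ps[i], pss[j] - pss[i], j - i
        return sq - s * s / m

    INF = float("inf")
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for seg in range(1, k + 1):
        for j in range(seg, n + 1):
            for i in range(seg - 1, j):
                c = dp[seg - 1][i] + cost(i, j)
                if c < dp[seg][j]:
                    dp[seg][j] = c
                    cut[seg][j] = i
    # Walk the cut table backwards to recover the boundary indices.
    bounds, j = [], n
    for seg in range(k, 0, -1):
        j = cut[seg][j]
        bounds.append(j)
    return sorted(bounds[:-1])  # drop the leading 0

print(best_segmentation([1, 1, 1, 9, 9, 9], 2))  # -> [3]
```

The speedup discussed in the paper comes from pruning this kind of cubic-time search, not from changing the objective.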
Efficient methods for topic model inference on streaming document collections
Empirical results indicate that SparseLDA can be approximately 20 times faster than traditional LDA and provide twice the speedup of previously published fast sampling methods, while also using substantially less memory.
Topic Modeling over Short Texts by Incorporating Word Embeddings
A novel method, Embedding-based Topic Model (ETM), to learn latent topics from short texts that not only solves the problem of very limited word co-occurrence information by aggregating short texts into long pseudo-texts, but also utilizes a Markov Random Field regularized model that gives correlated words a better chance to be put into the same topic.
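The pseudo-text idea can be sketched with a toy greedy aggregator: each short text joins the most lexically similar pseudo-text, or starts a new one. ETM's actual aggregation is embedding-based; the bag-of-words similarity and threshold below are stand-ins.

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def aggregate(short_texts, threshold=0.3):
    """Greedily merge each short text into the most similar pseudo-text."""
    pseudo = []  # list of (combined word counts, member texts)
    for text in short_texts:
        bag = Counter(text.lower().split())
        best, best_sim = None, threshold
        for entry in pseudo:
            sim = cosine(bag, entry[0])
            if sim > best_sim:
                best, best_sim = entry, sim
        if best is None:
            pseudo.append((bag, [text]))
        else:
            best[0].update(bag)   # grow the pseudo-text's vocabulary
            best[1].append(text)
    return [texts for _, texts in pseudo]

groups = aggregate(["cat sat mat", "cat on mat", "stock market up", "market down"])
print(len(groups))  # -> 2
```

Longer pseudo-texts restore enough word co-occurrence for a standard topic model to work on them.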
Short and Sparse Text Topic Modeling via Self-Aggregation
A novel model integrating topic modeling with short text aggregation during topic inference is presented, founded on general topical affinity of texts rather than particular heuristics, making the model readily applicable to various short texts.
A biterm topic model for short texts
The approach can discover more prominent and coherent topics and significantly outperforms baseline methods on several evaluation metrics; BTM is found to outperform LDA even on normal texts, showing the potential generality and wider applicability of the new topic model.
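A "biterm" is simply an unordered pair of words co-occurring in the same short text; the counting step that feeds BTM's inference can be sketched as follows (variable names are illustrative):

```python
from collections import Counter
from itertools import combinations

def extract_biterms(docs):
    """Count unordered word pairs (biterms) within each short text."""
    biterms = Counter()
    for doc in docs:
        words = doc.lower().split()
        for w1, w2 in combinations(words, 2):
            biterms[tuple(sorted((w1, w2)))] += 1
    return biterms

bt = extract_biterms(["apple banana", "banana apple cherry"])
print(bt[("apple", "banana")])  # -> 2
```

Modeling these corpus-level pairs directly, instead of per-document word occurrences, is what sidesteps the sparsity of individual short texts.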
SegBot: A Generic Neural Text Segmentation Model with Pointer Network
This work proposes a generic end-to-end segmentation model called SegBot, which outperforms state-of-the-art models on both topic segmentation and elementary discourse unit (EDU) segmentation tasks.
Statistical Models for Text Segmentation
Assessment of the approach on quantitative and qualitative grounds demonstrates its effectiveness in two very different domains, Wall Street Journal news articles and television broadcast news story transcripts, using a new probabilistically motivated error metric.
Generative model-based document clustering: a comparative study
This paper presents a detailed empirical study of 12 generative approaches to text clustering, obtained by applying four types of document-to-cluster assignment strategies to each of three base models, namely mixtures of multivariate Bernoulli, multinomial, and von Mises-Fisher distributions.
MetaLDA: A Topic Model that Efficiently Incorporates Meta Information
A topic model, called MetaLDA, which is able to leverage either document or word meta information, or both of them jointly, and which achieves comparable or improved performance in terms of both perplexity and topic quality, particularly in handling sparse texts.
Top2Vec: Distributed Representations of Topics
This model requires no stop-word lists, stemming, or lemmatization and automatically finds the number of topics; the resulting topic vectors are jointly embedded with the document and word vectors, with the distance between them representing semantic similarity.
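"Jointly embedded" here means topic, document, and word vectors share one space, so a topic can be described by its nearest word vectors. A minimal nearest-neighbour sketch with made-up 2-D vectors (not real embeddings):

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def describe_topic(topic_vec, word_vecs, top_n=2):
    """The words nearest a topic vector serve as the topic's description."""
    ranked = sorted(word_vecs,
                    key=lambda w: cosine(topic_vec, word_vecs[w]),
                    reverse=True)
    return ranked[:top_n]

# Toy 2-D "embeddings"; real models use hundreds of dimensions.
word_vecs = {
    "finance": [0.9, 0.1],
    "market":  [0.8, 0.2],
    "biology": [0.1, 0.9],
}
print(describe_topic([1.0, 0.0], word_vecs))  # -> ['finance', 'market']
```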