BATS: A Spectral Biclustering Approach to Single Document Topic Modeling and Segmentation

Sirui Wang, Yuwei Tu, Qiong Wu, Adam Hare, Zhenming Liu, Christopher G. Brinton, Yanhua Li
Existing topic modeling and text segmentation methodologies generally require large datasets for training, limiting their capabilities when only small collections of text are available. In this work, we reexamine the inter-related problems of “topic identification” and “text segmentation” for sparse document learning, when there is a single new text of interest. In developing a methodology to handle single documents, we face two major challenges. First is sparse information: with access to…


Equity2Vec: End-to-end Deep Learning Framework for Cross-sectional Asset Pricing
This work proposes an end-to-end deep learning framework to price assets that simultaneously leverages all the available heterogeneous alpha sources, including technical indicators, financial news signals, and cross-sectional signals, and monetizes the signals effectively.
Rosella: A Self-Driving Distributed Scheduler for Heterogeneous Clusters
This paper presents Rosella, a new self-driving, distributed approach for task scheduling in heterogeneous clusters. It provides high throughput and low latency simultaneously because it runs in parallel on multiple machines with minimal coordination and performs only simple operations for each scheduling decision.


Text segmentation: A topic modeling perspective
This paper investigates the use of two unsupervised topic models, latent Dirichlet allocation (LDA) and multinomial mixture, to segment a text into semantically coherent parts, and suggests a modification to dynamic programming (DP) that dramatically speeds up the process with no loss in performance.
Efficient methods for topic model inference on streaming document collections
Empirical results indicate that SparseLDA can be approximately 20 times faster than traditional LDA and provide twice the speedup of previously published fast sampling methods, while also using substantially less memory.
A biterm topic model for short texts
The approach can discover more prominent and coherent topics and significantly outperforms baseline methods on several evaluation metrics. BTM is also found to outperform LDA even on normal texts, showing the potential generality and wider usage of the new topic model.
Statistical Models for Text Segmentation
Assessment of the approach on quantitative and qualitative grounds demonstrates its effectiveness in two very different domains, Wall Street Journal news articles and television broadcast news story transcripts, using a new probabilistically motivated error metric.
Generative model-based document clustering: a comparative study
This paper presents a detailed empirical study of 12 generative approaches to text clustering, obtained by applying four types of document-to-cluster assignment strategies to each of three base models, namely mixtures of multivariate Bernoulli, multinomial, and von Mises-Fisher distributions.
MetaLDA: A Topic Model that Efficiently Incorporates Meta Information
MetaLDA is a topic model that can leverage document meta information, word meta information, or both jointly. It achieves comparable or improved performance in terms of both perplexity and topic quality, particularly in handling sparse texts.
Optimizing Semantic Coherence in Topic Models
This paper introduces a novel statistical topic model, built on an automated evaluation metric, that significantly improves topic quality in a large-scale document collection from the National Institutes of Health (NIH).
Finding Semantically Valid and Relevant Topics by Association-Based Topic Selection Model
A novel personalized Association-based Topic Selection (ATS) model is developed, which can identify semantically valid and relevant topics from a set of raw topics based on the semantic relatedness between users’ preferences and the structured patterns captured in topics.
Knowledge discovery through directed probabilistic topic models: a survey
This paper surveys an important subclass, Directed Probabilistic Topic Models (DPTMs) with soft clustering abilities, and their applications for knowledge discovery in text corpora, giving basic concepts, advantages, and disadvantages in chronological order.
TopicTiling: A Text Segmentation Algorithm based on LDA
This work presents a text segmentation algorithm called TopicTiling, which is based on the well-known TextTiling algorithm and segments documents using the Latent Dirichlet Allocation topic model. It is computationally less expensive than other LDA-based segmentation methods.