Corpus ID: 233365337

Training and Domain Adaptation for Supervised Text Segmentation

Goran Glavas, Ananya Ganesh, Swapna Somasundaran
Unlike traditional unsupervised text segmentation methods, recent supervised segmentation models rely on Wikipedia as the source of large-scale segmentation supervision. These models have, however, predominantly been evaluated on the in-domain (Wikipedia-based) test sets, preventing conclusions about their general segmentation efficacy. In this work, we focus on the domain transfer performance of supervised neural text segmentation in the educational domain. To this end, we first introduce… 

Sustainable Modular Debiasing of Language Models
An extensive evaluation, encompassing three intrinsic and two extrinsic bias measures, renders ADELE very effective in bias mitigation, and it is shown that, due to its modular nature, ADELE retains fairness even after large-scale downstream training.


Two-Level Transformer and Auxiliary Coherence Modeling for Improved Text Segmentation
A novel supervised model for text segmentation with simple but explicit coherence modeling that couples the sentence-level segmentation objective with the coherence objective that differentiates correct sequences of sentences from corrupt ones and can successfully segment texts in languages unseen in training.
Text Segmentation as a Supervised Learning Task
This work formulates text segmentation as a supervised learning problem, presents a large new dataset for text segmentation that is automatically extracted and labeled from Wikipedia, and develops a segmentation model that generalizes well to unseen natural text.
Neural Text Segmentation and its Application to Sentiment Analysis
This work proposes a generic end-to-end segmentation model which first uses a bidirectional recurrent neural network to encode an input text sequence.
MultiCQA: Zero-Shot Transfer of Self-Supervised Text Matching Models on a Massive Scale
The best zero-shot transfer model considerably outperforms in-domain BERT and the previous state of the art on six benchmarks; the work also proposes combining self-supervised with supervised multi-task learning on all available source domains.
Statistical Models for Text Segmentation
Assessment of the approach on quantitative and qualitative grounds demonstrates its effectiveness in two very different domains, Wall Street Journal news articles and television broadcast news story transcripts, using a new probabilistically motivated error metric.
Applying Machine Learning to Text Segmentation for Information Retrieval
It is found that at around 70% word segmentation accuracy an over-segmentation phenomenon begins to occur which leads to a reduction in information retrieval performance, which suggests that words themselves might be too broad a notion to conveniently capture the general semantic meaning of Chinese text.
C-HTS: A Concept-based Hierarchical Text Segmentation approach
This paper proposes C-HTS, a Concept-based Hierarchical Text Segmentation approach that relies on the semantic relatedness between text constituents and on an explicit semantic representation of text, automatically extracted from massive human knowledge repositories such as Wikipedia.
Exploring Influence of Topic Segmentation on Information Retrieval Quality
A search pipeline based on text segmentation by means of the BigARTM tool and the TopicTiling algorithm is proposed, which allows one to better model text structure and therefore language itself, which influences the quality of text representation.
TopicTiling: A Text Segmentation Algorithm based on LDA
This work presents a Text Segmentation algorithm called TopicTiling, which is based on the well-known TextTiling algorithm, and segments documents using the Latent Dirichlet Allocation topic model, and is computationally less expensive than other LDA-based segmentation methods.
Hierarchical Text Segmentation from Multi-Scale Lexical Cohesion
This paper presents a novel unsupervised method for hierarchical topic segmentation that takes the form of a coordinate-ascent algorithm, iterating between two steps: a novel dynamic program for obtaining the globally-optimal hierarchical segmentation, and collapsed variational Bayesian inference over the hidden variables.