The Fewer Splits are Better: Deconstructing Readability in Sentence Splitting

Tadashi Nomoto
In this work, we focus on sentence splitting, a subfield of text simplification, primarily motivated by an unproven idea that if you divide a sentence into pieces, it should become easier to understand. Our primary goal in this paper is to determine whether this is true. In particular, we ask, does it matter whether we break a sentence into two or three? We report on our findings based on Amazon Mechanical Turk. More specifically, we introduce a Bayesian modeling framework to further… 

BiSECT: Learning to Split and Rephrase Sentences with Bitexts

A novel dataset and a new model for the ‘split and rephrase’ task are introduced. The dataset contains higher-quality training examples than the previous Split and Rephrase corpora, and models trained on BiSECT can perform a wider variety of split operations and improve upon previous state-of-the-art approaches in automatic and human evaluations.

MinWikiSplit: A Sentence Splitting Corpus with Minimal Propositions

A new sentence splitting corpus composed of 203K pairs of aligned complex source and simplified target sentences is compiled. It is useful for developing sentence splitting approaches that learn to transform sentences with a complex linguistic structure into a fine-grained representation of short sentences with a simpler and more regular structure.

Split and Rephrase

A new sentence simplification task (Split-and-Rephrase) is proposed, where the aim is to split a complex sentence into a meaning-preserving sequence of shorter sentences. Such splitting could serve as a preprocessing step that facilitates and improves the performance of parsers, semantic role labellers and machine translation systems.

Learning To Split and Rephrase From Wikipedia Edit History

It is shown that incorporating WikiSplit as training data produces a model with qualitatively better predictions that score 32 BLEU points above the prior best result on the WebSplit benchmark.

BLEU is Not Suitable for the Evaluation of Text Simplification

This paper manually compiled a sentence splitting gold standard corpus containing multiple structural paraphrases, and performed a correlation analysis with human judgments that found low or no correlation between BLEU and the grammaticality and meaning preservation parameters where sentence splitting is involved.

Text readability and intuitive simplification: A comparison of readability formulas

The results demonstrate that the Coh-Metrix L2 Reading Index performs significantly better than traditional readability formulas, suggesting that the variables used in this index are more closely aligned to the intuitive text processing employed by authors when simplifying texts.

Semantic Structural Evaluation for Text Simplification

This paper proposes the first measure to address structural aspects of text simplification, called SAMSA, which leverages recent advances in semantic parsing to assess simplification quality by decomposing the input based on its semantic structure and comparing it to the output.

Sentence Simplification with Deep Reinforcement Learning

This work addresses the simplification problem with an encoder-decoder model coupled with a deep reinforcement learning framework, and explores the space of possible simplifications while learning to optimize a reward function that encourages outputs which are simple, fluent, and preserve the meaning of the input.

Simplification or elaboration? The Effects of Two Types of Text Modifications on Foreign Language Reading Comprehension

The hypothesis that some elaborative modifications observed in oral foreigner talk discourse, where redundancy and explicitness compensate for unknown linguistic items, offer a potential alternative approach to written text modification was tested.

Creating Training Corpora for NLG Micro-Planners

This paper proposes the corpus generation framework as a novel method for creating challenging data sets from which NLG models can be learned. Such models are capable of handling the complex interactions that occur during micro-planning between lexicalisation, aggregation, surface realisation, referring expression generation and sentence segmentation.