Problems in Current Text Simplification Research: New Data Can Help

Wei Xu, Chris Callison-Burch, Courtney Napoles. Transactions of the Association for Computational Linguistics.

Simple Wikipedia has dominated simplification research in the past 5 years. In this opinion paper, we argue that focusing on Wikipedia limits simplification research. We back up our arguments with corpus analysis and by highlighting statements that other researchers have made in the simplification literature. We introduce a new simplification dataset that is a significant improvement over Simple Wikipedia, and present a novel quantitative-comparative approach to study the quality of… 

The Role of Text Simplification Operations in Evaluation

An operation-based investigation is performed, demonstrating in detail the limitations of existing simplification datasets, and recommendations for future standardised practices in the design, creation and evaluation of TS resources are made.

Lexical Simplification with Neural Ranking

A new Lexical Simplification approach exploits neural networks to learn substitutions from the Newsela corpus, a large set of professionally produced simplifications, and achieves the highest Accuracy, Precision and F1 scores to date on standard datasets for the task.

Semantic Structural Evaluation for Text Simplification

This paper proposes the first measure to address structural aspects of text simplification, called SAMSA, which leverages recent advances in semantic parsing to assess simplification quality by decomposing the input based on its semantic structure and comparing it to the output.

Improving Human Text Simplification with Sentence Fusion

A graph-based sentence fusion approach to augment human simplification is introduced, together with a reranking approach that both selects high-quality simplifications and allows targeting simplifications with varying levels of simplicity.

Elaborative Simplification: Content Addition and Explanation Generation in Text Simplification

This work introduces a new annotated dataset of 1.3K instances of elaborative simplification, analyzes how entities, ideas, and concepts are elaborated through the lens of contextual specificity, and establishes baselines for elaboration generation using large-scale pre-trained language models.

New Data is Indeed Helping Lexical Simplification

We propose the use of the Newsela corpus for Complex Word Identification, a sub-problem of Lexical Simplification and conduct an empirical evaluation by comparing it with benchmark corpora previously

Learning How to Simplify From Explicit Labeling of Complex-Simplified Text Pairs

A way to automatically identify operations in a parallel corpus is devised, and a sequence-labeling approach based on these annotations is introduced, which provides insights on the types of transformations that different approaches can model.

Document-Level Text Simplification: Dataset, Criteria and Baseline

This paper defines and investigates a new task of document-level text simplification, which aims to simplify a document consisting of multiple sentences, and proposes a new automatic evaluation metric called D-SARI that is more suitable for the document-level simplification task.

Simple-QE: Better Automatic Quality Estimation for Text Simplification

This work proposes Simple-QE, a BERT-based quality estimation (QE) model adapted from prior summarization QE work, and shows that it correlates well with human quality judgments.

Text Simplification from Professionally Produced Corpora

This work investigates the application of the recently created Newsela corpus, the largest collection of professionally written simplifications available, in TS tasks, and shows that the corpus can be used to learn sentence simplification patterns in more effective ways than corpora used in previous work.



Simple English Wikipedia: A New Text Simplification Task

A new data set is introduced that pairs English Wikipedia with Simple English Wikipedia and is orders of magnitude larger than any previously examined for sentence simplification and contains the full range of simplification operations including rewording, reordering, insertion and deletion.

Text Simplification for Information-Seeking Applications

The notion of Easy Access Sentence is defined: a unit of text from which the information it contains can be retrieved by a system with modest text-analysis capabilities, able to process single-verb sentences with named entities as constituents.

Aligning Sentences from Standard Wikipedia to Simple Wikipedia

This work improves monolingual sentence alignment for text simplification, specifically for text in standard and simple Wikipedia by using a greedy search over the document and a word-level semantic similarity score based on Wiktionary that also accounts for structural similarity through syntactic dependencies.

Syntactic Simplification for Improving Content Selection in Multi-Document Summarization

It is shown how simplifying parentheticals by removing relative clauses and appositives results in improved sentence clustering, by forcing clustering based on central rather than background information.

WikiSimple: Automatic Simplification of Wikipedia Articles

A model that simplifies documents automatically while selecting their most important content and rewriting them in a simpler style is proposed, which significantly reduces the reading difficulty, while still capturing the important content.

Generating Anaphora for Simplifying Text

An algorithm for generating referring expressions in open domains that relies on WordNet synonym and antonym sets and is believed to be the first algorithm that allows for the incremental incorporation of relations.

A survey of research on text simplification

The goal of this paper is to summarise the large interdisciplinary body of work on text simplification and highlight the most promising research directions to move the field forward.

A Monolingual Tree-based Translation Model for Sentence Simplification

A Tree-based Simplification Model (TSM) is proposed, which, to the authors' knowledge, is the first statistical simplification model covering splitting, dropping, reordering and substitution integrally.

Improving Text Simplification Language Modeling Using Unsimplified Text Data

This paper examines language modeling for text simplification and finds that a combined model using both simplified and normal English data achieves a 23% improvement in perplexity and a 24% improvement on the lexical simplification task over a model trained only on simple data.

Collecting Highly Parallel Data for Paraphrase Evaluation

A novel data collection framework is presented that produces highly parallel text data relatively inexpensively and on a large scale, allowing simple n-gram comparisons to measure both the semantic adequacy and lexical dissimilarity of paraphrase candidates.