Corpus ID: 9128245

Simple English Wikipedia: A New Text Simplification Task

@inproceedings{coster-kauchak-2011-simple,
  title={Simple English Wikipedia: A New Text Simplification Task},
  author={William Coster and David Kauchak},
  booktitle={Annual Meeting of the Association for Computational Linguistics},
  year={2011},
}
In this paper we examine the task of sentence simplification, which aims to reduce the reading complexity of a sentence by incorporating more accessible vocabulary and sentence structure. […] We provide an analysis of this corpus as well as preliminary results using a phrase-based translation approach for simplification.
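As a rough illustration of the phrase-based substitution idea, a toy simplifier might greedily rewrite source phrases using a hand-built phrase table. This is only a sketch: the paper trains a full statistical phrase-based MT system on aligned sentence pairs, and the phrase table below is invented for illustration.

```python
# Toy phrase-based simplification: greedy longest-match substitution
# using a hand-built phrase table (illustrative only; a real system
# learns phrase pairs and scores from an aligned parallel corpus).

PHRASE_TABLE = {
    ("utilize",): ("use",),
    ("in", "order", "to"): ("to",),
    ("a", "large", "number", "of"): ("many",),
}

def simplify(tokens):
    out, i = [], 0
    while i < len(tokens):
        # Try the longest phrase first so multi-word entries win.
        for n in range(min(4, len(tokens) - i), 0, -1):
            src = tuple(tokens[i:i + n])
            if src in PHRASE_TABLE:
                out.extend(PHRASE_TABLE[src])
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

print(simplify("we utilize data in order to learn".split()))
# ['we', 'use', 'data', 'to', 'learn']
```

A statistical system would additionally weigh competing phrase pairs with translation and language-model scores rather than taking the first match.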


Learning to Simplify Sentences Using Wikipedia

A new translation model for text simplification is introduced that extends a phrase-based machine translation approach to include phrasal deletion in a corpus of 137K aligned sentence pairs extracted by aligning English Wikipedia and Simple English Wikipedia.

SimpLe: Lexical Simplification using Word Sense Disambiguation

This chapter examines the process of lexical substitution, particularly the role that word sense disambiguation plays in this task. Empirical results show that the method creates simplifications that significantly reduce the reading difficulty of the input text while maintaining its grammaticality and preserving its meaning.

Japanese sentence compression using Simple English Wikipedia

This work manually explored the correspondences between articles of Japanese Wikipedia and those of Simple English Wikipedia, and then proposed a cross-lingual alignment method using a simple matching algorithm.

Aligning Sentences from Standard Wikipedia to Simple Wikipedia

This work improves monolingual sentence alignment for text simplification, specifically for text in standard and simple Wikipedia by using a greedy search over the document and a word-level semantic similarity score based on Wiktionary that also accounts for structural similarity through syntactic dependencies.

Learning a Lexical Simplifier Using Wikipedia

This paper extracts over 30K candidate lexical simplifications by identifying aligned words in a sentence-aligned corpus of English Wikipedia and Simple English Wikipedia, and ranks them with a feature-based ranker trained on a set of labeled simplifications collected using Amazon’s Mechanical Turk.

Sentence Simplification using Syntactic Parse trees

A classical approach, consisting of two separate algorithms for simplifying complex and compound sentences into their corresponding simple forms, is presented.

Is Simple English Wikipedia As Simple And Easy-to-Understand As We Expect It To Be?

This study analyzes and compares two widely used English text simplification corpora: one professionally produced (Newsela) and one collaboratively made by amateurs and enthusiasts (English Wikipedia–Simple English Wikipedia), focusing on 19 conceptual complexity features. The results indicate that the simplification operations made during the production of Simple English Wikipedia in many cases do not follow the patterns of the professionally simplified corpus.

Optimizing Statistical Machine Translation for Text Simplification

This work is the first to design automatic metrics that are effective for tuning and evaluating simplification systems, which will facilitate iterative development for this task.

Developing a Monolingual Sentence Simplification Corpus for Urdu

A lexically and syntactically simplified Urdu corpus is presented, along with a detailed analysis of the various simplification operations, to help start readability and automatic sentence simplification research.

Text Simplification without Simplified Corpora

This research proposes text simplification methods, via a lexical substitution approach and a monolingual translation approach, for languages that cannot use large-scale simplified corpora, especially Japanese. It proposes novel paraphrase acquisition, meaning preservation filtering, simplicity filtering, and grammaticality ranking methods for Japanese.

For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia

This work considers two main approaches: deriving simplification probabilities via an edit model that accounts for a mixture of different operations, and using metadata to focus on edits that are more likely to be simplification operations.
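The probability-estimation idea can be sketched by simple counting over word pairs. This is a minimal illustration assuming aligned (complex, simple) word pairs have already been harvested from revision edits; the pairs below are invented, and the actual edit model is considerably more involved.

```python
from collections import Counter, defaultdict

# Hypothetical aligned (complex, simple) word pairs harvested from
# Wikipedia revision edits (identity pairs mean the word was kept).
pairs = [
    ("utilize", "use"), ("utilize", "use"), ("utilize", "utilize"),
    ("purchase", "buy"), ("purchase", "purchase"), ("purchase", "buy"),
]

counts = defaultdict(Counter)
for complex_w, simple_w in pairs:
    counts[complex_w][simple_w] += 1

def p_simplify(complex_w, simple_w):
    # Relative-frequency estimate P(simple | complex).
    total = sum(counts[complex_w].values())
    return counts[complex_w][simple_w] / total if total else 0.0

print(round(p_simplify("utilize", "use"), 2))  # 0.67
```

The metadata-based variant would restrict the counts to edits whose revision comments suggest a simplification intent before estimating these probabilities.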

Automatic induction of rules for text simplification

Learning Simple Wikipedia: A Cogitation in Ascertaining Abecedarian Language

The potential of Simple Wikipedia to assist automatic text simplification is investigated by building a statistical classification system that discriminates simple English from ordinary English; the system can also be applied as a tool to help writers craft simple text.

Mining Wikipedia Revision Histories for Improving Sentence Compression

This work proposes a novel lexicalized noisy channel model for sentence compression, achieving improved results on the grammaticality and compression-rate criteria, with a slight decrease on the importance criterion.

Sentence Alignment for Monolingual Comparable Corpora

This work addresses the problem of sentence alignment for monolingual corpora by incorporating context into the search for an optimal alignment in two complementary ways: learning rules for matching paragraphs using topic structure and refining the matching through local alignment to find good sentence pairs.

Towards Robust Context-Sensitive Sentence Alignment for Monolingual Corpora

A new monolingual sentence alignment algorithm is presented, combining a sentence-based TF*IDF score, turned into a probability distribution using logistic regression, with a global alignment dynamic programming algorithm, achieving a substantial improvement in accuracy over existing systems.
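The TF*IDF-plus-dynamic-programming idea can be sketched as follows. This is a simplified version under stated assumptions: plain cosine similarity stands in for the paper's logistic-regression probability, skips carry no penalty, and the example sentences are invented.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    # IDF is computed over the combined sentence collection.
    docs = [Counter(s.lower().split()) for s in sentences]
    df = Counter(w for d in docs for w in d)
    n = len(docs)
    return [{w: c * math.log((1 + n) / (1 + df[w])) for w, c in d.items()}
            for d in docs]

def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def align(src, tgt, threshold=0.5):
    # Monotone global alignment maximizing summed sentence similarity;
    # skipping a sentence on either side costs nothing here (a real
    # system tunes a skip penalty and uses a learned probability).
    vecs = tfidf_vectors(src + tgt)
    vs, vt = vecs[:len(src)], vecs[len(src):]
    m, n = len(src), len(tgt)
    best = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + cosine(vs[i - 1], vt[j - 1]),
                best[i - 1][j], best[i][j - 1])
    # Backtrack, keeping only sufficiently similar pairs.
    out, i, j = [], m, n
    while i > 0 and j > 0:
        sim = cosine(vs[i - 1], vt[j - 1])
        if best[i][j] == best[i - 1][j - 1] + sim and sim >= threshold:
            out.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

src = ["the cat sat on the mat", "it was a sunny day", "dogs bark loudly"]
tgt = ["the cat sat", "dogs bark"]
print(align(src, tgt))  # [(0, 0), (2, 1)]
```

The dynamic program guarantees a monotone (order-preserving) alignment, which matches how simplified articles usually track the source article's sentence order.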

Models for Sentence Compression: A Comparison across Domains, Training Requirements and Evaluation Measures

This paper provides a novel comparison between a supervised constituent-based and a weakly supervised word-based compression algorithm, and examines how these models port to different domains (written vs. spoken text).

A Generic Sentence Trimmer with CRFs

The paper presents a novel sentence trimmer for Japanese, which combines a non-statistical yet generic tree generation model with Conditional Random Fields (CRFs). […]

Summarization beyond sentence extraction: A probabilistic approach to sentence compression

Sentence Simplification for Semantic Role Labeling

A general method is presented for learning how to iteratively simplify a sentence, decomposing complicated syntax into small, easy-to-process pieces and achieving near-state-of-the-art performance across syntactic variation.