GECToR – Grammatical Error Correction: Tag, Not Rewrite

Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem N. Chernodub, Oleksandr Skurzhanskyi
In this paper, we present a simple and efficient GEC sequence tagger using a Transformer encoder. Our system is pre-trained on synthetic data and then fine-tuned in two stages: first on errorful corpora, and second on a combination of errorful and error-free parallel corpora. We design custom token-level transformations to map input tokens to target corrections. Our best single-model/ensemble GEC tagger achieves an F_0.5 of 65.3/66.5 on CONLL-2014 (test) and F_0.5 of 72.4/73.6 on BEA-2019 (test… 
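The tag-then-apply mechanism described above can be sketched in a few lines; the edit tags ($KEEP, $DELETE, $APPEND_t, $REPLACE_t) mirror the paper's basic transformations, while the tag predictions below are hand-written stand-ins for a real tagger's output:

```python
# Minimal sketch of a tag-based GEC step: a tagger assigns one edit tag per
# source token, and the corrected sentence is produced by applying the tags.

def apply_tags(tokens, tags):
    """Apply one round of token-level edit tags to a token sequence."""
    out = []
    for token, tag in zip(tokens, tags):
        if tag == "$KEEP":
            out.append(token)
        elif tag == "$DELETE":
            continue  # drop the token
        elif tag.startswith("$APPEND_"):
            out.append(token)
            out.append(tag[len("$APPEND_"):])  # insert a new token after it
        elif tag.startswith("$REPLACE_"):
            out.append(tag[len("$REPLACE_"):])  # substitute the token
    return out

tokens = ["She", "go", "to", "school"]
tags = ["$KEEP", "$REPLACE_goes", "$KEEP", "$KEEP"]
print(" ".join(apply_tags(tokens, tags)))  # She goes to school
```

In the full system, tagging and applying are iterated for several rounds, since some corrections only become visible after earlier ones are applied.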

Improving Grammatical Error Correction for Multiword Expressions

Two systems which incorporate MWE information in two different ways are proposed: one is a multi-encoder-decoder system which encodes MWE tags in a second encoder, and the other is a BART pre-trained transformer-based system that encodes MWE representations using special tokens.

Stronger Baselines for Grammatical Error Correction Using a Pretrained Encoder-Decoder Model

The utility of bidirectional and auto-regressive transformers (BART) as a generic pretrained encoder-decoder model for GEC is explored and it is found that monolingual and multilingual BART models achieve high performance in GEC.

Data Weighted Training Strategies for Grammatical Error Correction

This work performs an empirical study to discover how best to incorporate delta-log-perplexity, a type of example scoring, into a training schedule for GEC, and performs experiments that shed light on the function and applicability of delta-log-perplexity.
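As a loose illustration of delta-log-perplexity scoring, the sketch below compares an example's log-perplexity under two model checkpoints and squashes the difference into a training weight; the sigmoid weighting is an illustrative choice, not necessarily the paper's exact schedule:

```python
import math

# Toy delta-log-perplexity example weighting. The per-token log-probabilities
# are stand-ins for scores from two real model checkpoints.

def log_perplexity(token_logprobs):
    """Mean negative log-probability of a token sequence."""
    return -sum(token_logprobs) / len(token_logprobs)

def example_weight(base_logprobs, finetuned_logprobs):
    """Weight an example by how much fine-tuning improved it (the delta)."""
    delta = log_perplexity(base_logprobs) - log_perplexity(finetuned_logprobs)
    return 1.0 / (1.0 + math.exp(-delta))  # squash into (0, 1)

# An example the fine-tuned checkpoint likes better gets a weight above 0.5.
w = example_weight([-2.0, -2.0, -3.0], [-1.0, -1.0, -2.0])
print(round(w, 3))
```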

Grammatical Error Correction with Denoising Autoencoder

This paper investigates the possibilities of using this model type for grammatical error correction, introduces a novel remark-based method of combining model checkpoint outputs, and shows that an efficient GEC model can be trained in a matter of hours on a single GPU.

A Simple Recipe for Multilingual Grammatical Error Correction

It is demonstrated that performing a single fine-tuning step on cLang-8 with off-the-shelf language models yields further accuracy improvements over an already top-performing gT5 model for English.

Hierarchical Character Tagger for Short Text Spelling Error Correction

A Hierarchical Character Tagger model, or HCTagger, is presented, which uses a pre-trained language model at the character level as a text encoder, and then predicts character-level edits to transform the original text into its error-free form with a much smaller label space.

Character Transformations for Non-Autoregressive GEC Tagging

A character-based non-autoregressive GEC approach with automatically generated character transformations is presented; character transformation models trained for Czech, German, and Russian reach solid results with a dramatic speedup compared to autoregressive systems.

Ensembling and Knowledge Distilling of Large Sequence Taggers for Grammatical Error Correction

Improvements to the GEC sequence tagging architecture are investigated with a focus on ensembling of recent cutting-edge Transformer-based encoders in Large configurations and by majority votes on span-level edits.
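The span-level majority vote mentioned above can be sketched as follows; the (start, end, replacement) edit encoding and the vote threshold are illustrative assumptions, not the paper's exact procedure:

```python
from collections import Counter

# Combine several GEC systems by majority vote on span-level edits. Each
# system emits edits as (start, end, replacement) over the source tokens;
# an edit survives if enough systems propose it.

def majority_edits(system_edits, min_votes):
    counts = Counter(edit for edits in system_edits for edit in set(edits))
    return sorted(e for e, c in counts.items() if c >= min_votes)

sys_a = [(1, 2, "goes"), (3, 3, "the")]
sys_b = [(1, 2, "goes")]
sys_c = [(1, 2, "went")]
print(majority_edits([sys_a, sys_b, sys_c], min_votes=2))  # [(1, 2, 'goes')]
```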

Multi-Class Grammatical Error Detection for Correction: A Tale of Two Systems

A new state-of-the-art binary detection system based on pre-trained ELECTRA is developed, and extended to multi-class detection using different error type tagsets derived from the ERRANT framework, which outperforms all other previous work that combines GED and GEC and achieves a new single-model NMT-based state of the art on the BEA-test benchmark.

Grammatical Error Correction: More Data with More Context

A novel approach equips a GEC model with supplemental context, allowing it to glean grammatical information from a separate plain-text corpus by using a parallel encoder to encode cross-document context before fusing the two encoders' contexts in the decoder.

Neural Grammatical Error Correction Systems with Unsupervised Pre-training on Synthetic Data

This work proposes a simple and surprisingly effective unsupervised synthetic error generation method based on confusion sets extracted from a spellchecker to increase the amount of training data.
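The confusion-set corruption idea can be sketched as follows; the hand-written confusion sets below stand in for sets extracted from a spellchecker:

```python
import random

# Generate synthetic GEC training pairs by corrupting clean text: replace
# tokens with confusable alternatives at a given error rate.

CONFUSION_SETS = {
    "their": ["there", "they're"],
    "then": ["than"],
    "affect": ["effect"],
}

def corrupt(tokens, error_rate, rng):
    """Produce an errorful source sentence from a clean target sentence."""
    out = []
    for token in tokens:
        alts = CONFUSION_SETS.get(token.lower())
        if alts and rng.random() < error_rate:
            out.append(rng.choice(alts))  # inject a plausible error
        else:
            out.append(token)
    return out

rng = random.Random(0)
clean = ["their", "plan", "was", "better", "then", "ours"]
print(corrupt(clean, error_rate=1.0, rng=rng))
```

The corrupted sentence and the original clean sentence form a synthetic (source, target) pair for pre-training.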

The BEA-2019 Shared Task on Grammatical Error Correction

This paper reports on the BEA-2019 Shared Task on Grammatical Error Correction (GEC), which introduces the Write&Improve+LOCNESS corpus, a new dataset representing a wider range of native and learner English levels and abilities.

Improving Grammatical Error Correction via Pre-Training a Copy-Augmented Architecture with Unlabeled Data

This paper proposes a copy-augmented architecture for the GEC task that copies unchanged words from the source sentence to the target sentence, and fully pre-trains the sequence-to-sequence model on unlabeled data.

Grammatical error correction in non-native English

This thesis investigates GEC for learners of English as a Second Language (ESL) as a translation task from incorrect into correct English, explores new models for developing end-to-end GEC systems for all error types, studies system performance for each error type, and examines model generalisation to different corpora.

Automatic Annotation and Evaluation of Error Types for Grammatical Error Correction

ERRANT, a grammatical ERRor ANnotation Toolkit, is designed to automatically extract edits from parallel original and corrected sentences and classify them according to a new, dataset-agnostic, rule-based framework, facilitating error type evaluation at different levels of granularity.
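An ERRANT-style extraction step can be approximated with standard sequence alignment; the sketch below uses difflib in place of ERRANT's linguistically informed alignment and omits the error-type classification:

```python
import difflib

# Align an original and a corrected token sequence and emit the differences
# as (start, end, replacement) edits over the original tokens.

def extract_edits(orig_tokens, corr_tokens):
    matcher = difflib.SequenceMatcher(a=orig_tokens, b=corr_tokens, autojunk=False)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # replace, insert, or delete
            edits.append((i1, i2, " ".join(corr_tokens[j1:j2])))
    return edits

orig = ["She", "go", "to", "school"]
corr = ["She", "goes", "to", "the", "school"]
print(extract_edits(orig, corr))  # [(1, 2, 'goes'), (3, 3, 'the')]
```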

Learning to combine Grammatical Error Corrections

An automatic way to combine black-box GEC systems is proposed: it detects the strength of a system, or of a combination of systems, per error type, improving precision and recall while directly optimizing F-score.

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

SentencePiece, a language-independent subword tokenizer and detokenizer designed for neural text processing, is presented, and it is found to achieve accuracy comparable to direct subword training from raw sentences.

Parallel Iterative Edit Models for Local Sequence Transduction

Experiments on tasks spanning GEC, OCR correction, and spell correction demonstrate that the Parallel Iterative Edit (PIE) model is an accurate and significantly faster alternative for local sequence transduction.

Edinburgh Neural Machine Translation Systems for WMT 16

This work participated in the WMT 2016 shared news translation task by building neural translation systems for four language pairs, each trained in both directions, based on an attentional encoder-decoder, using BPE subword segmentation for open-vocabulary translation with a fixed vocabulary.
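The BPE subword segmentation used here can be illustrated with a toy merge-learning loop; real systems learn merges on a large corpus up to a fixed vocabulary size:

```python
from collections import Counter

# Toy byte-pair encoding: repeatedly merge the most frequent adjacent symbol
# pair in a word-frequency vocabulary, recording the learned merges.

def learn_bpe(words, num_merges):
    """words: {tuple_of_symbols: count}. Returns (merge list, final vocab)."""
    vocab = dict(words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = count
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe({("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2}, 2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

Applying the recorded merges in order to unseen words yields an open-vocabulary segmentation with a fixed symbol inventory.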

An Empirical Study of Incorporating Pseudo Data into Grammatical Error Correction

This study investigates how the pseudo data should be generated or used in the training of grammatical error correction models with state-of-the-art performance on the CoNLL-2014 test set and the official test set of the BEA-2019 shared task.