Corpus ID: 235458516

Automatic Construction of Evaluation Suites for Natural Language Generation Datasets

Simon Mille, Kaustubh D. Dhole, Saad Mahamood, Laura Perez-Beltrachini, Varun Gangal, Mihir Kale, Emiel van Miltenburg, Sebastian Gehrmann
Machine learning approaches applied to NLP are often evaluated by summarizing their performance in a single number, for example, accuracy. Since most test sets are constructed as an i.i.d. sample from the overall data, this approach overly simplifies the complexity of language and encourages overfitting to the head of the data distribution. As such, rare language phenomena and text about underrepresented groups are not equally included in the evaluation. To encourage more in-depth model analyses…


Break, Perturb, Build: Automatic Perturbation of Reasoning Paths through Question Decomposition
This work introduces the “Break, Perturb, Build” (BPB) framework for automatic reasoning-oriented perturbation of question-answer pairs, and demonstrates the effectiveness of BPB by creating evaluation sets for three reading comprehension benchmarks, generating thousands of high-quality examples without human intervention.

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge transfer across tasks.

A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference
The Multi-Genre Natural Language Inference corpus is introduced, a dataset designed for use in the development and evaluation of machine learning models for sentence understanding, and shows that it represents a substantially more difficult task than does the Stanford NLI corpus.

WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization
A method for direct cross-lingual summarization without requiring translation at inference time is proposed by leveraging synthetic data and Neural Machine Translation as a pre-training step, which significantly outperforms the baseline approaches while being more cost-efficient during inference.

Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference
There is substantial room for improvement in NLI systems, and the HANS dataset, which contains many examples where syntactic heuristics fail, can motivate and measure progress in this area.

ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations
ASSET is a crowdsourced multi-reference corpus where each simplification was produced by executing several rewriting transformations, and it is shown that simplifications in ASSET are better at capturing characteristics of simplicity when compared to other standard evaluation datasets for the task.

How Can We Accelerate Progress Towards Human-like Linguistic Generalization?
This position paper describes and critiques the Pretraining-Agnostic Identically Distributed (PAID) evaluation paradigm, and advocates for supplementing or replacing PAID with paradigms that reward architectures that generalize as quickly and robustly as humans.

KILT: a Benchmark for Knowledge Intensive Language Tasks
It is found that a shared dense vector index coupled with a seq2seq model is a strong baseline, outperforming more tailor-made approaches for fact checking, open-domain question answering and dialogue, and yielding competitive results on entity linking and slot filling, by generating disambiguated text.

Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models…

Dynabench: Rethinking Benchmarking in NLP
It is argued that Dynabench addresses a critical need in the community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios.

Analysing Data-To-Text Generation Benchmarks
This short paper proposes a methodology for analysing data-to-text corpora used for training Natural Language Generation (NLG) systems and applies this methodology to three existing benchmarks.