The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics

Sebastian Gehrmann, Tosin P. Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Aremu Anuoluwapo, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna Clinciu, Dipanjan Das, Kaustubh D. Dhole, Wanyu Du, Esin Durmus, Ondrej Dusek, Chris C. Emezue, Varun Gangal, Cristina Garbacea, Tatsunori B. Hashimoto, Yufang Hou, Yacine Jernite, Harsh Jhamtani, Yangfeng Ji, Shailza Jolly, Mihir Kale, Dhruv Kumar, Faisal Ladhak, Aman Madaan, Mounica Maddela, Khyati Mahajan, Saad Mahamood, Bodhisattwa Prasad Majumder, Pedro Henrique Martins, Angelina McMillan-Major, Simon Mille, Emiel van Miltenburg, Moin Nadeem, Shashi Narayan, Vitaly Nikolaev, Rubungo Andre Niyongabo, Salomey Osei, Ankur P. Parikh, Laura Perez-Beltrachini, Niranjan Rao, Vikas Raunak, Juan Diego Rodriguez, Sashank Santhanam, João Sedoc, Thibault Sellam, Samira Shaikh, Anastasia Shimorina, Marco Antonio Sobrevilla Cabezudo, Hendrik Strobelt, Nishant Subramani, Wei Xu, Diyi Yang, Akhila Yerukola, Jiawei Zhou
We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. Due to this moving target, new models often still evaluate on divergent anglo-centric corpora with well-established, but flawed, metrics. This disconnect makes it challenging to identify the limitations of current models and opportunities for progress… 


Extract, Denoise, and Enforce: Evaluating and Predicting Lexical Constraints for Conditional Text Generation
This paper conducts extensive analytical experiments on a range of conditional generation tasks and proposes a framework for automatic constraint extraction, denoising, and enforcement that is shown to perform comparably or better than unconstrained generation.
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
This work extensively analyzes different retrieval models and provides several suggestions that it believes may be useful for future work, finding that performing well consistently across all datasets is challenging.
The Human Evaluation Datasheet: A Template for Recording Details of Human Evaluation Experiments in NLP
This paper presents the Human Evaluation Datasheet (HEDS), a template for recording the details of individual human evaluation experiments in Natural Language Processing (NLP), and reports on first…
Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages
This work presents Samanantar, the largest publicly available parallel corpus collection for Indic languages, and trains multilingual NMT models spanning all these languages that outperform existing models and baselines on publicly available benchmarks such as FLORES, establishing the utility of Samanantar.
Creativity and Machine Learning: A Survey
An overview of the history and the state of the art of computational creativity theories, key machine learning techniques (including generative deep learning), and corresponding automatic evaluation methods is presented.
Unifying Language Learning Paradigms
UL2 achieves SOTA performance on 50 well-established supervised NLP tasks spanning language generation, language understanding, text classification, question answering, commonsense reasoning, long text reasoning, structured knowledge grounding, and information retrieval.
Assessing the State of Self-Supervised Human Activity Recognition using Wearables
This paper assesses the progress of self-supervised HAR research by introducing a framework that performs a multi-faceted exploration of model performance; the framework is organized into three dimensions, each containing three constituent criteria, and is used to assess state-of-the-art self-supervised learning methods in a large empirical study on a curated set of nine diverse benchmarks.
Automatic Construction of Evaluation Suites for Natural Language Generation Datasets
A framework based on this idea, able to generate controlled perturbations and identify subsets in text-to-scalar, text-to-text, or data-to-text settings, is developed and applied to the GEM generation benchmark.
CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities
It is argued that by curating and analyzing large interaction datasets, the HCI community can foster more incisive examinations of LMs’ generative capabilities, and presents CoAuthor, a dataset designed for revealing GPT-3’s capabilities in assisting creative and argumentative writing.
RealTime QA: What's the Answer Right Now?
We introduce REALTIME QA, a dynamic question answering (QA) platform that announces questions and evaluates systems on a regular basis (weekly in this version). REALTIME QA inquires about the…


GLGE: A New General Language Generation Evaluation Benchmark
The General Language Generation Evaluation (GLGE), a new multi-task benchmark for evaluating the generalization capabilities of NLG models across eight language generation tasks, is presented and a leaderboard with strong baselines including MASS, BART, and ProphetNet is built.
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.
MLQA: Evaluating Cross-lingual Extractive Question Answering
This work presents MLQA, a multi-way aligned extractive QA evaluation benchmark intended to spur research in this area, and evaluates state-of-the-art cross-lingual models and machine-translation-based baselines on MLQA.
Unifying Human and Statistical Evaluation for Natural Language Generation
This paper proposes a unified framework which evaluates both diversity and quality, based on the optimal error rate of predicting whether a sentence is human- or machine-generated, called HUSE, which is efficiently estimated by combining human and statistical evaluation.
Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation
An up-to-date synthesis of research on the core tasks in NLG and the architectures adopted for them is given, highlighting a number of recent research topics that have arisen partly as a result of growing synergies between NLG and other areas of artificial intelligence.
Evaluation of Text Generation: A Survey
This paper surveys evaluation methods of natural language generation (NLG) systems that have been developed in the last few years, with a focus on the evaluation of recently proposed NLG tasks and neural NLG models.
XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation
A recent cross-lingual pre-trained model Unicoder is extended to cover both understanding and generation tasks, which is evaluated on XGLUE as a strong baseline and the base versions of Multilingual BERT, XLM and XLM-R are evaluated for comparison.
XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is introduced, a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.
GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation
This work introduces GENIE, an extensible human evaluation leaderboard, which brings the ease of leaderboards to text generation tasks and provides formal granular evaluation metrics and identifies areas for future research.