GEMv2: Multilingual NLG Benchmarking in a Single Line of Code

Sebastian Gehrmann, Abhik Bhattacharjee, Abinaya Mahendiran, Alex Wang, Alexandros Papangelis, Aman Madaan, Angelina McMillan-Major, Anna Shvets, Ashish Upadhyay, Bingsheng Yao, Bryan Wilie, Chandra Bhagavatula, Chaobin You, Craig Thomson, Cristina Garbacea, Dakuo Wang, Daniel Deutsch, Deyi Xiong, Di Jin, Dimitra Gkatzia, Dragomir R. Radev, Elizabeth Clark, Esin Durmus, Faisal Ladhak, Filip Ginter, Genta Indra Winata, Hendrik Strobelt, Hiroaki Hayashi, Jekaterina Novikova, Jenna Kanerva, Jenny Chim, Jiawei Zhou, Jordan Clive, Joshua Maynez, João Sedoc, Juraj Juraska, Kaustubh D. Dhole, Khyathi Raghavi Chandu, Leonardo F. R. Ribeiro, Lewis Tunstall, Li Zhang, Mahima Pushkarna, Mathias Creutz, Michael White, Mihir Kale, Moussa Kamal Eddine, Nico Daheim, Nishant Subramani, Ondrej Dusek, Paul Pu Liang, Pawan Sasanka Ammanamanchi, Qinqin Zhu, Ratish Puduppully, Reno Kriz, Rifat Shahriyar, Ronald Cardenas, Saad Mahamood, Salomey Osei, Samuel Cahyawijaya, Sanja Štajner, Sébastien Montella, Shailza Jolly, Simon Mille, Tahmid Hasan, Tianhao Shen, Tosin P. Adewumi, Vikas Raunak, Vipul Raheja, Vitaly Nikolaev, Vivian Tsai, Yacine Jernite, Yi Xu, Yisi Sang, Yixin Liu, Yufang Hou
Evaluation in machine learning is usually informed by past choices, for example which datasets or metrics to use. This standardization enables comparison on equal footing using leaderboards, but the evaluation choices become sub-optimal as better alternatives arise. This problem is especially pertinent in natural language generation, which requires ever-improving suites of datasets, metrics, and human evaluation to make definitive claims. To make following best model evaluation practices…
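The title's central claim is that any GEM benchmark dataset can be loaded with a single line of code. A minimal sketch of that usage, assuming the `datasets` library is installed and that GEM datasets are published under the `GEM/` namespace on the Hugging Face Hub (the dataset name used here is illustrative):

```python
def gem_hub_id(name: str) -> str:
    # Assumption: GEM datasets live under the "GEM/" namespace on the Hub.
    return "GEM/" + name

def load_gem(name: str):
    # The "single line" of the title: one load_dataset call per benchmark task.
    # Requires `pip install datasets` and network access when actually invoked.
    from datasets import load_dataset
    return load_dataset(gem_hub_id(name))
```

Under these assumptions, `load_gem("wiki_lingua")` would fetch the WikiLingua dataset referenced below in one call.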
12 Citations


Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation

A modified summarization salience protocol, Atomic Content Units (ACUs), is proposed; it relies on fine-grained semantic units and allows for high inter-annotator agreement. The protocol has important implications for evaluating large language models (LLMs), as it shows that LLMs adjusted by human feedback may overfit unconstrained human evaluation.

NusaCrowd: Open Source Initiative for Indonesian NLP Resources

NusaCrowd’s data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and its local languages, including opening access to previously non-public resources.

Operationalizing Specifications, In Addition to Test Sets for Evaluating Constrained Generative Models

It is argued that the scale of generative models could be exploited to raise the abstraction level at which evaluation itself is conducted, and recommendations are provided based on leveraging specifications as a powerful instrument to evaluate generation quality.

RealTime QA: What's the Answer Right Now?

It is found that GPT-3 tends to return outdated answers when retrieved documents do not provide sufficient information to provide an answer, suggesting an important avenue for future research: can an open-domain QA system identify such unanswerable cases and communicate with the user, or even with the retrieval module, to modify the retrieval results?

Petals: Collaborative Inference and Fine-tuning of Large Models

Petals is proposed: a system for collaborative inference and fine-tuning of large models that joins the resources of multiple parties trusted to process a client's data, demonstrating that this strategy outperforms offloading for very large models.

A Major Obstacle for NLP Research: Let’s Talk about Time Allocation!

This paper demonstrates that, in recent years, subpar time allocation has been a major obstacle for NLP research, and proposes remedies to improve the status quo.

CiteBench: A benchmark for Scientific Citation Text Generation

This paper proposes CiteBench: a benchmark for citation text generation that unifies the previous datasets and enables standardized evaluation of citation text generation models across task settings and domains.

Evaluating Human-Language Model Interaction

A framework, Human-AI Language-based Interaction Evaluation (H-LINE), is developed that expands non-interactive evaluation along three dimensions, capturing the interactive process rather than only the system's output, as well as notions of preference beyond quality.

BanglaParaphrase: A High-Quality Bangla Paraphrase Dataset

This work presents BanglaParaphrase, a high-quality synthetic Bangla Paraphrase dataset curated by a novel filtering pipeline, which ensures quality by preserving both semantics and diversity, making it particularly useful to enhance other Bangla datasets.

Evaluation for Change

Evaluation is the central means for assessing, understanding, and communicating about NLP models. In this position paper, we argue evaluation should be more than that: it is a force for driving…

References
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics

GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics, is introduced, and the data for the 2021 shared task at the associated GEM Workshop is described.

Automatic Construction of Evaluation Suites for Natural Language Generation Datasets

A framework based on this idea is developed that can generate controlled perturbations and identify subsets in text-to-scalar, text-to-text, or data-to-text settings, and it is applied to the GEM generation benchmark.

WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization

A method for direct cross-lingual summarization without requiring translation at inference time is proposed, leveraging synthetic data and Neural Machine Translation as a pre-training step; it significantly outperforms the baseline approaches while being more cost-efficient during inference.

BLEURT: Learning Robust Metrics for Text Generation

BLEURT, a learned evaluation metric for English based on BERT, can model human judgment with a few thousand possibly biased training examples and yields superior results even when the training data is scarce and out-of-distribution.

Small but Mighty: New Benchmarks for Split and Rephrase

It is found that the widely used benchmark dataset universally contains easily exploitable syntactic cues caused by its automatic generation process, and it is shown that even a simple rule-based model can perform on par with the state-of-the-art model.

NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation

NL-Augmenter is presented, a new participatory Python-based natural language augmentation framework which supports the creation of both transformations and data splits according to specific features, and demonstrates the robustness of popular natural language models using several of its transformations.

Creating Training Corpora for NLG Micro-Planners

This paper proposes its corpus generation framework as a novel method for creating challenging datasets from which NLG models can be learned that are capable of handling the complex interactions occurring during micro-planning between lexicalisation, aggregation, surface realisation, referring expression generation, and sentence segmentation.

XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages

XL-Sum, a comprehensive and diverse dataset comprising 1 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics, is presented; it is the largest abstractive summarization dataset in terms of the number of samples collected from a single source and the number of languages covered.

ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations

ASSET is a crowdsourced multi-reference corpus where each simplification was produced by executing several rewriting transformations, and it is shown that simplifications in ASSET are better at capturing characteristics of simplicity when compared to other standard evaluation datasets for the task.

Naver Labs Europe’s Systems for the Document-Level Generation and Translation Task at WNGT 2019

This work proposes to leverage data from both machine translation and natural language generation tasks and do transfer learning between MT, NLG, and MT with source-side metadata (MT+NLG), and outperforms the previous state of the art on the Rotowire NLG task.