Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand

Authors: Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Lavinia Dunagan, Jacob Daniel Morrison, Alexander R. Fabbri, Yejin Choi, Noah A. Smith
Venue: North American Chapter of the Association for Computational Linguistics
Natural language processing researchers have identified limitations of evaluation methodology for generation tasks, with new questions raised about the validity of automatic metrics and of crowdworker judgments. Meanwhile, efforts to improve generation models tend to depend on simple n-gram overlap metrics (e.g., BLEU, ROUGE). We argue that new advances on models and metrics should each more directly benefit and inform the other. We therefore propose a generalization of leaderboards… 

GENIE: Toward Reproducible and Standardized Human Evaluation for Text Generation

This work considers design choices for the annotation interface used to elicit human judgments and their impact on reproducibility, and develops an automated mechanism for maintaining annotator quality via a probabilistic model that detects and excludes noisy annotators.

Towards a Unified Multi-Dimensional Evaluator for Text Generation

This paper re-frames NLG evaluation as a Boolean Question Answering (QA) task: by guiding the model with different questions, a single evaluator can score text along multiple dimensions. It also introduces an intermediate learning phase that enables UniEval to incorporate external knowledge from multiple related tasks and gain further improvement.

Near-Negative Distinction: Giving a Second Life to Human Evaluation Datasets

This paper proposes Near-Negative Distinction (NND), a new and simple automatic evaluation method for NLG that repurposes prior human annotations into NND tests, giving those annotations a second life and providing low-cost NLG evaluation.

DEMETR: Diagnosing Evaluation Metrics for Translation

DEMETR is a diagnostic dataset with 31K English examples for evaluating the sensitivity of MT evaluation metrics to 35 different linguistic perturbations spanning semantic, syntactic, and morphological error categories. Learned metrics are found to perform substantially better than string-based metrics on DEMETR.

GEMv2: Multilingual NLG Benchmarking in a Single Line of Code

GEMv2, the new version of the Generation, Evaluation, and Metrics Benchmark, introduces a modular infrastructure for dataset, model, and metric developers to benefit from each other's work.

Twist Decoding: Diverse Generators Guide Each Other

This work introduces Twist decoding, a simple and general text generation algorithm that benefits from diverse models at inference time and consistently outperforms the popular reranking heuristic where output candidates from one model are rescored by another.

A global analysis of metrics used for measuring performance in natural language processing

The results suggest that the large majority of natural language processing metrics currently in use have properties that may result in an inadequate reflection of a model's performance, and that ambiguities and inconsistencies in the reporting of metrics may lead to difficulties in interpreting and comparing model performances, impairing transparency and reproducibility in NLP research.

Transparent Human Evaluation for Image Captioning

THumB, a rubric-based human evaluation protocol for image captioning models, is established and results reveal that CLIPScore, a recent metric that uses image features, better correlates with human judgments than conventional text-only metrics because it is more sensitive to recall.

On the Blind Spots of Model-Based Evaluation Metrics for Text Generation

This work designs and synthesizes a wide range of potential errors and checks whether they result in a drop in the metric scores. It also investigates the reasons behind these blind spots and suggests practical workarounds for a more reliable evaluation of text generation.

Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation

A modified summarization salience protocol, Atomic Content Units (ACUs), is proposed; it relies on fine-grained semantic units and allows for high inter-annotator agreement. The work has important implications for evaluating large language models (LLMs), as it shows that LLMs adjusted by human feedback may overfit unconstrained human evaluation.



GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation

This work introduces GENIE, an extensible human evaluation leaderboard, which brings the ease of leaderboards to text generation tasks and provides formal granular evaluation metrics and identifies areas for future research.

Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation

This work proposes an evaluation methodology grounded in explicit error analysis, based on the Multidimensional Quality Metrics (MQM) framework, and carries out the largest MQM research study to date, scoring the outputs of top systems from the WMT 2020 shared task in two language pairs using annotations provided by professional translators with full document context.

SummEval: Re-evaluating Summarization Evaluation

This work re-evaluates 14 automatic evaluation metrics in a comprehensive and consistent fashion using neural summarization model outputs along with expert and crowd-sourced human annotations. It also implements and shares a toolkit that provides an extensible and unified API for evaluating summarization models across a broad range of automatic metrics.

Sentence Mover’s Similarity: Automatic Evaluation for Multi-Sentence Texts

This work introduces methods based on sentence mover’s similarity, and finds that sentence-based metrics correlate with human judgments significantly better than ROUGE, both on machine-generated summaries and human-authored essays.

Answers Unite! Unsupervised Metrics for Reinforced Summarization Models

This work explores and proposes alternative evaluation measures; its human-evaluation analysis shows that the proposed metrics, based on Question Answering, compare favorably to ROUGE, with the additional property of not requiring reference summaries.

SPICE: Semantic Propositional Image Caption Evaluation

There is considerable interest in the task of automatically generating image captions. However, evaluation is challenging. Existing automatic evaluation metrics are primarily sensitive to n-gram

BLEURT: Learning Robust Metrics for Text Generation

BLEURT, a learned evaluation metric for English based on BERT, can model human judgment with a few thousand possibly biased training examples and yields superior results even when the training data is scarce and out-of-distribution.

The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics

GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics, is introduced, and the data for the 2021 shared task at the associated GEM Workshop is described.

Fine-Tuning Language Models from Human Preferences

This paper builds on advances in generative pretraining of language models to apply reward learning to four natural language tasks: continuing text with positive sentiment or physically descriptive language, and summarization tasks on the TL;DR and CNN/Daily Mail datasets.

BLEU Might Be Guilty but References Are Not Innocent

This paper develops a paraphrasing task for linguists to perform on existing reference translations, which counteracts the bias of typical references toward translationese. It reveals that multi-reference BLEU does not improve the correlation for high-quality output, and presents an alternative multi-reference formulation that is more effective.