Corpus ID: 231632593

GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation

@article{Khashabi2021GENIEAL,
  title={GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation},
  author={Daniel Khashabi and Gabriel Stanovsky and Jonathan Bragg and Nicholas Lourie and Jungo Kasai and Yejin Choi and Noah A. Smith and Daniel S. Weld},
  journal={ArXiv},
  year={2021},
  volume={abs/2101.06561}
}
Leaderboards have eased model development for many NLP datasets by standardizing their evaluation and delegating it to an independent external repository. Their adoption, however, is so far limited to tasks which can be reliably evaluated in an automatic manner. This work introduces GENIE, an extensible human evaluation leaderboard, which brings the ease of leaderboards to text generation tasks. GENIE automatically posts leaderboard submissions to crowdsourcing platforms asking human annotators… 
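
The abstract only sketches the submission flow; the following is a minimal, hypothetical illustration of the general human-in-the-loop pattern it describes (a model submission is turned into crowdsourcing tasks, and annotator ratings are aggregated into a leaderboard score). The names Submission, make_annotation_tasks, and aggregate_scores are invented for this sketch and are not GENIE's actual API.

# Hypothetical sketch of a human-in-the-loop evaluation flow; the class and
# function names are illustrative and are NOT GENIE's actual API.
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Submission:
    model_name: str
    task: str                      # e.g. "summarization", "question answering"
    predictions: Dict[str, str]    # instance id -> generated text


def make_annotation_tasks(sub: Submission, instances: Dict[str, str]) -> List[dict]:
    """Pair each generated text with its source so crowdworkers can rate it."""
    return [
        {
            "instance_id": iid,
            "source": instances[iid],
            "generation": text,
            "question": "How fluent and correct is this text? (1-5)",
        }
        for iid, text in sub.predictions.items()
    ]


def aggregate_scores(ratings: Dict[str, List[int]]) -> float:
    """Average per-instance Likert ratings, then average over instances."""
    return mean(mean(r) for r in ratings.values())


# Usage: after crowdworkers return ratings, a single leaderboard score is computed.
if __name__ == "__main__":
    sub = Submission("my-model", "summarization", {"ex1": "A short summary."})
    tasks = make_annotation_tasks(sub, {"ex1": "A long source document ..."})
    print(aggregate_scores({"ex1": [4, 5, 4]}))  # -> 4.33...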

Citations of this paper

The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
TLDR
GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics, is introduced, and the data for the 2021 shared task at the associated GEM Workshop is described.
ExplainaBoard: An Explainable Leaderboard for NLP
TLDR
A new conceptualization and implementation of NLP evaluation: the ExplainaBoard, which in addition to inheriting the functionality of the standard leaderboard, also allows researchers to diagnose strengths and weaknesses of a single system and interpret relationships between multiple systems.
SummVis: Interactive Visual Analysis of Models, Data, and Evaluation for Text Summarization
TLDR
SummVis, an open-source tool for visualizing abstractive summaries that enables fine-grained analysis of the models, data, and evaluation metrics associated with text summarization, is introduced.
Scarecrow: A Framework for Scrutinizing Machine Text
TLDR
Humans have more difficulty spotting errors in higher-quality text; accounting for this difference dramatically increases the gap between model-authored and human-authored text.
All That’s ‘Human’ Is Not Gold: Evaluating Human Evaluation of Generated Text
TLDR
The role untrained human evaluations play in NLG evaluation is examined, and three approaches for quickly training evaluators to better identify GPT-3-authored text are explored; it is found that while evaluation accuracy improved up to 55%, it did not significantly improve across the three domains.
TellMeWhy: A Dataset for Answering Why-Questions in Narratives
TLDR
This work introduces TellMeWhy, a new crowd-sourced dataset that consists of more than 30k questions and free-form answers concerning why characters in short narratives perform the actions described, and shows that state-of-the-art models are far below human performance on answering such questions.
Zero-Shot Controlled Generation with Encoder-Decoder Transformers
TLDR
This work proposes novel approaches for controlling encoder-decoder, transformer-based NLG models in a zero-shot fashion by introducing three control knobs, namely attention biasing, decoder mixing, and context augmentation, that are applied to these models at generation time, and shows that not only are these NLG models robust to such manipulations, but their behavior can also be controlled without an impact on their generation performance.
GooAQ: Open Question Answering with Diverse Answer Types
TLDR
GOOAQ is presented, a large-scale dataset collected from Google questions and answers, containing 3 million questions with diverse answer types ranging from factual short answers to snippets to collections, and it is shown that 94% of the mined answers are accurate, enabling fine-tuning of a pre-trained language model for answering GOOAQ questions.
Think you have Solved Direct-Answer Question Answering? Try ARC-DA, the Direct-Answer AI2 Reasoning Challenge
TLDR
The ARC-DA dataset is presented, a direct-answer (“open response”, “freeform”) version of the ARC (AI2 Reasoning Challenge) multiple-choice dataset, one of the first DA datasets of natural questions that often require reasoning, and where appropriate question decompositions are not evident from the questions themselves.
Methods for the Design and Evaluation of HCI+NLP Systems
TLDR
Five methodological proposals at the intersection of HCI and NLP are presented and situated in the context of ML-based NLP models.
...

References

SHOWING 1-10 OF 67 REFERENCES
Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale
TLDR
These experiments show that metrics usually prefer system outputs to human-authored texts, can be insensitive to correct translations of rare words, and can yield surprisingly high scores when given a single sentence as system output for the entire test set.
Evaluation of Text Generation: A Survey
TLDR
This paper surveys evaluation methods of natural language generation (NLG) systems that have been developed in the last few years, with a focus on the evaluation of recently proposed NLG tasks and neural NLG models.
HighRES: Highlight-based Reference-less Evaluation of Summarization
TLDR
A novel approach for manual evaluation, Highlight-based Reference-less Evaluation of Summarization (HighRES), in which summaries are assessed by multiple annotators against the source document via manually highlighted salient content in the latter, which improves inter-annotator agreement in comparison to using the source documents directly.
PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
TLDR
This work proposes pre-training large Transformer-based encoder-decoder models on massive text corpora with a new self-supervised objective, PEGASUS, and demonstrates that it achieves state-of-the-art performance on all 12 downstream datasets as measured by ROUGE scores.
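
PEGASUS's self-supervised objective (gap-sentence generation) removes salient sentences from the input document and trains the model to generate them; the paper scores saliency with ROUGE against the rest of the document. Below is a rough sketch of that data-construction step only, with a crude word-overlap proxy standing in for ROUGE and all names invented for illustration.

# Rough sketch of gap-sentence data construction in the spirit of PEGASUS.
# PEGASUS scores sentences with ROUGE; here a simple word-overlap proxy is used,
# and all names are illustrative rather than taken from the released code.
from typing import List, Tuple

MASK_TOKEN = "<mask_sent>"


def _overlap(sentence: str, others: List[str]) -> float:
    """Crude saliency proxy: word overlap between a sentence and the rest."""
    sent_words = set(sentence.lower().split())
    other_words = set(" ".join(others).lower().split())
    return len(sent_words & other_words) / max(len(sent_words), 1)


def make_gap_sentence_example(sentences: List[str], gap_ratio: float = 0.3) -> Tuple[str, str]:
    """Mask the most 'salient' sentences in the input; the target is to generate them."""
    n_gaps = max(1, int(len(sentences) * gap_ratio))
    scores = [
        _overlap(s, sentences[:i] + sentences[i + 1:]) for i, s in enumerate(sentences)
    ]
    selected = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n_gaps]
    source = " ".join(MASK_TOKEN if i in selected else s for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in sorted(selected))
    return source, target


# Example: the masked sentence becomes the generation target.
src, tgt = make_gap_sentence_example([
    "The cat sat on the mat.",
    "The cat is gray.",
    "Dogs bark loudly.",
])
print(src)
print(tgt)
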
Unifying Human and Statistical Evaluation for Natural Language Generation
TLDR
This paper proposes a unified framework which evaluates both diversity and quality, based on the optimal error rate of predicting whether a sentence is human- or machine-generated, called HUSE, which is efficiently estimated by combining human and statistical evaluation.
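
HUSE estimates how well an optimal classifier could separate human text from machine text given two features per sentence, a human judgment and a model probability. The snippet below is a simplified sketch of that estimation step using leave-one-out nearest neighbors in scikit-learn; it assumes the per-sentence features have already been collected and is not the authors' reference implementation.

# Simplified sketch of a HUSE-style score: estimate the leave-one-out error of a
# k-NN classifier that separates human- from machine-generated sentences using
# two features (a human judgment and a length-normalized model log-probability).
# This is an illustration, not the authors' reference implementation.
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def huse_style_score(human_feats: np.ndarray, model_feats: np.ndarray, k: int = 3) -> float:
    """Return twice the discrimination error estimated by leave-one-out k-NN."""
    X = np.vstack([human_feats, model_feats])          # (n, 2) feature matrix
    y = np.array([1] * len(human_feats) + [0] * len(model_feats))
    knn = KNeighborsClassifier(n_neighbors=k)
    acc = cross_val_score(knn, X, y, cv=LeaveOneOut()).mean()
    return 2.0 * (1.0 - acc)   # near 1.0: indistinguishable; near 0.0: trivially separable


# Toy usage with random features; real features come from annotators and the model.
rng = np.random.default_rng(0)
print(huse_style_score(rng.normal(0, 1, (20, 2)), rng.normal(0.2, 1, (20, 2))))
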
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
TLDR
This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
TLDR
A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.
ChatEval: A Tool for Chatbot Evaluation
TLDR
ChatEval, a unified framework for human evaluation of chatbots that augments existing tools and provides a web-based hub for researchers to share and compare their dialog systems, is introduced, along with open-source baseline models and evaluation datasets.
MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics
TLDR
LERC, a Learned Evaluation metric for Reading Comprehension, is trained to mimic human judgement scores; it achieves 80% accuracy and outperforms baselines by 14 to 26 absolute percentage points while leaving significant room for improvement.
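
LERC itself fine-tunes a pretrained encoder on MOCHA's human judgement scores; as a toy stand-in, the sketch below shows only the general recipe of regressing human scores from features of (reference, candidate) pairs, using TF-IDF and ridge regression instead of a pretrained model. The data and names are illustrative only.

# Toy stand-in for a learned evaluation metric: regress human judgement scores
# from features of (reference, candidate) pairs. LERC fine-tunes a pretrained
# encoder instead; this sketch only illustrates the general recipe.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set: (reference answer + candidate answer, human score 1-5).
pairs = [
    ("the war ended in 1945 [SEP] it ended in 1945", 5.0),
    ("the war ended in 1945 [SEP] it started in 1939", 2.0),
    ("paris is the capital [SEP] the capital is paris", 5.0),
    ("paris is the capital [SEP] i do not know", 1.0),
]
texts, scores = zip(*pairs)

metric = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), Ridge(alpha=1.0))
metric.fit(texts, scores)

# Score a new (reference, candidate) pair; higher predicted value ~ better answer.
print(metric.predict(["the war ended in 1945 [SEP] it ended in 1944"]))
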
Abstractive Summarization of Reddit Posts with Multi-level Memory Networks
TLDR
This work collects the Reddit TIFU dataset, consisting of 120K posts from the online discussion forum Reddit, and proposes a novel abstractive summarization model named multi-level memory networks (MMN), equipped with multi-level memory to store the information of text from different levels of abstraction.
...