GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation
@article{Khashabi2021GENIEAL,
  title={GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation},
  author={Daniel Khashabi and Gabriel Stanovsky and Jonathan Bragg and Nicholas Lourie and Jungo Kasai and Yejin Choi and Noah A. Smith and Daniel S. Weld},
  journal={ArXiv},
  year={2021},
  volume={abs/2101.06561}
}
Leaderboards have eased model development for many NLP datasets by standardizing their evaluation and delegating it to an independent external repository. Their adoption, however, is so far limited to tasks which can be reliably evaluated in an automatic manner. This work introduces GENIE, an extensible human evaluation leaderboard, which brings the ease of leaderboards to text generation tasks. GENIE automatically posts leaderboard submissions to crowdsourcing platforms asking human annotators…
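To make the workflow in the abstract concrete, below is a minimal, hypothetical Python sketch of a GENIE-style pipeline: a leaderboard receives a model's predictions, turns each prediction into a crowdsourcing task, and aggregates the resulting human ratings into a single score. All names here (Submission, post_to_crowd_platform, aggregate, and so on) are illustrative assumptions, not GENIE's actual code or API.

```python
# Hypothetical sketch of a human-in-the-loop evaluation pipeline of the kind
# the abstract describes: submission -> crowd annotation tasks -> aggregated score.
# Names are illustrative only and do not reflect GENIE's real implementation.

from dataclasses import dataclass
from statistics import mean
from typing import Callable

@dataclass
class Submission:
    """A leaderboard submission: one generated text per test instance."""
    model_name: str
    predictions: dict[str, str]  # instance_id -> generated text

@dataclass
class Judgment:
    """A single human rating collected from the crowdsourcing platform."""
    instance_id: str
    worker_id: str
    score: float  # e.g., a 1-5 quality rating

def post_to_crowd_platform(submission: Submission,
                           collect: Callable[[str, str], list[Judgment]]) -> list[Judgment]:
    """Create one annotation task per prediction and gather the ratings.

    `collect` stands in for the real crowdsourcing call (e.g., posting a task);
    it is injected so this sketch stays runnable without external services.
    """
    judgments: list[Judgment] = []
    for instance_id, text in submission.predictions.items():
        judgments.extend(collect(instance_id, text))
    return judgments

def aggregate(judgments: list[Judgment]) -> float:
    """Average per-instance mean ratings into a single leaderboard score."""
    by_instance: dict[str, list[float]] = {}
    for j in judgments:
        by_instance.setdefault(j.instance_id, []).append(j.score)
    return mean(mean(scores) for scores in by_instance.values())

if __name__ == "__main__":
    # Toy usage: two test instances, three simulated annotators each.
    sub = Submission("baseline-model", {"q1": "Paris is the capital.", "q2": "It rained."})
    fake_ratings = {"q1": [4, 5, 4], "q2": [3, 3, 2]}
    collect = lambda iid, _text: [Judgment(iid, f"w{i}", s)
                                  for i, s in enumerate(fake_ratings[iid])]
    print(f"Leaderboard score: {aggregate(post_to_crowd_platform(sub, collect)):.2f}")
```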
35 Citations
All That’s ‘Human’ Is Not Gold: Evaluating Human Evaluation of Generated Text
- Psychology, ACL
- 2021
The role untrained human evaluations play in NLG evaluation is examined, and three approaches for quickly training evaluators to better identify GPT3-authored text are explored; while evaluation accuracy improved up to 55%, it did not improve significantly across the three domains.
Think you have Solved Direct-Answer Question Answering? Try ARC-DA, the Direct-Answer AI2 Reasoning Challenge
- Computer Science, ArXiv
- 2021
The ARC-DA dataset is presented, a direct-answer (“open response”, “freeform”) version of the ARC (AI2 Reasoning Challenge) multiple-choice dataset, one of the first DA datasets of natural questions that often require reasoning, and where appropriate question decompositions are not evident from the questions themselves.
Methods for the Design and Evaluation of HCI+NLP Systems
- Computer Science, HCINLP
- 2021
Five methodological proposals at the intersection of HCI and NLP are presented and situated in the context of ML-based NLP models.
Control Prefixes for Parameter-Efficient Text Generation
- Computer Science
- 2021
A dynamic method, Control Prefixes, is proposed, which allows for the inclusion of conditional, input-dependent information, combining the benefits of prompt tuning and controlled generation, and can even outperform full fine-tuning methods.
How to Evaluate Your Dialogue Models: A Review of Approaches
- Computer Science, ArXiv
- 2021
This survey, which seeks an explicit and comprehensive analysis of existing methods, divides evaluation methods into three classes: automatic evaluation, human-involved evaluation, and user-simulator-based evaluation.
How much coffee was consumed during EMNLP 2019? Fermi Problems: A New Reasoning Challenge for AI
- Computer Science, EMNLP
- 2021
Several unsolved AI problems are crystallized into a single, new challenge, namely Fermi Problems (FPs), which are questions whose answers can only be approximately estimated because their precise computation is either impractical or impossible.
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
- Computer Science, GEM
- 2021
GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics, is introduced, and the data for the 2021 shared task at the associated GEM Workshop are described.
Findings of the 2021 Conference on Machine Translation (WMT21)
- Computer Science, WMT
- 2021
This paper presents the results of the news translation task, the multilingual low-resource translation for Indo-European languages, the triangular translation task, and the automatic post-editing task…
A Survey of Knowledge-Enhanced Text Generation
- Computer Science, ACM Computing Surveys
- 2022
A comprehensive review of the research on knowledge-enhanced text generation over the past five years is presented, which includes two parts: (i) general methods and architectures for integrating knowledge into text generation; (ii) specific techniques and applications according to different forms of knowledge data.
Are All the Datasets in Benchmark Necessary? A Pilot Study of Dataset Evaluation for Text Classification
- Computer Science, ArXiv
- 2022
Preliminary experiments promisingly show that given a sufficient number of training experimental records, a meaningful predictor can be learned to estimate dataset discrimination over unseen datasets.
References (showing 1-10 of 67)
Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale
- Computer Science, COLING
- 2020
These experiments show that metrics usually prefer system outputs to human-authored texts, can be insensitive to correct translations of rare words, and can yield surprisingly high scores when given a single sentence as system output for the entire test set.
Evaluation of Text Generation: A Survey
- Computer Science, ArXiv
- 2020
This paper surveys evaluation methods of natural language generation (NLG) systems that have been developed in the last few years, with a focus on the evaluation of recently proposed NLG tasks and neural NLG models.
HighRES: Highlight-based Reference-less Evaluation of Summarization
- Computer Science, ACL
- 2019
A novel approach for manual evaluation, Highlight-based Reference-less Evaluation of Summarization (HighRES), is proposed, in which summaries are assessed by multiple annotators against the source document via manually highlighted salient content in the latter; this improves inter-annotator agreement compared to using the source documents directly.
Unifying Human and Statistical Evaluation for Natural Language Generation
- Computer Science, NAACL
- 2019
This paper proposes HUSE, a unified framework that evaluates both diversity and quality based on the optimal error rate of predicting whether a sentence is human- or machine-generated, which is efficiently estimated by combining human and statistical evaluation.
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
- Computer Science, BlackboxNLP@EMNLP
- 2018
A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models' understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models are presented; the benchmark favors models that represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge transfer across tasks.
ChatEval: A Tool for Chatbot Evaluation
- Computer Science, NAACL
- 2019
A unified framework for human evaluation of chatbots is introduced that augments existing tools and provides a web-based hub for researchers to share and compare their dialog systems, along with open-source baseline models and evaluation datasets.
MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics
- Computer Science, EMNLP
- 2020
A Learned Evaluation metric for Reading Comprehension, LERC, is trained to mimic human judgement scores, achieving 80% accuracy and outperforming baselines by 14 to 26 absolute percentage points while leaving significant room for improvement.
Abstractive Summarization of Reddit Posts with Multi-level Memory Networks
- Computer Science, NAACL
- 2019
This work collects the Reddit TIFU dataset, consisting of 120K posts from the online discussion forum Reddit, and proposes a novel abstractive summarization model named multi-level memory networks (MMN), equipped with multi-level memory to store the information of text at different levels of abstraction.
Language Models are Unsupervised Multitask Learners
- Computer Science
- 2019
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Evaluating Machines by their Real-World Language Use
- Computer Science, ArXiv
- 2020
This work proposes to evaluate machines by their success at real-world language use, which greatly expands the scope of language tasks that can be measured and studied, and introduces TuringAdvice, a new challenge for language understanding systems.