Corpus ID: 231632593

GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation

@article{Khashabi2021GENIEAL,
  title={GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation},
  author={Daniel Khashabi and Gabriel Stanovsky and Jonathan Bragg and Nicholas Lourie and Jungo Kasai and Yejin Choi and Noah A. Smith and Daniel S. Weld},
  journal={ArXiv},
  year={2021},
  volume={abs/2101.06561}
}
Leaderboards have eased model development for many NLP datasets by standardizing their evaluation and delegating it to an independent external repository. Their adoption, however, is so far limited to tasks which can be reliably evaluated in an automatic manner. This work introduces GENIE, an extensible human evaluation leaderboard, which brings the ease of leaderboards to text generation tasks. GENIE automatically posts leaderboard submissions to crowdsourcing platforms asking human annotators… 

Citations

All That’s ‘Human’ Is Not Gold: Evaluating Human Evaluation of Generated Text
TLDR
The role untrained human evaluations play in NLG evaluation is examined, and three approaches for quickly training evaluators to better identify GPT-3-authored text are explored; while evaluation accuracy improved to at most 55%, it did not improve significantly across the three domains.
Think you have Solved Direct-Answer Question Answering? Try ARC-DA, the Direct-Answer AI2 Reasoning Challenge
TLDR
The ARC-DA dataset is presented, a direct-answer (“open response”, “freeform”) version of the ARC (AI2 Reasoning Challenge) multiple-choice dataset, one of the first DA datasets of natural questions that often require reasoning, and where appropriate question decompositions are not evident from the questions themselves.
Methods for the Design and Evaluation of HCI+NLP Systems
TLDR
Five methodological proposals at the intersection of HCI and NLP are presented and situated in the context of ML-based NLP models.
Control Prefixes for Parameter-Efficient Text Generation
TLDR
A dynamic method, Control Prefixes, is proposed, which allows for the inclusion of conditional input-dependent information, combining the benefits of prompt tuning and controlled generation, and can even outperform full fine-tuning methods.
How to Evaluate Your Dialogue Models: A Review of Approaches
TLDR
This survey, which seeks an explicit and comprehensive analysis of the existing methods, divides the evaluation methods into three classes: automatic evaluation, human-involved evaluation, and user-simulator-based evaluation.
How much coffee was consumed during EMNLP 2019? Fermi Problems: A New Reasoning Challenge for AI
TLDR
Several unsolved AI problems are crystallized into a single, new challenge, namely Fermi Problems (FPs), which are questions whose answers can only be approximately estimated because their precise computation is either impractical or impossible.
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
TLDR
GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics, is introduced, and the data for the 2021 shared task at the associated GEM Workshop are described.
Findings of the 2021 Conference on Machine Translation (WMT21)
This paper presents the results of the news translation task, the multilingual low-resource translation task for Indo-European languages, the triangular translation task, and the automatic post-editing task.
A Survey of Knowledge-Enhanced Text Generation
TLDR
A comprehensive review of the research on knowledge-enhanced text generation over the past five years is presented, which includes two parts: (i) general methods and architectures for integrating knowledge into text generation; (ii) specific techniques and applications according to different forms of knowledge data.
Are All the Datasets in Benchmark Necessary? A Pilot Study of Dataset Evaluation for Text Classification
TLDR
Preliminary experiments promisingly show that given a sufficient number of training experimental records, a meaningful predictor can be learned to estimate dataset discrimination over unseen datasets.

References

SHOWING 1-10 OF 67 REFERENCES
Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale
TLDR
These experiments show that metrics usually prefer system outputs to human-authored texts, can be insensitive to correct translations of rare words, and can yield surprisingly high scores when given a single sentence as system output for the entire test set.
Evaluation of Text Generation: A Survey
TLDR
This paper surveys evaluation methods of natural language generation (NLG) systems that have been developed in the last few years, with a focus on the evaluation of recently proposed NLG tasks and neural NLG models.
HighRES: Highlight-based Reference-less Evaluation of Summarization
TLDR
A novel approach for manual evaluation, Highlight-based Reference-less Evaluation of Summarization (HighRES), in which summaries are assessed by multiple annotators against the source document via manually highlighted salient content in the latter, which improves inter-annotator agreement in comparison to using the source documents directly.
Unifying Human and Statistical Evaluation for Natural Language Generation
TLDR
This paper proposes a unified framework which evaluates both diversity and quality, based on the optimal error rate of predicting whether a sentence is human- or machine-generated, called HUSE, which is efficiently estimated by combining human and statistical evaluation.
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
TLDR
A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.
ChatEval: A Tool for Chatbot Evaluation
TLDR
A unified framework for human evaluation of chatbots that augments existing tools and provides a web-based hub for researchers to share and compare their dialog systems and open-source baseline models and evaluation datasets are introduced.
MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics
TLDR
A Learned Evaluation metric for Reading Comprehension, LERC, is trained to mimic human judgement scores, which achieves 80% accuracy and outperforms baselines by 14 to 26 absolute percentage points while leaving significant room for improvement.
Abstractive Summarization of Reddit Posts with Multi-level Memory Networks
TLDR
This work collects the Reddit TIFU dataset, consisting of 120K posts from the online discussion forum Reddit, and proposes a novel abstractive summarization model named multi-level memory networks (MMN), equipped with multi-level memory to store the information of text from different levels of abstraction.
Language Models are Unsupervised Multitask Learners
TLDR
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Evaluating Machines by their Real-World Language Use
TLDR
This work proposes to evaluate machines by their success at real-world language use -- which greatly expands the scope of language tasks that can be measured and studied, and introduces TuringAdvice, a new challenge for language understanding systems.