Corpus ID: 244954786

Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand

@article{Kasai2021BidimensionalLG,
  title={Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand},
  author={Jungo Kasai and Keisuke Sakaguchi and Ronan Le Bras and Lavinia Dunagan and Jacob Morrison and Alexander R. Fabbri and Yejin Choi and Noah A. Smith},
  journal={ArXiv},
  year={2021},
  volume={abs/2112.04139}
}
Natural language processing researchers have identified limitations of evaluation methodology for generation tasks, with new questions raised about the validity of automatic metrics and of crowdworker judgments. Meanwhile, efforts to improve generation models tend to depend on simple n-gram overlap metrics (e.g., BLEU, ROUGE). We argue that new advances on models and metrics should each more directly benefit and inform the other. We therefore propose a generalization of leaderboards…
A global analysis of metrics used for measuring performance in natural language processing
TLDR
The results suggest that the large majority of natural language processing metrics currently used have properties that may result in an inadequate reflection of a model's performance, and that ambiguities and inconsistencies in the reporting of metrics may lead to difficulties in interpreting and comparing model performances, impairing transparency and reproducibility in NLP research.
Twist Decoding: Diverse Generators Guide Each Other
TLDR
This work introduces Twist decoding, a simple and general inference algorithm that generates text while benefiting from diverse models; the authors hope it will encourage researchers and practitioners to examine generation models collectively, not just independently, and to seek out models with strengths complementary to those of currently available models.
Transparent Human Evaluation for Image Captioning
TLDR
THumB, a rubric-based human evaluation protocol for image captioning models, is established; it reveals that CLIPScore, a recent metric that uses image features, correlates with human judgments better than conventional text-only metrics because it is more sensitive to recall.
Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text
TLDR
This paper surveys issues with human and automatic model evaluations and with commonly used datasets in NLG that have been pointed out over the past 20 years, lays out a long-term vision for NLG evaluation, and proposes concrete steps for researchers to improve their evaluation processes.
ELQA: A Corpus of Questions and Answers about the English Language
TLDR
Three tasks based on the ELQA corpus are introduced: answer quality classification, semantic search for finding similar questions, and answer generation; baselines for each task are presented, showing the strengths and weaknesses of current transformer-based models.
Slovene SuperGLUE Benchmark: Translation and Evaluation
TLDR
The results show that the monolingual Slovene SloBERTa model is superior to massively multilingual and trilingual BERT models, though these also show good cross-lingual performance on certain tasks, and that the performance of Slovene models still lags behind that of the best English models.
Why only Micro-F1? Class Weighting of Measures for Relation Classification
TLDR
This work introduces a framework for weighting schemes in which existing schemes are the extremes, along with two new intermediate schemes, and shows that reporting results under different weighting schemes better highlights a model's strengths and weaknesses.
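The paper's specific intermediate schemes are not spelled out in this summary; as a rough, hedged illustration of the underlying idea, the Python sketch below (toy data, not code from the paper) computes micro-F1, macro-F1, and one conceivable support-weighted average, showing how the choice of class weighting changes the reported score.

```python
# Hedged illustration (toy data, not code from the paper): micro-F1 pools
# true/false positives over all classes, so frequent classes dominate;
# macro-F1 averages per-class F1 with equal weights, so rare classes count
# as much as frequent ones. A support-weighted average is shown only as one
# conceivable middle ground; the paper's own intermediate schemes differ.
from collections import Counter

def counts_per_class(gold, pred, labels):
    """Per-class (true positive, false positive, false negative) counts."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    return {c: (tp[c], fp[c], fn[c]) for c in labels}

def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical relation-classification outputs: "rel_a" is 4x as frequent.
gold = ["rel_a"] * 8 + ["rel_b"] * 2
pred = ["rel_a"] * 6 + ["rel_b"] * 2 + ["rel_b", "rel_a"]
labels = ["rel_a", "rel_b"]
per_class = counts_per_class(gold, pred, labels)

# Micro-F1: sum counts over all classes first, then compute a single F1.
totals = [sum(c[i] for c in per_class.values()) for i in range(3)]
micro = f1_from_counts(*totals)

# Macro-F1: average per-class F1 scores with equal class weights.
macro = sum(f1_from_counts(*per_class[c]) for c in labels) / len(labels)

# One possible intermediate: weight per-class F1 by gold-label support.
support = Counter(gold)
weighted = sum(support[c] / len(gold) * f1_from_counts(*per_class[c])
               for c in labels)

print(f"micro={micro:.2f}  macro={macro:.2f}  support-weighted={weighted:.2f}")
# -> micro=0.70  macro=0.60  support-weighted=0.72
```

With these toy predictions the three variants yield noticeably different scores, which is the kind of discrepancy the work argues should be surfaced when reporting results.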
Beam Decoding with Controlled Patience
TLDR
A patience factor, a simple modification to the beam decoding algorithm, is introduced; it generalizes the stopping criterion, provides control over the depth of search, and is readily incorporated into any implementation.
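As a hedged sketch of the idea summarized above, and not the paper's reference implementation, the snippet below generalizes the common "stop once the finished-hypothesis set reaches the beam size" criterion with a patience factor; the exact formulation in the published algorithm may differ.

```python
# Hedged sketch: many beam search implementations stop once the number of
# finished hypotheses reaches the beam size k. A patience factor p
# generalizes this to stop once ceil(p * k) finished hypotheses have been
# collected (p = 1 recovers the conventional criterion; larger p searches
# deeper). Illustration only, not the paper's reference implementation.
import math

def should_stop(num_finished: int, beam_size: int, patience: float = 1.0,
                cur_len: int = 0, max_len: int = 256) -> bool:
    """Generalized stopping criterion for beam decoding."""
    if cur_len >= max_len:  # a hard length limit still applies
        return True
    return num_finished >= math.ceil(patience * beam_size)

# Example: with beam_size=5 and patience=2.0, decoding continues until
# 10 finished hypotheses are collected (or the length limit is hit).
print(should_stop(num_finished=5, beam_size=5, patience=2.0))   # False
print(should_stop(num_finished=10, beam_size=5, patience=2.0))  # True
```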
Challenges in Measuring Bias via Open-Ended Language Generation
Researchers have devised numerous ways to quantify social biases vested in pretrained language models. As some language models are capable of generating coherent completions given a set of textual…
Towards Automated Document Revision: Grammatical Error Correction, Fluency Edits, and Beyond
Natural language processing technology has rapidly improved automated grammatical error correction tasks, and the community has begun to explore document-level revision as one of the next challenges. To…

References

Showing 1–10 of 118 references
Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation
TLDR
This work proposes an evaluation methodology grounded in explicit error analysis, based on the Multidimensional Quality Metrics (MQM) framework, and carries out the largest MQM research study to date, scoring the outputs of top systems from the WMT 2020 shared task in two language pairs using annotations provided by professional translators with full document context.
How Robust are Model Rankings : A Leaderboard Customization Approach for Equitable Evaluation
TLDR
A task-agnostic method to probe leaderboards by weighting samples based on their ‘difficulty’ level finds that leaderboards can be adversarially attacked and that top-performing models may not always be the best models.
Answers Unite! Unsupervised Metrics for Reinforced Summarization Models
TLDR
This work explores and proposes alternative evaluation measures; the reported human-evaluation analysis shows that the proposed metrics, based on question answering, compare favorably to ROUGE, with the additional property of not requiring reference summaries.
Sentence Mover’s Similarity: Automatic Evaluation for Multi-Sentence Texts
TLDR
This work introduces methods based on sentence mover’s similarity, and finds that sentence-based metrics correlate with human judgments significantly better than ROUGE, both on machine-generated summaries and human-authored essays.
Results of the WMT19 Metrics Shared Task: Segment-Level and Strong MT Systems Pose Big Challenges
TLDR
Developers of automatic metrics were asked to score the outputs of the translation systems competing in the WMT19 News Translation Task; metrics were evaluated at the system level (how well a given metric correlates with the WMT19 official manual ranking) and at the segment level (how well the metric correlates with human judgments of segment quality).
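For readers unfamiliar with how such metric evaluations are scored, the sketch below illustrates the two levels with made-up numbers; it assumes scipy is available and is not the shared task's official tooling.

```python
# Hedged illustration of the two evaluation levels described above: a metric
# is judged by how well its scores correlate with human judgments, at the
# system level (one score per MT system) and at the segment level (one score
# per translated sentence). All numbers here are hypothetical.
from scipy.stats import pearsonr, kendalltau

# System level: correlate metric scores with official human scores/rankings.
human_sys  = [71.2, 68.5, 66.0, 61.3]   # hypothetical human scores per system
metric_sys = [0.83, 0.80, 0.78, 0.70]   # hypothetical metric scores per system
print("system-level Pearson r:", pearsonr(human_sys, metric_sys)[0])

# Segment level: correlate per-sentence metric scores with per-sentence human
# judgments (Kendall's tau is common because the judgments are ordinal).
human_seg  = [3, 5, 2, 4, 1, 5, 2]
metric_seg = [0.4, 0.9, 0.3, 0.7, 0.2, 0.8, 0.35]
print("segment-level Kendall tau:", kendalltau(human_seg, metric_seg)[0])
```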
Fine-Tuning Language Models from Human Preferences
TLDR
This paper builds on advances in generative pretraining of language models to apply reward learning to four natural language tasks: continuing text with positive sentiment or physically descriptive language, and summarization tasks on the TL;DR and CNN/Daily Mail datasets.
SPICE: Semantic Propositional Image Caption Evaluation
There is considerable interest in the task of automatically generating image captions. However, evaluation is challenging. Existing automatic evaluation metrics are primarily sensitive to n-gram…
Transparent Human Evaluation for Image Captioning
TLDR
THumB, a rubric-based human evaluation protocol for image captioning models, is established; it reveals that CLIPScore, a recent metric that uses image features, correlates with human judgments better than conventional text-only metrics because it is more sensitive to recall.
Unified Language Model Pre-training for Natural Language Understanding and Generation
TLDR
A new Unified pre-trained Language Model (UniLM) is presented that can be fine-tuned for both natural language understanding and generation tasks; it compares favorably with BERT on the GLUE benchmark and on the SQuAD 2.0 and CoQA question answering tasks.
All That’s ‘Human’ Is Not Gold: Evaluating Human Evaluation of Generated Text
TLDR
The role untrained human evaluations play in NLG evaluation is examined, three approaches for quickly training evaluators to better identify GPT3-authored text are explored, and it is found that while evaluation accuracy improved up to 55%, it did not significantly improve across the three domains.