Publications
Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions
TLDR
Due to a pervasive lack of clarity in reports and extreme diversity in approaches, human evaluation in NLG presents as extremely confused in 2020, and the field is in urgent need of standard methods and terminology.
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
TLDR
GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics, is introduced, and the data for the 2021 shared task at the associated GEM Workshop is described.
A Study of Automatic Metrics for the Evaluation of Natural Language Explanations
TLDR
The ExBAN corpus, a crowd-sourced corpus of NL explanations for Bayesian Networks, is presented, and it is found that embedding-based automatic NLG evaluation methods correlate more strongly with human ratings than word-overlap metrics such as BLEU and ROUGE.
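To illustrate the kind of comparison this finding rests on, here is a minimal sketch of measuring how well an automatic metric's scores track human ratings via rank correlation. The scores and variable names below are illustrative placeholders, not data or code from the ExBAN study.

```python
# A minimal sketch (not the ExBAN evaluation code) of comparing automatic
# NLG metrics against human ratings using Spearman rank correlation.
from scipy.stats import spearmanr

# Hypothetical per-explanation scores: human ratings (1-5 scale) alongside a
# word-overlap metric (e.g. ROUGE-L F1) and an embedding-based metric
# (e.g. BERTScore F1) for the same system outputs.
human_ratings    = [4.2, 3.1, 4.8, 2.5, 3.9, 4.5]
overlap_scores   = [0.41, 0.38, 0.52, 0.35, 0.40, 0.47]
embedding_scores = [0.87, 0.74, 0.93, 0.66, 0.82, 0.90]

# Spearman's rho measures how closely each metric's ranking of outputs
# agrees with the ranking implied by the human judgements.
rho_overlap, _   = spearmanr(human_ratings, overlap_scores)
rho_embedding, _ = spearmanr(human_ratings, embedding_scores)

print(f"word-overlap metric vs. humans:    rho = {rho_overlap:.2f}")
print(f"embedding-based metric vs. humans: rho = {rho_embedding:.2f}")
```

A higher rho for the embedding-based metric would correspond to the paper's reported pattern of stronger agreement with human ratings.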
A Survey of Explainable AI Terminology
TLDR
An analysis of the existing research literature is presented, examining how key terms such as transparency, intelligibility, interpretability, and explainability are used and in what contexts, in order to move towards a standard terminology for Explainable AI.
Underreporting of errors in NLG output, and what to do about it
TLDR
There is a severe under-reporting of the different kinds of errors that Natural Language Generation systems make, and this position paper provides recommendations for error identification, analysis and reporting.
It’s Commonsense, isn’t it? Demystifying Human Evaluations in Commonsense-Enhanced NLG Systems
TLDR
The Commonsense Evaluation Card (CEC) is proposed, a set of recommendations for evaluation reporting of commonsense-enhanced NLG systems, underpinned by an extensive analysis of human evaluations reported in the recent literature.
You reap what you sow: On the Challenges of Bias Evaluation Under Multilingual Settings
TLDR
Three dimensions of developing multilingual bias evaluation frameworks are highlighted: increasing transparency through documentation, expanding targets of bias beyond gender, and addressing cultural differences that exist between languages.
Emergent Structures and Training Dynamics in Large Language Models
TLDR
The lack of sufficient research on the emergence of functional units, subsections of the network where related functions are grouped or organised, within large language models is noted in particular, and work is motivated that grounds the study of language models in an analysis of their changing internal structure during training.
I don't understand! Evaluation Methods for Natural Language Explanations
TLDR
This paper reviews existing work on how evaluation methods from the field of Natural Language Generation (NLG) can be mapped onto NL explanations, and presents a preliminary investigation into the relationship between linguistic features and human evaluation, using a dataset of NL explanations derived from Bayesian Networks.