A Study of Automatic Metrics for the Evaluation of Natural Language Explanations

Miruna Clinciu, Arash Eshghi and H. Hastie
As transparency becomes key for robotics and AI, it will be necessary to evaluate the methods through which transparency is provided, including automatically generated natural language (NL) explanations. Here, we explore parallels between the generation of such explanations and the much-studied field of evaluation of Natural Language Generation (NLG). Specifically, we investigate which of the NLG evaluation measures map well to explanations. We present the ExBAN corpus: a crowd-sourced corpus… 
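The question of whether an NLG evaluation measure "maps well" to explanations is typically answered by correlating metric scores with human quality ratings. A minimal sketch of that analysis, using Spearman's rank correlation over toy scores (the score lists are invented for illustration; ties are not handled, so real data would need an averaged-rank variant or a statistics library):

```python
def rank(values):
    """Return 1-based ranks of the values (assumes no ties, for simplicity)."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman(xs, ys):
    """Spearman rank correlation between two equal-length score lists."""
    n = len(xs)
    rx, ry = rank(xs), rank(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Toy example: hypothetical automatic-metric scores vs. human ratings
# for five generated explanations.
metric_scores = [0.31, 0.72, 0.55, 0.12, 0.90]
human_ratings = [2, 4, 3, 1, 5]
print(spearman(metric_scores, human_ratings))  # → 1.0 (perfect rank agreement)
```

A metric that only weakly reflects human judgments would yield a correlation near zero under this procedure, which is the kind of finding several of the papers below report.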


Self-Explainable Robots in Remote Environments
A system that learns from demonstrations to inspect areas in a remote environment and to explain robot behaviour is presented; it can inspect an offshore platform autonomously and explain its decision process through both image-based and natural language-based interfaces.
Reframing Human-AI Collaboration for Generating Free-Text Explanations
A pipeline that combines GPT-3 with a supervised filter that incorporates binary acceptability judgments from humans in the loop is created, and it is demonstrated that acceptability is partially correlated with various fine-grained attributes of explanations.
Investigating the Benefits of Free-Form Rationales
This work presents human studies showing that ECQA rationales indeed provide additional background information needed to understand a decision, while over 88% of CoS-E rationales do not, and investigates the utility of rationales as an additional source of supervision by varying the quantity and quality of rationales during training.
Towards Human-Centred Explainability Benchmarks For Text Classification
This position paper proposes to extend text classification benchmarks to evaluate the explainability of text classifiers, and to ground these benchmarks in human-centred applications, for example by using social media, gamification, or learning explainability metrics from human judgements.
On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization
It is shown that recent advances in each modality, namely CLIP image representations and the scaling of language models, do not consistently improve self-rationalization on tasks with multimodal inputs, and it is observed that no model type works universally best across tasks/datasets and data sizes.
Scientific Explanation and Natural Language: A Unified Epistemological-Linguistic Perspective for Explainable AI
A fundamental research goal for Explainable AI (XAI) is to build models that are capable of reasoning through the generation of natural language explanations. However, the methodologies to design…
Evaluating the Evaluation Metrics for Style Transfer: A Case Study in Multilingual Formality Transfer
This paper evaluates leading automatic metrics on the oft-researched task of formality style transfer in Brazilian-Portuguese, French, and Italian, making this work the first multilingual evaluation of metrics in ST.
Does External Knowledge Help Explainable Natural Language Inference? Automatic Evaluation vs. Human Ratings
The largest and most fine-grained explainable NLI crowdsourcing study to date reveals that even large differences in automatic performance scores are not reflected in human ratings of label, explanation, commonsense, or grammar correctness.
I don't understand! Evaluation Methods for Natural Language Explanations
This paper presents existing work on how evaluation methods from the field of Natural Language Generation (NLG) can be mapped onto NL explanations, and presents a preliminary investigation into the relationship between linguistic features and human evaluation, using a dataset of NL explanations derived from Bayesian Networks.
Measuring Association Between Labels and Free-Text Rationales
It is demonstrated that *pipelines*, models for faithful rationalization on information-extraction style tasks, do not work as well on “reasoning” tasks requiring free-text rationales, and state-of-the-art T5-based joint models exhibit desirable properties for explaining commonsense question-answering and natural language inference.


NILE: Natural Language Inference with Faithful Natural Language Explanations
This work proposes Natural-language Inference over Label-specific Explanations (NILE), a novel NLI method which utilizes auto-generated label-specific NL explanations to produce labels along with their faithful explanations, and demonstrates NILE’s effectiveness over previously reported methods through automated and human evaluation of the produced labels and explanations.
e-SNLI: Natural Language Inference with Natural Language Explanations
The Stanford Natural Language Inference dataset is extended with an additional layer of human-annotated natural language explanations of the entailment relations, which can be used for various goals, such as obtaining full sentence justifications of a model’s decisions, improving universal sentence representations and transferring to out-of-domain NLI datasets.
Comparing Automatic and Human Evaluation of NLG Systems
It is found that NIST scores correlate best with human judgments, but that all automatic metrics the authors examined are biased in favour of generators that select on the basis of frequency alone.
RankME: Reliable Human Ratings for Natural Language Generation
This work presents a novel rank-based magnitude estimation method (RankME), which combines the use of continuous scales and relative assessments, and shows that RankME significantly improves the reliability and consistency of human ratings compared to traditional evaluation methods.
A large annotated corpus for learning natural language inference
The Stanford Natural Language Inference corpus is introduced, a new, freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning, which allows a neural network-based model to perform competitively on natural language inference benchmarks for the first time.
Evaluating the State-of-the-Art of End-to-End Natural Language Generation: The E2E NLG Challenge
Why We Need New Evaluation Metrics for NLG
A wide range of metrics are investigated, including state-of-the-art word-based and novel grammar-based ones, and it is demonstrated that they only weakly reflect human judgements of system outputs as generated by data-driven, end-to-end NLG.
Explain Yourself! Leveraging Language Models for Commonsense Reasoning
This work collects human explanations for commonsense reasoning, in the form of natural language sequences and highlighted annotations, in a new dataset called Common Sense Explanations, which is used to train language models to automatically generate explanations that can be used during training and inference in a novel Commonsense Auto-Generated Explanation framework.
Evaluation in the context of natural language generation
It describes what is involved in natural language generation and how evaluation has figured in work in this area to date; a particular text generation application is then examined, along with the issues raised in assessing its performance on a variety of dimensions.
Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation
An up-to-date synthesis of research on the core tasks in NLG, and the architectures adopted in which such tasks are organised, is given, highlighting a number of recent research topics that have arisen partly as a result of growing synergies between NLG and other areas of artificial intelligence.