Corpus ID: 232035689

Teach Me to Explain: A Review of Datasets for Explainable NLP

@article{Wiegreffe2021TeachMT,
  title={Teach Me to Explain: A Review of Datasets for Explainable NLP},
  author={Sarah Wiegreffe and Ana Marasovi{\'c}},
  journal={ArXiv},
  year={2021},
  volume={abs/2102.12060}
}
Explainable NLP (EXNLP) has increasingly focused on collecting human-annotated explanations. These explanations are used downstream in three ways: as data augmentation to improve performance on a predictive task, as a loss signal to train models to produce explanations for their predictions, and as a means to evaluate the quality of model-generated explanations. In this review, we identify three predominant classes of explanations (highlights, free-text, and structured), organize the literature…
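To make the second of these uses concrete, below is a minimal sketch of supervising a model's token-relevance scores with human highlight annotations. The model interface, tensor shapes, and loss weighting are illustrative assumptions, not a method prescribed by the review.

```python
# Minimal sketch: human highlights as a loss signal. Assumes a model that,
# in addition to its task logits, emits a per-token relevance logit; all
# names and shapes here are illustrative.
import torch.nn.functional as F

def joint_loss(task_logits, labels, token_scores, highlight_mask, lam=0.5):
    """Combine the main predictive loss with explanation supervision.

    task_logits:    (batch, num_classes) predictions for the main task
    labels:         (batch,) gold task labels
    token_scores:   (batch, seq_len) per-token relevance logits
    highlight_mask: (batch, seq_len) 1.0 where annotators highlighted a token
    """
    task_term = F.cross_entropy(task_logits, labels)
    # Push the model's token relevance toward the human highlights.
    expl_term = F.binary_cross_entropy_with_logits(token_scores, highlight_mask)
    return task_term + lam * expl_term
```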

Citations

ExplaGraphs: An Explanation Graph Generation Task for Structured Commonsense Reasoning
TLDR
This work presents EXPLAGRAPHS, a new generative and structured commonsense-reasoning task (and an associated dataset) of explanation graph generation for stance prediction, and proposes a multi-level evaluation framework that checks the structural and semantic correctness of the generated graphs and their plausibility against human-written graphs.
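One structural check in the spirit of this evaluation can be sketched in a few lines: an explanation graph given as (head, relation, tail) triples should form a connected DAG. The triple format and the exact constraints are assumptions for illustration, not EXPLAGRAPHS' actual evaluation code.

```python
# Hedged sketch of a structural-correctness check for explanation graphs.
import networkx as nx

def structurally_valid(triples):
    g = nx.DiGraph()
    for head, relation, tail in triples:
        g.add_edge(head, tail, label=relation)
    # Reject empty, cyclic, or disconnected graphs.
    return (g.number_of_edges() > 0
            and nx.is_directed_acyclic_graph(g)
            and nx.is_weakly_connected(g))

print(structurally_valid([("guns", "cause", "violence"),
                          ("violence", "undermines", "safety")]))  # True
```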
On the Diversity and Limits of Human Explanations
  • Chenhao Tan
  • Computer Science
  • ArXiv
  • 2021
TLDR
Inspired by prior work in psychology and cognitive science, existing human explanations in NLP are grouped into three categories: proximal mechanism, evidence, and procedure. These categories differ in nature and have implications for the resultant explanations.
Do Natural Language Explanations Represent Valid Logical Arguments? Verifying Entailment in Explainable NLI Gold Standards
TLDR
A systematic annotation methodology, named Explanation Entailment Verification (EEV), is proposed to quantify the logical validity of human-annotated explanations; the study confirms that the inferential properties of explanations are still poorly formalised and understood.
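One way to operationalize such a check, sketched below under loud assumptions (this is not the paper's pipeline, and the model choice is arbitrary), is to treat the explanation as an NLI premise and the claim it should support as the hypothesis.

```python
# Hedged sketch of an entailment check over explanations, in the spirit of
# EEV: ask an off-the-shelf NLI model whether the explanation (premise)
# entails the claim (hypothesis). Model choice is an assumption.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def explanation_entails(explanation: str, claim: str) -> bool:
    result = nli([{"text": explanation, "text_pair": claim}])[0]
    return result["label"] == "ENTAILMENT"
```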
Explainable Machine Learning with Prior Knowledge: An Overview
TLDR
This survey presents an overview of integrating prior knowledge into machine learning systems in order to improve explainability, and categorizes current research into three main approaches: integrating knowledge into the machine learning pipeline, integrating knowledge into the explainability method, or deriving knowledge from explanations.
GCRC: A New Challenging MRC Dataset from Gaokao Chinese for Explainable Evaluation
  • Hongye Tan, Xiaoyue Wang, +5 authors Xiaoqi Han
  • Computer Science
  • Findings of ACL
  • 2021
TLDR
This paper proposes GCRC, a new dataset of challenging, high-quality multiple-choice questions collected from Gaokao Chinese (the Chinese subject of China's National College Entrance Examination), and shows that the proposed dataset is more challenging than existing ones and very useful for identifying the limitations of existing MRC systems in an explainable way.
Generating Hypothetical Events for Abductive Inference
TLDR
This work proposes a multi-task learning model, MTL, to solve the Abductive NLI task; it predicts a plausible explanation by considering different possible events emerging from candidate hypotheses (events generated by LMI) and selecting the one most similar to the observed outcome.
Hybrid Autoregressive Solver for Scalable Abductive Natural Language Inference
TLDR
A hybrid abductive solver is proposed that autoregressively combines a dense bi-encoder with a sparse model of explanatory power computed from explicit patterns in the explanations; experiments show that it boosts the quality of the explanations and improves downstream inference performance.
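A minimal sketch of this hybrid, autoregressive idea: candidate facts are ranked by a weighted mix of a dense similarity and a sparse explanatory-power term, and selected fact by fact, with each chosen fact appended to the query. The encoder, the power function, and the weight lambda are assumptions for illustration, not the paper's implementation.

```python
# Greedy autoregressive selection of explanation facts under a hybrid score.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def select_explanation(query, facts, encode, power, steps=3, lam=0.7):
    """encode: text -> vector (stands in for a dense bi-encoder)
    power:  (query, fact) -> float (stands in for sparse explanatory power)
    """
    chain = []
    for _ in range(steps):
        q_vec = encode(query)
        scored = [(lam * cosine(q_vec, encode(f)) + (1 - lam) * power(query, f), f)
                  for f in facts if f not in chain]
        if not scored:
            break
        _, best = max(scored)
        chain.append(best)
        query = query + " " + best  # condition the next step on the chosen fact
    return chain
```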
It’s the Meaning That Counts: The State of the Art in NLP and Semantics
TLDR
This work reviews the state of computational semantics in NLP and investigates how different lines of inquiry reflect distinct understandings of semantics and prioritize different layers of linguistic meaning.
On the Challenges of Evaluating Compositional Explanations in Multi-Hop Inference: Relevance, Completeness, and Expert Ratings
Building compositional explanations requires models to combine two or more facts that, together, describe why the answer to a question is correct. Typically, these “multi-hop” explanations are…
Order in the Court: Explainable AI Methods Prone to Disagreement
TLDR
It is argued that rank correlation is largely uninformative and does not measure the quality of feature-additive methods, and that the range of conclusions a practitioner may draw from a single explainability algorithm is limited.
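The quantity under discussion is easy to reproduce: given per-token importance scores from two feature-additive explainers over the same input, compare their rankings. The scores below are made up for illustration.

```python
# Rank correlation between two attribution vectors over the same tokens.
from scipy.stats import spearmanr

attribution_a = [0.91, 0.10, 0.42, 0.05, 0.33]  # e.g., from method A
attribution_b = [0.15, 0.80, 0.40, 0.07, 0.35]  # e.g., from method B

rho, p_value = spearmanr(attribution_a, attribution_b)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
```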

References

SHOWING 1-10 OF 171 REFERENCES
QED: A Framework and Dataset for Explanations in Question Answering
TLDR
A large user study is described showing that the presence of QED explanations significantly improves the ability of untrained raters to spot errors made by a strong neural QA baseline.
Evaluating and Characterizing Human Rationales
TLDR
Analysis of a variety of datasets and models finds that human rationales do not necessarily perform well on automated metrics; improved metrics are proposed to account for model-dependent baseline performance, along with two methods to further characterize rationale quality.
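A hedged sketch of the kind of baseline adjustment argued for here: raw accuracy when a model sees only the rationale is hard to compare across models, so normalize it between a model-specific floor (accuracy on a null input) and ceiling (accuracy on the full input). The exact normalization below is an illustrative assumption, not the paper's formula.

```python
# Baseline-adjusted rationale quality: 0 means no better than the null
# input, 1 means the rationale recovers the model's full-input accuracy.
def normalized_rationale_performance(acc_rationale, acc_null, acc_full):
    if acc_full == acc_null:  # degenerate model: no usable range
        return 0.0
    return (acc_rationale - acc_null) / (acc_full - acc_null)

# Example: a rationale that recovers most of the model's headroom.
print(normalized_rationale_performance(0.78, 0.50, 0.85))  # 0.8
```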
GLUCOSE: GeneraLized and COntextualized Story Explanations
TLDR
This paper presents a platform for effectively crowdsourcing GLUCOSE data at scale, which uses semi-structured templates to elicit causal explanations and collects 440K specific statements and general rules that capture implicit commonsense knowledge about everyday situations.
Learning to Explain: Datasets and Models for Identifying Valid Reasoning Chains in Multihop Question-Answering
TLDR
A delexicalized chain representation is explored in which repeated noun phrases are replaced by variables, turning chains into generalized reasoning chains; generalized chains are found to maintain performance while being more robust to certain perturbations.
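A toy sketch of that delexicalization: noun phrases recurring across facts in a chain are replaced with shared variables, yielding a generalized template. Phrase detection here is a naive assumption (exact string match on given phrases), purely for illustration.

```python
# Replace noun phrases that repeat across a reasoning chain with variables.
def delexicalize(chain, noun_phrases):
    variables = {}
    generalized = []
    for fact in chain:
        for np_ in noun_phrases:
            if sum(np_ in f for f in chain) > 1:  # phrase repeats in the chain
                var = variables.setdefault(np_, f"X{len(variables) + 1}")
                fact = fact.replace(np_, var)
        generalized.append(fact)
    return generalized

chain = ["a metal spoon conducts heat", "conducting heat warms the soup"]
print(delexicalize(chain, ["metal spoon", "heat", "soup"]))
# ['a metal spoon conducts X1', 'conducting X1 warms the soup']
```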
QASC: A Dataset for Question Answering via Sentence Composition
TLDR
This work presents a multi-hop reasoning dataset, Question Answering via Sentence Composition (QASC), that requires retrieving facts from a large corpus and composing them to answer a multiple-choice question, and presents a two-step approach to mitigate the retrieval challenges.
Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets
TLDR
It is shown that model performance improves when training with annotator identifiers as features, that models are able to recognize the most productive annotators, and that models often do not generalize well to examples from annotators who did not contribute to the training set.
Evaluating Explanation Without Ground Truth in Interpretable Machine Learning
TLDR
To benchmark evaluation in IML, this article rigorously defines the problem of evaluating explanations, systematically reviews existing state-of-the-art efforts, and summarizes three general aspects of explanation with formal definitions.
Explain Yourself! Leveraging Language Models for Commonsense Reasoning
TLDR
This work collects human explanations for commonsense reasoning, in the form of natural language sequences and highlighted annotations, in a new dataset called Common Sense Explanations (CoS-E), and uses it to train language models to automatically generate explanations that can be used during training and inference in a novel Commonsense Auto-Generated Explanation (CAGE) framework.
From Recognition to Cognition: Visual Commonsense Reasoning
TLDR
To move towards cognition-level understanding, a new reasoning engine is presented, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning.
Inferring Which Medical Treatments Work from Reports of Clinical Trials
TLDR
A new task and corpus are presented for inferring reported findings from a full-text article describing randomized controlled trials (RCTs), with respect to a given intervention, comparator, and outcome of interest; results using a suite of baseline models demonstrate the difficulty of the task.