Corpus ID: 229924292

Human Evaluation of Spoken vs. Visual Explanations for Open-Domain QA

Ana Valeria González, Gagan Bansal, Angela Fan, Robin Jia, Yashar Mehdad, Srini Iyer
While research on explaining the predictions of open-domain QA (ODQA) systems to users is gaining momentum, most work has not evaluated the extent to which explanations improve user trust. The few works that do evaluate explanations with user studies employ settings that may deviate from the end user's usage in the wild: ODQA is most ubiquitous in voice assistants, yet current research only evaluates explanations using a visual display, and may erroneously extrapolate conclusions about the… 

Machine Explanations and Human Understanding

This work provides a general framework along with actionable implications for future algorithmic development and empirical experiments of machine explanations and shows how human intuitions play a central role in enabling human understanding.

Teaching Humans When To Defer to a Classifier via Exemplars

This work presents a novel parameterization of the human's mental model of the AI that applies a nearest neighbor rule in local regions surrounding the teaching examples to derive a near-optimal strategy for selecting a representative teaching set.

Exploring the Role of Local and Global Explanations in Recommender Systems

The results provide evidence suggesting that both explanations together are more helpful than either alone for explaining how to improve recommendations, yet less helpful than global explanations alone for efficiently identifying false positives and negatives.

Advancing Human-AI Complementarity: The Impact of User Expertise and Algorithmic Tuning on Joint Decision Making

This paper reports on a study that examines users’ interactions with three simulated algorithmic models, all with equivalent accuracy rates but each tuned differently in terms of true positive and true negative rates, and provides recommendations on how to design and tune AI algorithms to complement users in decision-making tasks.

Towards a Science of Human-AI Decision Making: A Survey of Empirical Studies

The need to develop common frameworks to account for the design and research spaces of human-AI decision making is highlighted, so that researchers can make rigorous choices in study design, and the research community can build on each other’s work and produce generalizable scientific knowledge.

The Utility of Explainable AI in Ad Hoc Human-Machine Teaming

The results demonstrate that researchers must deliberately design and deploy the right xAI techniques in the right scenario by carefully considering human-machine team composition and how the xAI method augments situation awareness (SA).

Generating Full Length Wikipedia Biographies: The Impact of Gender Bias on the Retrieval-Based Generation of Women Biographies

A model for English text is developed that uses a retrieval mechanism to identify relevant supporting information on the web and a cache-based pre-trained encoder-decoder to generate long-form biographies section by section, including citation information.

Should I Follow AI-based Advice? Measuring Appropriate Reliance in Human-AI Decision-Making

It is proposed to view appropriate reliance (AR) as a two-dimensional construct that measures the ability to discriminate advice quality and behave accordingly; the paper derives the measurement concept, illustrates its application, and outlines potential future research.

Towards Explainable NLP: A Generative Explanation Framework for Text Classification

A novel generative explanation framework that learns to make classification decisions and generate fine-grained explanations at the same time, introducing the explainable factor and a minimum risk training approach that learn to generate more reasonable explanations.

Are Visual Explanations Useful? A Case Study in Model-in-the-Loop Prediction

It is found that presenting model predictions improves human accuracy, but visual explanations of various kinds fail to significantly alter human accuracy or trust in the model - regardless of whether explanations characterize an accurate model, an inaccurate one, or are generated randomly and independently of the input image.

Latent Retrieval for Weakly Supervised Open Domain Question Answering

It is shown for the first time that it is possible to jointly learn the retriever and reader from question-answer string pairs without any IR system, outperforming BM25 by up to 19 points in exact match.

QED: A Framework and Dataset for Explanations in Question Answering

A large user study is described showing that the presence of QED explanations significantly improves the ability of untrained raters to spot errors made by a strong neural QA baseline.

e-SNLI: Natural Language Inference with Natural Language Explanations

The Stanford Natural Language Inference dataset is extended with an additional layer of human-annotated natural language explanations of the entailment relations, which can be used for various goals, such as obtaining full sentence justifications of a model’s decisions, improving universal sentence representations and transferring to out-of-domain NLI datasets.

Explain Yourself! Leveraging Language Models for Commonsense Reasoning

This work collects human explanations for commonsense reasoning, in the form of natural language sequences and highlighted annotations, in a new dataset called Common Sense Explanations. These explanations are used to train language models to automatically generate explanations that can be used during training and inference in a novel Commonsense Auto-Generated Explanation framework.

Proxy tasks and subjective measures can be misleading in evaluating explainable AI systems

This work conducted two online experiments and one in-person think-aloud study to evaluate two currently common techniques for evaluating XAI systems: using proxy, artificial tasks such as how well humans predict the AI's decision from the given explanations, and using subjective measures of trust and preference as predictors of actual performance.

Does the Whole Exceed its Parts? The Effect of AI Explanations on Complementary Team Performance

This work conducts mixed-method user studies on three datasets, where an AI with accuracy comparable to humans helps participants solve a task (explaining itself in some conditions), and observes complementary improvements from AI augmentation that were not increased by explanations.

Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior?

Human subject tests are carried out that are the first of their kind to isolate the effect of algorithmic explanations on a key aspect of model interpretability, simulatability, while avoiding important confounding experimental factors.

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

It is shown that, in comparison to other recently introduced large-scale datasets, TriviaQA has relatively complex, compositional questions, exhibits considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and requires more cross-sentence reasoning to find answers.