Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly

@article{Whitehead2022ReliableVQ,
  title={Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly},
  author={Spencer Whitehead and Suzanne Petryk and Vedaad Shakib and Joseph E. Gonzalez and Trevor Darrell and Anna Rohrbach and Marcus Rohrbach},
  journal={ArXiv},
  year={2022},
  volume={abs/2204.13631}
}
Machine learning has advanced dramatically, narrowing the accuracy gap to humans in multimodal tasks like visual question answering (VQA). However, while humans can say “I don’t know” when they are uncertain (i.e., abstain from answering a question), such ability has been largely neglected in multimodal research, despite the importance of this problem to the usage of VQA in real settings. In this work, we promote a problem formulation for reliable VQA, where we prefer abstention over…

References

SHOWING 1-10 OF 84 REFERENCES
Selective Question Answering under Domain Shift
TLDR
This work proposes the setting of selective question answering under domain shift, in which a QA model is tested on a mixture of in-domain and out-of-domain data, and must answer as many questions as possible while maintaining high accuracy.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
How Much Can CLIP Benefit Vision-and-Language Tasks?
TLDR
It is shown that CLIP significantly outperforms widely-used visual encoders trained with in-domain annotated data, such as BottomUp-TopDown, and also establishes new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks.
VisualBERT: A Simple and Performant Baseline for Vision and Language
TLDR
Analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.
To reject or not to reject: that is the question-an answer in case of neural classifiers
TLDR
A method defining a reject option that is applicable to a given 0-reject classifier and a function P characterizing the reject option's adequacy to the domain has been introduced, showing that P can be expressed as a function of σ and the optimal value for Σ is defined as the one which maximizes the function P.
Obtaining Well Calibrated Probabilities Using Bayesian Binning
TLDR
A new non-parametric calibration method called Bayesian Binning into Quantiles (BBQ) is presented which addresses key limitations of existing calibration methods and can be readily combined with many existing classification algorithms.
On the Foundations of Noise-free Selective Classification
TLDR
This paper presents a thorough analysis of selective classification, including characterizations of risk-coverage (RC) trade-offs in various interesting settings, and constructs algorithms that can optimally or near-optimally achieve the best possible trade-off in a controlled manner.
SelectiveNet: A Deep Neural Network with an Integrated Reject Option
TLDR
This work considers the problem of selective prediction in deep neural networks, and introduces SelectiveNet, a deep neural architecture with an integrated reject option that is trained to optimize both classification (or regression) and rejection simultaneously, end-to-end.
VizWiz Grand Challenge: Answering Visual Questions from Blind People
TLDR
Evaluation of modern algorithms for answering visual questions and deciding if a visual question is answerable reveals that VizWiz is a challenging dataset, which is introduced to encourage a larger community to develop more generalized algorithms that can assist blind people.
An optimum character recognition system using decision functions
  • C. Chow
  • Computer Science
    IRE Trans. Electron. Comput.
  • 1957
The character recognition problem, usually resulting from characters being corrupted by printing deterioration and/or inherent noise of the devices, is considered from the viewpoint of statistical
...