Probing Image-Language Transformers for Verb Understanding

Lisa Anne Hendricks, Aida Nematzadeh
Multimodal image–language transformers have achieved impressive results on a variety of tasks that rely on fine-tuning (e.g., visual question answering and image retrieval). We are interested in shedding light on the quality of their pretrained representations – in particular, whether these models can distinguish different types of verbs or whether they rely solely on the nouns in a given sentence. To do so, we collect a dataset of image–sentence pairs (in English) consisting of 421 verbs that are either…
2 Citations

Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers
Pretrained vision-and-language BERTs aim to learn representations that combine information from both modalities. We propose a diagnostic method based on cross-modal input ablation to assess the…
What Vision-Language Models `See' when they See Scenes
Images can be described in terms of the objects they contain, or in terms of the types of scene or place that they instantiate. In this paper we address to what extent pretrained Vision and Language…

References

Words Aren’t Enough, Their Order Matters: On the Robustness of Grounding Visual Referring Expressions
This work critically examines RefCOCOg, a standard benchmark for this task, using a human study and shows that 83.7% of test instances do not require reasoning about linguistic structure: the words alone are enough to identify the target object, regardless of their order.
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
This work balances the popular VQA dataset by collecting complementary images such that every question in this balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question.
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a…
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
The LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework, a large-scale Transformer model that consists of three encoders, achieves state-of-the-art results on two visual question answering datasets and shows the generalizability of the pre-trained cross-modality model.
Are we pretraining it right? Digging deeper into visio-linguistic pretraining
Numerous recent works have proposed pretraining generic visio-linguistic representations and then finetuning them for downstream vision and language tasks. While architecture and objective function…
VisualBERT: A Simple and Performant Baseline for Vision and Language
Analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.
FOIL it! Find One mismatch between Image and Language caption
It is demonstrated that merely utilising language cues is not enough to model FOIL-COCO and that it challenges the state of the art by requiring a fine-grained understanding of the relation between text and image.
Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering
GVQA explicitly disentangles the recognition of visual concepts present in the image from the identification of the plausible answer space for a given question, enabling the model to generalize more robustly across different distributions of answers.
Defoiling Foiled Image Captions
We address the task of detecting foiled image captions, i.e. identifying whether a caption contains a word that has been deliberately replaced by a semantically similar word, thus rendering it…
Hierarchical Question-Image Co-Attention for Visual Question Answering
This paper presents a novel co-attention model for VQA that jointly reasons about image and question attention in a hierarchical fashion via a novel 1-dimensional convolutional neural network (CNN).