Corpus ID: 230435588

VinVL: Making Visual Representations Matter in Vision-Language Models

@article{Zhang2021VinVLMV,
  title={VinVL: Making Visual Representations Matter in Vision-Language Models},
  author={Pengchuan Zhang and Xiujun Li and Xiaowei Hu and Jianwei Yang and Lei Zhang and Lijuan Wang and Yejin Choi and Jianfeng Gao},
  journal={ArXiv},
  year={2021},
  volume={abs/2101.00529}
}
This paper presents a detailed study of improving visual representations for vision-language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model [2], the new model is bigger, better-designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. Therefore, it can generate representations of a richer…
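The recipe the abstract describes (an object detector supplying region-level, object-centric features that a cross-modal transformer then fuses with text) can be illustrated with a minimal sketch. This is not the released VinVL/Oscar code; the class name RegionToVLInput, the 2048-d region features, the 6-d box geometry, and all tensor shapes are assumptions chosen for illustration only.

```python
# Minimal sketch (not the released VinVL code) of an object-centric VL pipeline:
# a detector yields per-region features plus box geometry, which are projected
# and concatenated with text token embeddings before a cross-modal transformer.
import torch
import torch.nn as nn

class RegionToVLInput(nn.Module):
    """Hypothetical adapter: maps detector region features into the VL embedding space."""
    def __init__(self, region_dim=2048, box_dim=6, hidden_dim=768):
        super().__init__()
        # 2048-d region features and 6-d box geometry are illustrative assumptions.
        self.proj = nn.Linear(region_dim + box_dim, hidden_dim)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, region_feats, boxes):
        # region_feats: (batch, num_regions, region_dim); boxes: (batch, num_regions, box_dim)
        return self.norm(self.proj(torch.cat([region_feats, boxes], dim=-1)))

# Usage sketch: fuse text embeddings with projected region features.
text_emb = torch.randn(2, 20, 768)        # e.g. BERT token embeddings
region_feats = torch.randn(2, 36, 2048)   # e.g. 36 detected regions per image
boxes = torch.randn(2, 36, 6)             # e.g. normalized box coordinates, width, height
adapter = RegionToVLInput()
vl_input = torch.cat([text_emb, adapter(region_feats, boxes)], dim=1)   # (2, 56, 768)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True), num_layers=2)
fused = encoder(vl_input)                 # cross-modal contextualized features
```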
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
TLDR
This paper introduces a contrastive loss to ALign the image and text representations BEfore Fusing through cross-modal attention, which enables more grounded vision and language representation learning and proposes momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
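A rough sketch of the two ideas in that summary: a symmetric image-text contrastive (InfoNCE-style) loss applied before fusion, and an exponential-moving-average "momentum" copy of the encoders whose outputs can serve as pseudo-targets. The function names, the 0.07 temperature, and the 0.995 momentum below are illustrative assumptions, not values or code taken from the paper.

```python
# Hedged sketch of (i) an image-text contrastive loss and (ii) a momentum (EMA)
# encoder update, the two ingredients mentioned in the TLDR above.
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature              # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching image-text pairs sit on the diagonal; treat them as the positives.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

@torch.no_grad()
def ema_update(momentum_model, online_model, m=0.995):
    """EMA update of the momentum encoder that produces pseudo-targets for distillation."""
    for p_m, p_o in zip(momentum_model.parameters(), online_model.parameters()):
        p_m.data.mul_(m).add_(p_o.data, alpha=1.0 - m)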
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention
TLDR
The proposed DiM (short for Disentangled Multimodal-Attention) module can be easily incorporated into existing pre-trained V-L models to boost their performance, up to a 5% increase on the representative task, and the effectiveness of the introduced visual concepts is demonstrated.
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
TLDR
BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones, and demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.
E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning
TLDR
This paper proposes the first end-to-end vision-language pre-trained model for both V+L understanding and generation, namely E2E-VLP, where a unified Transformer framework is built to jointly learn visual representation, and semantic alignments between image and text.
UFO: A UniFied TransfOrmer for Vision-Language Representation Learning
TLDR
A single UniFied transfOrmer (UFO), capable of processing either unimodal or multimodal inputs for vision-language (VL) representation learning, achieves new state-of-the-art results on visual question answering, COCO image captioning, and nocaps.
Compressing Visual-linguistic Model via Knowledge Distillation
TLDR
This paper uses the mean square error loss to mimic the attention distribution inside the transformer block, and presents a token-wise noise contrastive loss to align the hidden state by contrasting with negative representations stored in a sample queue.
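A hedged sketch of the two losses this summary names: an MSE term that matches student attention maps to the teacher's, and a token-wise noise-contrastive term that contrasts each student hidden state against its teacher counterpart and a queue of stored negatives. Shapes, the queue handling, and the temperature are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch of MSE attention-map distillation and a token-wise
# noise-contrastive loss against a queue of negative hidden states.
import torch
import torch.nn.functional as F

def attention_mse_loss(student_attn, teacher_attn):
    # Both: (batch, heads, seq, seq) attention distributions from matching layers.
    return F.mse_loss(student_attn, teacher_attn)

def token_contrastive_loss(student_hidden, teacher_hidden, negative_queue, temperature=0.05):
    # student_hidden, teacher_hidden: (batch*seq, dim); negative_queue: (queue_size, dim).
    s = F.normalize(student_hidden, dim=-1)
    t = F.normalize(teacher_hidden, dim=-1)
    q = F.normalize(negative_queue, dim=-1)
    pos = (s * t).sum(dim=-1, keepdim=True)    # similarity to the aligned teacher token
    neg = s @ q.t()                            # similarities to queued negatives
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)     # the positive sits at index 0
```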
MLP Architectures for Vision-and-Language Modeling: An Empirical Study
TLDR
The first empirical study on the use of MLP architectures for vision-and-language (VL) fusion finds that without pre-training, using MLPs for multimodal fusion has a noticeable performance gap compared to transformers; however, VL pre-training can help close the performance gap; and suggests that MLPs can effectively learn to align vision and text features extracted from lower-level encoders without heavy reliance on self-attention.
On Guiding Visual Attention with Language Specification
TLDR
This work grounds task-relevant words or phrases with attention maps from a pretrained large-scale model and shows that supervising spatial attention in this way improves performance on classification tasks with biased and noisy data, including ∼3−15% worst-group accuracy improvements and ∼41−45% relative improvements on fairness metrics.
Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization
TLDR
This paper presents a simple yet effective method to construct vision guided GPLMs for the MAS task using attention-based add-on layers to incorporate visual information while maintaining their original text generation ability.
Vision-Language Pretraining: Current Trends and the Future
In the last few years, there has been an increased interest in building multimodal (vision-language) models that are pretrained on larger but noisier datasets where the two modalities (e.g., image and text)…

References

Showing 1–10 of 49 references
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a…
VisualBERT: A Simple and Performant Baseline for Vision and Language
TLDR
Analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.
Unified Vision-Language Pre-Training for Image Captioning and VQA
TLDR
VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions and VQA 2.0.
12-in-1: Multi-Task Vision and Language Representation Learning
TLDR
This work develops a large-scale, multi-task model that culminates in a single model on 12 datasets from four broad categories of task including visual question answering, caption-based image retrieval, grounding referring expressions, and multimodal verification and shows that finetuning task-specific models from this model can lead to further improvements, achieving performance at or above the state-of-the-art.
From captions to visual concepts and back
TLDR
This paper uses multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives, and develops a maximum-entropy language model.
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
TLDR
After pretraining on large-scale image-caption pairs, Unicoder-VL is transferred to caption-based image-text retrieval and visual commonsense reasoning, with just one additional output layer, and shows the powerful ability of the cross-modal pre-training.
In Defense of Grid Features for Visual Question Answering
TLDR
This paper revisits grid features for VQA, and finds they can work surprisingly well -- running more than an order of magnitude faster with the same accuracy (e.g. if pre-trained in a similar fashion).
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
TLDR
The Visual Genome dataset is presented, which contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects, and represents the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answer pairs.
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
TLDR
The LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework, a large-scale Transformer model that consists of three encoders, achieves the state-of-the-art results on two visual question answering datasets and shows the generalizability of the pre-trained cross-modality model.
Deep Visual-Semantic Alignments for Generating Image Descriptions
A. Karpathy, Li Fei-Fei · IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017
TLDR
A model that generates natural language descriptions of images and their regions based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding is presented.