Corpus ID: 235829401

How Much Can CLIP Benefit Vision-and-Language Tasks?

@article{Shen2021HowMC,
  title={How Much Can CLIP Benefit Vision-and-Language Tasks?},
  author={Sheng Shen and Liunian Harold Li and Hao Tan and Mohit Bansal and Anna Rohrbach and Kai-Wei Chang and Zhewei Yao and Kurt Keutzer},
  journal={ArXiv},
  year={2021},
  volume={abs/2107.06383}
}
Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using a relatively small set of manually-annotated data (as compared to web-crawled data) to perceive the visual world. However, it has been observed that large-scale pre-training usually results in better generalization performance, e.g., CLIP (Contrastive Language-Image Pre-training), trained on a massive amount of image-caption pairs, has shown a strong zero-shot capability on various vision tasks. To… 
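To make the zero-shot capability mentioned above concrete, the following is a minimal sketch of CLIP zero-shot classification using the openai/CLIP package: an image embedding is compared against embeddings of textual class prompts and softmaxed into class probabilities. The model name, image path, and class prompts are illustrative placeholders, not values from the paper.

```python
# Minimal sketch of CLIP zero-shot classification with the openai/CLIP package.
# Model name, image path, and class prompts are illustrative placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

prompts = ["a photo of a dog", "a photo of a cat"]   # hypothetical class prompts
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    # Cosine similarity between the image and each prompt, scaled and softmaxed
    # into class probabilities; the highest-probability prompt is the prediction.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print(prompts[probs.argmax().item()])
```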
CLIP Models are Few-Shot Learners: Empirical Studies on VQA and Visual Entailment
TLDR
This work empirically shows that CLIP can be a strong vision-language few-shot learner by leveraging the power of language and proposes a parameter-efficient fine-tuning strategy to boost the few-shot performance on the VQA task.
CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks
TLDR
An evaluation protocol covering Visual Commonsense Reasoning, Visual Entailment, and Visual Question Answering is introduced, along with CLIP Targeted Distillation (CLIP-TD), an approach that distills knowledge from CLIP into existing architectures using a dynamically weighted objective applied to adaptively selected tokens per instance.
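For orientation only, the sketch below shows a generic per-token feature-distillation term from a frozen CLIP teacher to a student model; it does not reproduce CLIP-TD's adaptive token selection or dynamic weighting, and the per-token weights are simply assumed to be given.

```python
import torch
import torch.nn.functional as F

def feature_distillation_loss(student_feats: torch.Tensor,
                              teacher_feats: torch.Tensor,
                              token_weights: torch.Tensor) -> torch.Tensor:
    """Generic per-token feature distillation from a frozen CLIP teacher (illustrative).

    student_feats, teacher_feats: [batch, tokens, dim] token features.
    token_weights: [batch, tokens] per-token weights; CLIP-TD derives these
    adaptively per instance, whereas here they are simply assumed to be given.
    """
    # 1 - cosine similarity penalizes student tokens that drift from the teacher's.
    per_token = 1.0 - F.cosine_similarity(student_feats, teacher_feats.detach(), dim=-1)
    return (token_weights * per_token).mean()
```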
The Unreasonable Effectiveness of CLIP Features for Image Captioning: An Experimental Analysis
TLDR
Through extensive experiments, it is shown how CLIP can significantly outperform widely used visual encoders, and its role is quantified under different architectures, variants, and evaluation protocols, ranging from classical captioning performance to zero-shot transfer.
Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation
TLDR
The dual-stream VLP model is augmented with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD), enabling multimodal generation; the original text understanding and generation ability of the PLM is maintained after VLKD, which makes the model versatile for both multimodal and unimodal tasks.
VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks
TLDR
This paper introduces adapter-based parameter-efficient transfer learning techniques to V&L models such as VL-BART and VL-T5 and demonstrates that training the adapter with the weight-sharing technique can match the performance of fine-tuning the entire model.
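As background for the adapter-based approach above, the sketch below shows the standard bottleneck adapter pattern (down-project, nonlinearity, up-project, residual) in PyTorch; the hidden and bottleneck dimensions and the placement inside the transformer block are illustrative assumptions, not VL-Adapter's exact configuration.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, plus a residual.

    Hidden/bottleneck sizes and placement inside the transformer block are
    illustrative, not VL-Adapter's exact configuration.
    """
    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen backbone's computation intact.
        return x + self.up(self.act(self.down(x)))

# During fine-tuning, only the adapter parameters (and typically layer norms)
# are updated while the pre-trained backbone stays frozen.
```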
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
TLDR
This work presents FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both coarse-grained (image-level) and fine-grained (region-level) tasks and provides consistent performance improvements over strong baselines across all tasks, often outperforming methods that use magnitudes more data.
An Empirical Study of Training End-to-End Vision-and-Language Transformers
TLDR
METER, a Multimodal End-to-end TransformER framework, is presented, through which it is investigated how to design and pre-train a fully transformer-based VL model in an end-to-end manner, providing insights on how to train a performant VL transformer.
From Show to Tell: A Survey on Deep Learning-based Image Captioning.
TLDR
This work aims at providing a comprehensive overview of image captioning approaches, from visual encoding and text generation to training strategies, datasets, and evaluation metrics, and quantitatively compares many relevant state-of-the-art approaches to identify the most impactful technical innovations in architectures and training strategies.
Simple but Effective: CLIP Embeddings for Embodied AI
TLDR
One of the baselines is extended, producing an agent capable of zero-shot object navigation that can navigate to objects that were not used as targets during training, and it beats the winners of the 2021 Habitat ObjectNav Challenge, which employ auxiliary tasks, depth maps, and human demonstrations, and those of the 2019 Habitat PointNav Challenge.
GIT: A Generative Image-to-text Transformer for Vision and Language
TLDR
This paper designs and trains GIT, a Generative Image-to-text Transformer, to unify vision-language tasks such as image/video captioning and question answering, and presents a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks.
...

References

SHOWING 1-10 OF 53 REFERENCES
Learning Visual Representations with Caption Annotations
TLDR
It is argued that captioned images are easily crawlable and can be exploited to supervise the training of visual representations; the proposed hybrid models, with dedicated visual and textual encoders, show that the visual representations learned as a by-product of solving this task transfer well to a variety of target tasks.
Unified Vision-Language Pre-Training for Image Captioning and VQA
TLDR
VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions and VQA 2.0.
Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions
TLDR
This work proposes to conduct “mask-and-predict” pre-training on text-only and image-only corpora, introduces object tags detected by an object recognition model as anchor points to bridge the two modalities, and finds that such a simple approach achieves performance close to a model pre-trained with aligned data on four English V&L benchmarks.
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
TLDR
A minimal VLP model, Vision-and-Language Transformer (ViLT), is presented; it is monolithic in the sense that processing of visual inputs is drastically simplified to the same convolution-free manner in which textual inputs are processed, and ViLT is up to 60 times faster than previous VLP models, yet with competitive or better downstream task performance.
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
TLDR
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss, and it is shown that the scale of the corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
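The contrastive loss mentioned above is the standard symmetric image-text InfoNCE objective over a batch of paired embeddings; a minimal PyTorch sketch, assuming the image and text embeddings have already been computed by the two encoders, is:

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric image-text contrastive (InfoNCE) loss over a batch of pairs.

    image_emb, text_emb: [batch, dim] embeddings of paired images and captions;
    matching pairs share a batch index, all other pairings act as negatives.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature              # [batch, batch] similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)                # image -> matching text
    loss_t2i = F.cross_entropy(logits.t(), targets)            # text -> matching image
    return 0.5 * (loss_i2t + loss_t2i)
```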
In Defense of Grid Features for Visual Question Answering
TLDR
This paper revisits grid features for VQA, and finds they can work surprisingly well -- running more than an order of magnitude faster with the same accuracy (e.g. if pre-trained in a similar fashion).
VinVL: Revisiting Visual Representations in Vision-Language Models
TLDR
This paper develops an improved object detection model to provide object-centric representations of images, feeds the generated visual features into a Transformer-based VL fusion model, OSCAR, and utilizes an improved approach, OSCAR+, to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks.
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
TLDR
This paper proposes a new learning method Oscar (Object-Semantics Aligned Pre-training), which uses object tags detected in images as anchor points to significantly ease the learning of alignments.
VisualBERT: A Simple and Performant Baseline for Vision and Language
TLDR
Analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.
Visual7W: Grounded Question Answering in Images
TLDR
Object-level grounding establishes a semantic link between textual descriptions and image regions, enabling a new type of QA with visual answers in addition to the textual answers used in previous work; a novel LSTM model with spatial attention is proposed to tackle the 7W QA tasks.
...