Corpus ID: 231591445

Learning Transferable Visual Models From Natural Language Supervision

@inproceedings{Radford2021LearningTV,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
  booktitle={ICML},
  year={2021}
}
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an… 
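The pre-training task in the abstract, predicting which caption goes with which image over a batch, is usually implemented as a symmetric contrastive loss. The sketch below is a minimal PyTorch illustration under the assumption that hypothetical image and text encoders have already produced batched embeddings; the logit scale default is a placeholder, and this is not the authors' exact code.

import torch
import torch.nn.functional as F

def clip_style_loss(image_features, text_features, logit_scale=100.0):
    # image_features, text_features: (N, d) embeddings of N paired examples,
    # produced by hypothetical image and text encoders (not shown here)
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Cosine-similarity logits for every (image, caption) combination in the batch
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # The matching caption for image i sits at index i, so the targets are the diagonal
    targets = torch.arange(image_features.size(0))
    loss_images = F.cross_entropy(logits_per_image, targets)
    loss_texts = F.cross_entropy(logits_per_text, targets)
    return (loss_images + loss_texts) / 2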
K-LITE: Learning Transferable Visual Models with External Knowledge
TLDR
This paper proposes K-LITE, a simple strategy that leverages external knowledge to build transferable visual systems, and presents knowledge-augmented models that show improvements in transfer learning performance over existing methods.
Data-Efficient Language-Supervised Zero-Shot Learning with Self-Distillation
TLDR
This work proposes a data-efficient contrastive distillation method that uses soft labels to learn from noisy image-text pairs, and exceeds the previous SoTA for general zero-shot learning on ImageNet 21k+1k by a relative 73% with a ResNet50 image encoder and a DeCLUTR text encoder.
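The "soft labels" mentioned above can be read as replacing the usual one-hot contrastive targets with a softened similarity distribution from a teacher model. The routine below is a hedged sketch of that idea, with the teacher/student interface and temperatures chosen for illustration rather than taken from the paper.

import torch
import torch.nn.functional as F

def soft_label_contrastive_loss(student_sim, teacher_sim, tau_student=0.07, tau_teacher=0.1):
    # student_sim, teacher_sim: (N, N) image-to-caption similarity matrices for a batch.
    # Softened teacher similarities replace the hard one-hot (diagonal) targets,
    # so partially matching captions in noisy web data still contribute signal.
    soft_targets = F.softmax(teacher_sim / tau_teacher, dim=-1)
    log_probs = F.log_softmax(student_sim / tau_student, dim=-1)
    # KL divergence between the teacher and student distributions over captions
    return F.kl_div(log_probs, soft_targets, reduction="batchmean")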
A Fistful of Words: Learning Transferable Visual Models from Bag-of-Words Supervision
TLDR
This paper shows that a simple Bag-of-Words (BoW) caption can replace most of the image captions in the dataset, and observes that this approach improves zero-shot classification performance when combined with word balancing.
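For illustration, replacing a caption with a bag of words amounts to discarding word order and duplicates before the text is fed to the text encoder; the toy routine below, including its crude form of word balancing, is an assumption about the general idea, not the paper's pipeline.

import re
from collections import Counter

def caption_to_bow(caption, word_counts, max_count=None):
    # Drop word order and duplicates; lowercase and strip punctuation
    words = sorted(set(re.findall(r"[a-z]+", caption.lower())))
    if max_count is not None:
        # Crude word balancing: skip words that are already over-represented
        words = [w for w in words if word_counts[w] < max_count]
    word_counts.update(words)
    return " ".join(words)

counts = Counter()
print(caption_to_bow("A photo of a dog playing with a ball", counts))
# prints "a ball dog of photo playing with"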
Learning to Prompt for Vision-Language Models
TLDR
Context Optimization (CoOp), a simple approach for adapting CLIP-like vision-language models to downstream image recognition, is proposed; it requires as few as one or two shots to beat hand-crafted prompts by a decent margin and gains significant further improvements when using more shots.
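A rough sketch of the prompt-learning idea behind CoOp: the hand-crafted prompt words are replaced by a few learnable context embeddings that are optimized with a handful of labeled shots while the CLIP encoders stay frozen. The tensor shapes and the way the class-name embeddings are concatenated below are assumptions made for illustration.

import torch
import torch.nn as nn

class LearnableContext(nn.Module):
    # A small set of learnable context embeddings shared across classes,
    # standing in for hand-written prompt words such as "a photo of a".
    def __init__(self, n_ctx=16, embed_dim=512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)

    def forward(self, class_token_embeddings):
        # class_token_embeddings: (num_classes, n_tokens, embed_dim) embeddings of the
        # class names; a frozen text encoder (not shown) consumes the concatenation.
        n_cls = class_token_embeddings.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        return torch.cat([ctx, class_token_embeddings], dim=1)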
A Simple Long-Tailed Recognition Baseline via Vision-Language Model
TLDR
This work proposes BALLAD, a simple and effective approach that leverages contrastive vision-language models for long-tailed recognition, sets new state-of-the-art performance, and outperforms competitive baselines by a large margin.
Data Efficient Language-supervised Zero-shot Recognition with Optimal Transport Distillation
TLDR
OTTER (Optimal TransporT distillation for Efficient zero-shot Recognition) uses online entropic optimal transport to find soft image-text matches as labels for contrastive learning, and achieves strong performance with only 3M image-text pairs.
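The entropic optimal transport step described above can be approximated with a few Sinkhorn iterations over the batch similarity matrix, producing a soft matching that replaces hard labels in the contrastive loss. The function below is a generic Sinkhorn sketch under assumed uniform marginals, not OTTER's exact formulation.

import torch

def sinkhorn_soft_match(sim, epsilon=0.05, n_iters=5):
    # sim: (N, N) image-to-text similarity matrix for one batch
    K = torch.exp(sim / epsilon)                       # Gibbs kernel
    u = torch.ones(sim.size(0), device=sim.device)
    v = torch.ones(sim.size(1), device=sim.device)
    for _ in range(n_iters):                           # Sinkhorn scaling iterations
        u = 1.0 / (K @ v)
        v = 1.0 / (K.t() @ u)
    plan = u[:, None] * K * v[None, :]
    # Each row, renormalized, can serve as a soft target distribution over captions
    return plan / plan.sum(dim=1, keepdim=True)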
Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations
TLDR
This work describes how to generate a dataset of over a billion images for large-scale weakly supervised pretraining of visual representations, and leverages Transformers to replace the traditional convolutional backbone, with insights into both system and performance improvements, especially at 1B+ image scale.
Multimodal Few-Shot Learning with Frozen Language Models
TLDR
The resulting system is a multimodal few-shot learner, with the surprising ability to learn a variety of new tasks when conditioned on examples represented as a sequence of interleaved image and text embeddings.
Learning to Generate Scene Graph from Natural Language Supervision
TLDR
This paper proposes one of the first methods that learn from image-sentence pairs to extract a graphical representation of localized objects and their relationships within an image, known as a scene graph, and designs a Transformer-based model to predict these "pseudo" labels via a masked token prediction task.
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
TLDR
BLIP effectively utilizes noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones, and demonstrates strong generalization when directly transferred to video-language tasks in a zero-shot manner.
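The captioner-plus-filter bootstrapping described above can be pictured as a simple data-cleaning loop: generate a synthetic caption for each web image, then keep only the image-text pairs a filter model scores as matching. The interfaces and threshold below are hypothetical placeholders for illustration, not BLIP's actual components.

def bootstrap_web_pairs(web_pairs, captioner, filter_model, threshold=0.5):
    # web_pairs: iterable of (image, noisy_web_caption).
    # captioner and filter_model are assumed, pretrained components with hypothetical
    # interfaces: captioner(image) -> str, filter_model(image, caption) -> match score
    cleaned = []
    for image, web_caption in web_pairs:
        synthetic_caption = captioner(image)              # generate a synthetic caption
        for caption in (web_caption, synthetic_caption):
            if filter_model(image, caption) > threshold:  # keep only well-matched pairs
                cleaned.append((image, caption))
    return cleaned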

References

SHOWING 1-10 OF 226 REFERENCES
Learning Visual Representations with Caption Annotations
TLDR
It is argued that captioned images are easily crawlable and can be exploited to supervise the training of visual representations, and the proposed hybrid models, with dedicated visual and textual encoders, show that the visual representations learned as a by-product of solving this task transfer well to a variety of target tasks.
Learning Everything about Anything: Webly-Supervised Visual Concept Learning
TLDR
A fully-automated approach is presented for learning extensive models for a wide range of variations within any concept; it leverages vast resources of online books to discover the vocabulary of variance, and intertwines the data collection and modeling steps to alleviate the need for explicit human supervision in training the models.
A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark
Representation learning promises to unlock deep learning for the long tail of vision tasks without expensive labelled datasets. Yet, the absence of a unified evaluation for general visual representations hinders progress.
Predicting Deep Zero-Shot Convolutional Neural Networks Using Textual Descriptions
TLDR
A new model is presented that can classify unseen categories from their textual description; it takes advantage of the architecture of CNNs and learns features at different layers, rather than just learning an embedding space for both modalities, as is common with existing approaches.
Zero-Shot Learning Through Cross-Modal Transfer
TLDR
This work introduces a model that can recognize objects in images even if no training data is available for the object class, and uses novelty detection methods to differentiate unseen classes from seen classes.
Self-Supervised Learning of Visual Features through Embedding Images into Text Topic Spaces
TLDR
It is shown that discriminative visual features can be learnt efficiently by training a CNN to predict the semantic context in which a particular image is most likely to appear as an illustration, and the hidden semantic structures discovered in the text corpus are leveraged with a well-known topic modeling technique.
Exploring the Limits of Weakly Supervised Pretraining
TLDR
This paper presents a unique study of transfer learning with large convolutional networks trained to predict hashtags on billions of social media images, shows improvements on several image classification and object detection tasks, and reports the highest ImageNet-1k single-crop, top-1 accuracy to date.
Generative Pretraining From Pixels
TLDR
This work trains a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure, and finds that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification.
Learning Visual N-Grams from Web Data
TLDR
This paper develops visual n-gram models that can predict arbitrary phrases that are relevant to the content of an image, and demonstrates the merits of the models in phrase prediction, phrase-based image retrieval, relating images and captions, and zero-shot transfer.
Language Models are Unsupervised Multitask Learners
TLDR
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.