Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
@inproceedings{Jia2021ScalingUV,
  title     = {Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision},
  author    = {Chao Jia and Yinfei Yang and Ye Xia and Yi-Ting Chen and Zarana Parekh and Hieu Pham and Quoc V. Le and Yun-Hsuan Sung and Zhen Li and Tom Duerig},
  booktitle = {International Conference on Machine Learning},
  year      = {2021}
}
Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets…
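Dual-encoder models trained on noisy image-alt-text pairs of this kind are typically optimized with a symmetric in-batch contrastive (InfoNCE) objective: matching pairs in a batch are positives, everything else is a negative. Below is a minimal sketch of such an objective (PyTorch assumed; the image and text encoders themselves are omitted, and this is not the paper's exact training code):

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_emb, text_emb, temperature=0.05):
    """Symmetric in-batch contrastive loss for paired image/text embeddings.

    image_emb, text_emb: [batch, dim] outputs of two independent encoders,
    where row i of each tensor comes from the same (image, alt-text) pair.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Cosine similarities for every image/text combination in the batch,
    # scaled by a temperature (learnable in practice, fixed here).
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching pairs sit on the diagonal; all other entries act as negatives.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```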
828 Citations
Inferring Offensiveness In Images From Natural Language Supervision
- Computer Science, ArXiv
- 2021
It is shown that pre-trained transformers themselves provide a methodology for the automated curation of large-scale vision datasets and that one can select relevant prompts for rating the offensiveness of an image.
Unified Visual Relationship Detection with Vision and Language Models
- Computer Science
- 2023
The UniVRD model is proposed, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models (VLMs), which provides well-aligned image and text embeddings, where similar relationships are optimized to be close to each other for semantic unification.
Grounding Language Models to Images for Multimodal Generation
- Computer Science, ArXiv
- 2023
An effective, general solution is presented for leveraging pretrained language models in visually grounded settings, enabling them to process and generate arbitrarily interleaved image-and-text data.
Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features
- Computer Science, ArXiv
- 2022
This work presents a simple, yet effective strategy for open-vocabulary temporal action detection utilizing pretrained image-text co-embeddings, and proposes a more reasonable open-vocabulary evaluation setting for the ActivityNet dataset, where the category splits are based on similarity rather than random assignment.
REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory
- Computer Science, ArXiv
- 2022
An end-to-end Retrieval-Augmented Visual Language Model is proposed that learns to encode world knowledge into a large-scale memory and to retrieve from it to answer knowledge-intensive queries, achieving state-of-the-art results on visual question answering and image captioning.
Human alignment of neural network representations
- Computer Science, Biology, ArXiv
- 2022
Overall, although models trained on larger, more diverse datasets achieve better alignment with humans than models trained on ImageNet alone, the results indicate that scaling alone is unlikely to be sufficient to train neural networks with conceptual representations that match those used by humans.
MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning
- Computer Science, ArXiv
- 2022
This paper proposes a jointly masked multimodal modeling method that achieves state-of-the-art performance on various downstream vision-language tasks, including image-text retrieval, visual question answering, visual reasoning, and weakly-supervised visual grounding.
Generalization Properties of Retrieval-based Models
- Computer Science, ArXiv
- 2022
A formal treatment of retrieval-based models to characterize their generalization ability is presented, and it is shown that breaking down the underlying learning task into local sub-tasks enables the model to employ a low complexity parametric component to ensure good overall accuracy.
ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training
- Computer Science, ArXiv
- 2022
This paper proposes ERNIE-ViL 2.0, a multi-view contrastive learning framework to build intra-modal and inter-modal correlations between diverse views simultaneously, aiming at learning a more robust cross-modal representation.
U-BERT for Fast and Scalable Text-Image Retrieval
- Computer Science, ICTIR
- 2022
A U-BERT model is proposed to achieve effective and efficient cross-modal retrieval, computing text-image similarity scores with two independent encoders at linear computational complexity.
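The efficiency claim rests on the fact that two independent encoders let all gallery embeddings be computed once offline, so answering a text query is a single matrix-vector product over precomputed vectors rather than a joint forward pass per candidate. A rough sketch of that retrieval pattern (NumPy; `image_encoder` and `text_encoder` are hypothetical stand-ins for the paper's encoders):

```python
import numpy as np

def build_gallery(image_encoder, images):
    """Encode and L2-normalize every gallery image once, offline."""
    emb = np.stack([image_encoder(img) for img in images])      # [N, dim]
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

def retrieve(text_encoder, query, gallery_emb, top_k=5):
    """Score one text query against N precomputed image embeddings in O(N * dim)."""
    q = text_encoder(query)
    q = q / np.linalg.norm(q)
    scores = gallery_emb @ q                                     # cosine similarities
    ranked = np.argsort(-scores)
    return ranked[:top_k], scores[ranked[:top_k]]
```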
References
SHOWING 1-10 OF 82 REFERENCES
Sharpness-Aware Minimization for Efficiently Improving Generalization
- Computer Science, ICLR
- 2021
This work introduces a novel, effective procedure for simultaneously minimizing loss value and loss sharpness, Sharpness-Aware Minimization (SAM), which improves model generalization across a variety of benchmark datasets and models, yielding new state-of-the-art performance on several of them.
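SAM's update can be summarized as two gradient evaluations per step: perturb the weights toward the locally worst-case direction within an L2 ball of radius rho, then descend using the gradient taken at that perturbed point. A minimal single-step sketch (PyTorch assumed; refinements discussed in the paper are not shown):

```python
import torch

def sam_step(model, loss_fn, x, y, base_optimizer, rho=0.05):
    """One Sharpness-Aware Minimization step (simplified sketch)."""
    # 1) Gradient at the current weights.
    loss = loss_fn(model(x), y)
    loss.backward()
    with torch.no_grad():
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2)
        scale = rho / (grad_norm + 1e-12)
        # 2) Ascend to the approximate worst-case weights w + eps inside the rho-ball.
        eps = []
        for p in model.parameters():
            e = p.grad * scale if p.grad is not None else None
            if e is not None:
                p.add_(e)
            eps.append(e)
    model.zero_grad()
    # 3) Gradient at the perturbed weights.
    loss_fn(model(x), y).backward()
    # 4) Undo the perturbation and apply the sharpness-aware gradient.
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)
    base_optimizer.step()
    model.zero_grad()
    return loss.item()
```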
Big Transfer (BiT): General Visual Representation Learning
- Computer Science, ECCV
- 2020
By combining a few carefully selected components, and transferring using a simple heuristic, Big Transfer achieves strong performance on over 20 datasets and performs well across a surprisingly wide range of data regimes -- from 1 example per class to 1M total examples.
Classification is a Strong Baseline for Deep Metric Learning
- Computer Science, BMVC
- 2019
This paper evaluates on several standard retrieval datasets such as Cars-196, CUB-200-2011, Stanford Online Products, and In-Shop for image retrieval and clustering, and establishes that the classification-based approach is competitive across different feature dimensions and base feature networks.
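The baseline boils down to training an ordinary softmax classifier and then using the L2-normalized penultimate features as the retrieval embedding, ranking gallery images by cosine similarity. A minimal sketch of that recipe (PyTorch assumed; `backbone` is any feature extractor, and the paper's additional design choices are omitted):

```python
import torch
import torch.nn.functional as F

class ClassifierEmbedder(torch.nn.Module):
    """Softmax classifier whose normalized penultimate features serve as the
    retrieval embedding (train with cross-entropy, retrieve by cosine similarity)."""

    def __init__(self, backbone, feat_dim, num_classes, emb_dim=512):
        super().__init__()
        self.backbone = backbone                       # any image feature extractor
        self.embed = torch.nn.Linear(feat_dim, emb_dim)
        self.classifier = torch.nn.Linear(emb_dim, num_classes)

    def forward(self, x):
        emb = self.embed(self.backbone(x))
        logits = self.classifier(emb)                  # used only for the training loss
        return logits, F.normalize(emb, dim=-1)        # normalized embedding for retrieval
```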
Microsoft COCO Captions: Data Collection and Evaluation Server
- Computer Science, ArXiv
- 2015
The Microsoft COCO Caption dataset and evaluation server are described, and several popular metrics, including BLEU, METEOR, ROUGE, and CIDEr, are used to score candidate captions.
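For example, the n-gram-overlap metrics score a candidate caption against several human references per image; a small illustration using NLTK's BLEU implementation (the captions here are made up, and the official server additionally reports METEOR, ROUGE, and CIDEr):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical references and candidate for a single image.
references = [
    "a man riding a wave on top of a surfboard".split(),
    "a surfer rides a large wave in the ocean".split(),
]
candidate = "a man surfing a big wave".split()

# BLEU-4 with uniform n-gram weights; smoothing avoids zero scores on short captions.
score = sentence_bleu(references, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {score:.3f}")
```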
ImageNet: A large-scale hierarchical image database
- Computer Science, 2009 IEEE Conference on Computer Vision and Pattern Recognition
- 2009
A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
DeViSE: A Deep Visual-Semantic Embedding Model
- Computer Science, NeurIPS
- 2013
Crisscrossed Captions: Extended Intramodal and Intermodal Semantic Similarity Judgments for MS-COCO
- Computer Science, EACL
- 2021
By supporting multi-modal retrieval training and evaluation, image captioning datasets have spurred remarkable progress on representation learning. Unfortunately, datasets have limited cross-modal…
Self-Training With Noisy Student Improves ImageNet Classification
- Computer Science, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
We present a simple self-training method that achieves 88.4% top-1 accuracy on ImageNet, which is 2.0% better than the state-of-the-art model that requires 3.5B weakly labeled Instagram images. On…
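The method alternates between pseudo-labeling unlabeled images with a teacher and training a noised, equal-or-larger student on labeled plus pseudo-labeled data, then promoting the student to teacher. A schematic sketch (`train_model` and `predict` are hypothetical helpers, and details such as class balancing are omitted):

```python
def noisy_student(train_model, predict, labeled, unlabeled,
                  rounds=3, confidence_threshold=0.3):
    """Schematic self-training loop in the spirit of Noisy Student.

    train_model(data, noisy) -> model; noisy=True enables augmentation, dropout,
                                stochastic depth, etc.  (hypothetical helper)
    predict(model, xs)       -> list of (label, confidence)  (hypothetical helper)
    """
    teacher = train_model(labeled, noisy=False)
    for _ in range(rounds):
        # Teacher pseudo-labels the unlabeled pool; keep only confident predictions.
        pseudo = [(x, y) for x, (y, c) in zip(unlabeled, predict(teacher, unlabeled))
                  if c >= confidence_threshold]
        # Student (same size or larger) trains with noise on labeled + pseudo-labeled data.
        student = train_model(labeled + pseudo, noisy=True)
        teacher = student  # the student becomes the next round's teacher
    return teacher
```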
Learning Transferable Visual Models From Natural Language Supervision
- Computer Science, ICML
- 2021
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
Learning the Best Pooling Strategy for Visual Semantic Embedding
- Computer Science, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2021
A Generalized Pooling Operator (GPO) is proposed, which learns to automatically adapt itself to the best pooling strategy for different features, requiring no manual tuning while staying effective and efficient, and can serve as a plug-and-play feature aggregation module for standard VSE models.
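One way to read such a pooling operator is as a learned weighting over the per-dimension sorted feature values, which recovers max pooling (all weight on the top value) and mean pooling (uniform weights) as special cases. A rough sketch of that idea (PyTorch assumed; this stores the coefficients as a fixed parameter rather than generating them as the paper does):

```python
import torch

class GeneralizedPooling(torch.nn.Module):
    """Pool a variable-size set of D-dim feature vectors into a single D-dim vector
    via a learned combination of the sorted values in each dimension."""

    def __init__(self, max_items):
        super().__init__()
        # One learnable coefficient per sorted position (zeros -> mean pooling at init).
        self.theta = torch.nn.Parameter(torch.zeros(max_items))

    def forward(self, features):                                  # features: [N, D]
        n = features.size(0)
        sorted_vals, _ = features.sort(dim=0, descending=True)    # [N, D]
        weights = torch.softmax(self.theta[:n], dim=0)            # sums to 1 over positions
        return (weights.unsqueeze(1) * sorted_vals).sum(dim=0)    # [D]
```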