• Corpus ID: 231879586

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, Tom Duerig
International Conference on Machine Learning

Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets… 

Inferring Offensiveness In Images From Natural Language Supervision

It is shown that pre-trained transformers themselves provide a methodology for the automated curation of large-scale vision datasets and that one can select relevant prompts for rating the offensiveness of an image.

Unified Visual Relationship Detection with Vision and Language Models

The UniVRD model is proposed, a novel bottom-up method for Unified Visual Relationship Detection that leverages vision and language models (VLMs), which provide well-aligned image and text embeddings in which similar relationships are optimized to be close to each other for semantic unification.

Grounding Language Models to Images for Multimodal Generation

An effective, general solution for leveraging pretrained language models in visually grounded settings, enabling them to process and generate arbitrarily interleaved image-and-text data.

Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features

This work presents a simple yet effective strategy for open-vocabulary temporal action detection utilizing pretrained image-text co-embeddings, and proposes a more reasonable open-vocabulary evaluation setting for the ActivityNet data set, where the category splits are based on similarity rather than random assignment.

REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory

An end-to-end Retrieval-Augmented Visual Language Model that learns to encode world knowledge into a large-scale memory and to retrieve from it to answer knowledge-intensive queries, achieving state-of-the-art results on visual question answering and image captioning.

Human alignment of neural network representations

Overall, although models trained on larger, more diverse datasets achieve better alignment with humans than models trained on ImageNet alone, the results indicate that scaling alone is unlikely to be sufficient to train neural networks with conceptual representations that match those used by humans.

MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning

This paper proposes a jointly masked multimodal modeling method that achieves state-of-the-art performance on various downstream vision-language tasks, including image-text retrieval, visual question answering, visual reasoning, and weakly-supervised visual grounding.

Generalization Properties of Retrieval-based Models

A formal treatment of retrieval-based models to characterize their generalization ability is presented, and it is shown that breaking down the underlying learning task into local sub-tasks enables the model to employ a low complexity parametric component to ensure good overall accuracy.

ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training

This paper proposes ERNIE-ViL 2.0, a Multi-View Contrastive learning framework to build intra-modal and inter-modal correlations between diverse views simultaneously, aiming at learning a more robust cross-modal representation.

U-BERT for Fast and Scalable Text-Image Retrieval

A U-BERT model is proposed to achieve an effective and efficient cross-modal retrieval of text-image similarity scores based on two independent encoders, with a linear computation complexity.



Sharpness-Aware Minimization for Efficiently Improving Generalization

This work introduces a novel, effective procedure for simultaneously minimizing loss value and loss sharpness, Sharpness-Aware Minimization (SAM), which improves model generalization across a variety of benchmark datasets and models, yielding novel state-of-the-art performance for several.

Big Transfer (BiT): General Visual Representation Learning

By combining a few carefully selected components, and transferring using a simple heuristic, Big Transfer achieves strong performance on over 20 datasets and performs well across a surprisingly wide range of data regimes -- from 1 example per class to 1M total examples.

Classification is a Strong Baseline for Deep Metric Learning

This paper evaluates on several standard image retrieval and clustering datasets, including CAR-196, CUB-200-2011, Stanford Online Product, and In-Shop, and establishes that the classification-based approach is competitive across different feature dimensions and base feature networks.

Microsoft COCO Captions: Data Collection and Evaluation Server

The Microsoft COCO Caption dataset and evaluation server are described and several popular metrics, including BLEU, METEOR, ROUGE and CIDEr are used to score candidate captions.

ImageNet: A large-scale hierarchical image database

A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.

DeViSE: A Deep Visual-Semantic Embedding Model

  • In Proceedings of Neural Information Processing Systems, 2013

Crisscrossed Captions: Extended Intramodal and Intermodal Semantic Similarity Judgments for MS-COCO

By supporting multi-modal retrieval training and evaluation, image captioning datasets have spurred remarkable progress on representation learning. Unfortunately, datasets have limited cross-modal

Self-Training With Noisy Student Improves ImageNet Classification

We present a simple self-training method that achieves 88.4% top-1 accuracy on ImageNet, which is 2.0% better than the state-of-the-art model that requires 3.5B weakly labeled Instagram images. On

Learning Transferable Visual Models From Natural Language Supervision

It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.

Learning the Best Pooling Strategy for Visual Semantic Embedding

A Generalized Pooling Operator (GPO) is proposed, which learns to automatically adapt itself to the best pooling strategy for different features; it requires no manual tuning while staying effective and efficient, and can serve as a plug-and-play feature aggregation module for standard VSE models.