• Corpus ID: 245218534

RegionCLIP: Region-based Language-Image Pretraining

  • Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chengkun Li, Noel C. F. Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, Jianfeng Gao
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification in both zero-shot and transfer learning settings. However, we show that directly applying such models to recognize image regions for object detection leads to poor performance due to a domain shift: CLIP was trained to match an image as a whole to a text description, without capturing the fine-grained alignment between image regions and text spans. To mitigate this issue… 
Simple Open-Vocabulary Object Detection with Vision Transformers
This paper proposes a strong recipe for transferring image-text models to open-vocabulary object detection using a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning.
Localized Vision-Language Matching for Open-vocabulary Object Detection
It is shown that a simple language model is better than a large contextualized language model for detecting novel objects and a consistency-regularization technique to better exploit image-caption pair information is introduced.
Decomposing NeRF for Editing via Feature Field Distillation
This work tackles the problem of semantic scene decomposition of NeRFs to enable query-based local editing of the represented 3D scenes, and distill the knowledge of off-the-shelf, self-supervised 2D image feature extractors into a 3D feature field optimized in parallel to the radiance field.
ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models
Novel language-aware initialization methods are proposed to significantly improve the adaptation performance of language-augmented visual models, and an automatic hyper-parameter tuning toolkit is developed to ensure fairness in model adaptation.
Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks
This work proposes Multimodal Adaptive Distillation (MAD), which adaptively distills useful knowledge from pretrained encoders to cross-modal VL encoders, and introduces an evaluation protocol that includes Visual Commonsense Reasoning (VCR), Visual Entailment (SNLI-VE), and Visual Question Answering (VQA), across a variety of data constraints and conditions of domain shift.
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
This paper proposes a new learning method Oscar (Object-Semantics Aligned Pre-training), which uses object tags detected in images as anchor points to significantly ease the learning of alignments.
UNITER: UNiversal Image-TExt Representation Learning
UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets is introduced, which can power heterogeneous downstream V+L tasks with joint multimodal embeddings.
Zero-Shot Detection via Vision and Language Knowledge Distillation
The goal of this work is to advance zero-shot object detection, which aims to detect novel objects without bounding box or mask annotations, and it proposes ViLD, a training method via Vision and Language knowledge Distillation.
Learning to Generate Scene Graph from Natural Language Supervision
This paper proposes one of the first methods that learn from image-sentence pairs to extract a graphical representation of localized objects and their relationships within an image, known as a scene graph, and designs a Transformer-based model to predict these "pseudo" labels via a masked token prediction task.
Learning Visual Representations with Caption Annotations
It is argued that captioned images are easily crawlable and can be exploited to supervise the training of visual representations, and the proposed hybrid models, with dedicated visual and textual encoders, show that the visual representations learned as a by-product of solving this task transfer well to a variety of target tasks.
Unified Vision-Language Pre-Training for Image Captioning and VQA
VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions and VQA 2.0.
Learning Transferable Visual Models From Natural Language Supervision
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
PreDet: Large-scale weakly supervised pre-training for detection
This work proposes a new large-scale pre-training strategy for detection, where noisy class labels are available for all images but bounding boxes are not, and designs a task that forces bounding boxes with high overlap to have similar representations across different views of an image, compared to non-overlapping boxes.
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss, and it is shown that the scale of the corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
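The dual-encoder contrastive objective described above (used by CLIP and ALIGN) can be sketched as a symmetric InfoNCE loss over a batch of matched image-text embedding pairs. The function below is an illustrative NumPy sketch, not either paper's implementation; the embedding shapes, names, and temperature value are assumptions for demonstration.

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of matched (image, text) embeddings.

    image_emb, text_emb: (batch, dim) arrays; row i of each forms a matched pair.
    """
    # L2-normalize so dot products become cosine similarities.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature   # (batch, batch) similarity matrix
    labels = np.arange(len(logits))      # the matched pair sits on the diagonal

    def cross_entropy(l, y):
        # Row-wise log-softmax, then pick the target column for each row.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

# Toy usage: correctly matched pairs should score a lower loss than shuffled ones.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
print(contrastive_loss(emb, emb) < contrastive_loss(emb, emb[::-1]))
```

The symmetric form (averaging both retrieval directions) is what lets the same training signal serve both image-to-text and text-to-image matching.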
Fine-Grained Image Classification via Combining Vision and Language
  • Xiangteng He, Yuxin Peng
  • Computer Science
    2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2017
The two-stream model combining vision and language (CVL) for learning latent semantic representations is proposed, and it is demonstrated that the CVL approach achieves the best performance on the widely used CUB-200-2011 dataset.