• Corpus ID: 246608142

Webly Supervised Concept Expansion for General Purpose Vision Models

Amita Kamath, Christopher Clark, Tanmay Gupta, Eric Kolve, Derek Hoiem, Aniruddha Kembhavi
General Purpose Vision (GPV) systems are models designed to solve a wide array of visual tasks without requiring architectural changes. Today, GPVs primarily learn both skills and concepts from large fully supervised datasets. Scaling GPVs to tens of thousands of concepts by acquiring data to learn each concept for every skill quickly becomes prohibitive. This work presents an effective and inexpensive alternative: learn skills from supervised datasets, learn concepts from web image…
GRIT: General Robust Image Task Benchmark
The General Robust Image Task (GRIT) benchmark is introduced, providing a platform for thorough assessment of the skills and concepts learned by a vision model and catalyzing the development of performant and robust general purpose vision systems.
A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge
This work introduces A-OKVQA, a crowdsourced dataset composed of a diverse set of about 25K questions requiring a broad base of commonsense and world knowledge to answer, and demonstrates the potential of this new dataset through a detailed analysis of its contents and baseline performance measurements over a variety of state-of-the-art vision–language models.


nocaps: novel object captioning at scale
This work presents ‘nocaps’, the first large-scale benchmark for novel object captioning, consisting of 166,100 human-generated captions describing 15,100 images from the Open Images validation and test sets, and provides analysis to guide future work.
Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
We present a new dataset of image caption annotations, Conceptual Captions, which contains an order of magnitude more images than the MS-COCO dataset (Lin et al., 2014) and represents a wider variety
Learning to Detect Human-Object Interactions
Experiments demonstrate that the proposed Human-Object Region-based Convolutional Neural Networks (HO-RCNN), by exploiting human-object spatial relations through Interaction Patterns, significantly improves the performance of HOI detection over baseline approaches.
HICO: A Benchmark for Recognizing Human-Object Interactions in Images
An in-depth analysis of representative current approaches is performed and it is shown that DNNs enjoy a significant edge and that semantic knowledge can significantly improve HOI recognition, especially for uncommon categories.
The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection (2018)
Towards General Purpose Vision Systems
GPV-1 is proposed, a task-agnostic vision-language architecture that can learn and perform tasks that involve receiving an image and producing text and/or bounding boxes, including classification, localization, visual question answering, captioning, and more.
VinVL: Making Visual Representations Matter in Vision-Language Models
This paper develops an improved object detection model to provide object-centric representations of images, feeds the resulting visual features into a Transformer-based vision-language (VL) fusion model, OSCAR, and utilizes an improved approach, OSCAR+, to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks.
End-to-End Object Detection with Transformers
This work presents a new method that views object detection as a direct set prediction problem, and demonstrates accuracy and run-time performance on par with the well-established and highly optimized Faster R-CNN baseline on the challenging COCO object detection dataset.
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
This work balances the popular VQA dataset by collecting complementary images such that every question in the balanced dataset is associated not with a single image but with a pair of similar images that yield two different answers to the question.
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
The Visual Genome dataset is presented, which contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects, and represents the densest and largest dataset of image descriptions, objects, attributes, relationships, and question-answer pairs.