Corpus ID: 233004347

Towards General Purpose Vision Systems

@article{Gupta2021TowardsGP,
  title={Towards General Purpose Vision Systems},
  author={Tanmay Gupta and Amita Kamath and Aniruddha Kembhavi and Derek Hoiem},
  journal={ArXiv},
  year={2021},
  volume={abs/2104.00743}
}
Computer vision systems today are primarily N-purpose systems, designed and trained for a predefined set of tasks. Adapting such systems to new tasks is challenging and often requires non-trivial modifications to the network architecture (e.g. adding new output heads) or training process (e.g. adding new losses). To reduce the time and expertise required to develop new applications, we would like to create general purpose vision systems that can learn and perform a range of tasks without any…
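A minimal sketch of what such a task-agnostic interface could look like, assuming the GPV-style setup in which every task is posed as an image plus a natural-language task description and the output structure (text plus optional boxes) never changes. The class and field names below are hypothetical illustrations, not the paper's code.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical task-agnostic interface: every task is an image plus a
# natural-language task description; the output structure never changes,
# so adding a new task requires no new heads or losses.

@dataclass
class GPVOutput:
    text: str                                        # free-form answer / caption / label
    boxes: List[Tuple[float, float, float, float]]   # optional localization output
    relevance: List[float]                           # per-box relevance scores

class GeneralPurposeVisionModel:
    def predict(self, image, task: str) -> GPVOutput:
        # A real system would encode the image and the task string jointly
        # and decode text and boxes; this stub only fixes the interface.
        raise NotImplementedError

# The same call signature covers very different tasks:
# model.predict(img, "What color is the car?")   -> VQA
# model.predict(img, "Describe the image.")      -> captioning
# model.predict(img, "Locate all dogs.")         -> localization
```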
Class-agnostic Object Detection with Multi-modal Transformer
TLDR
This paper argues that existing methods lack a top-down supervision signal governed by human-understandable semantics and demonstrates that Multi-modal Vision Transformers (MViT) trained with aligned image-text pairs can effectively bridge this gap.
Vision-Language Intelligence: Tasks, Representation Learning, and Large Models
TLDR
The development in this field is summarized into three time periods, namely task-specific methods, vision-language pre-training (VLP) methods, and larger models empowered by large-scale weakly-labeled data.
Multi-modal Transformers Excel at Class-agnostic Object Detection
TLDR
This paper argues that existing methods lack a top-down supervision signal governed by human-understandable semantics and develops an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention that can adaptively generate proposals given a specific language query.
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
TLDR
This work presents FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both coarse-grained (image-level) and fine-grained (region-level) tasks, and provides consistent performance improvements over strong baselines across all tasks, often outperforming methods that use orders of magnitude more data.
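The "fusion in the backbone" idea lends itself to a short sketch: instead of stacking a separate fusion module on top of the unimodal encoders, cross-attention to the other modality is inserted inside the backbone blocks themselves. The block below is a hedged illustration of that pattern; the dimensions, module layout, and the choice to fuse in every block are assumptions, not FIBER's actual design.

```python
import torch
import torch.nn as nn

class FusedBackboneBlock(nn.Module):
    """One backbone block with an optional cross-attention branch to the
    other modality (a sketch of fusion-in-the-backbone; dims are made up)."""

    def __init__(self, dim: int = 768, heads: int = 12, fuse: bool = True):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True) if fuse else None
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, other=None):
        # Standard self-attention within the modality.
        q = self.norm1(x)
        x = x + self.self_attn(q, q, q)[0]
        # Cross-attention to the other modality, only in "fusion" blocks.
        if self.cross_attn is not None and other is not None:
            x = x + self.cross_attn(self.norm2(x), other, other)[0]
        return x + self.ffn(self.norm3(x))

# Usage: image tokens attend to text tokens (and vice versa) inside the backbone.
img = torch.randn(2, 196, 768)   # 2 images, 196 patch tokens
txt = torch.randn(2, 32, 768)    # 2 captions, 32 word tokens
img_block, txt_block = FusedBackboneBlock(), FusedBackboneBlock()
img_out = img_block(img, other=txt)
txt_out = txt_block(txt, other=img)
```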
Webly Supervised Concept Expansion for General Purpose Vision Models
TLDR
GPV-2 is proposed, a new architecture that supports a variety of tasks – from vision tasks like classification and localization to vision+language tasks like QA and captioning to more niche ones like human-object interaction recognition.
FindIt: Generalized Localization with Natural Language Queries
TLDR
This work proposes FindIt, a simple and versatile framework that unifies a variety of visual grounding and localization tasks including referring expression comprehension, text-based localization, and object detection, and discovers that a standard object detector is surprisingly effective in unifying these tasks without a need for task-specific design, losses, or pre-computed detections.
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models
TLDR
It is shown that PEVL enables state-of-the-art performance of detector-free VLP models on position-sensitive tasks such as referring expression comprehension and phrase grounding, and also improves the performance on position-insensitive tasks with grounded inputs.
Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling
TLDR
A vision-language (VL) model is proposed that unifies text generation and bounding box prediction in a single architecture, achieves performance comparable to task-specific state of the art on 7 VL benchmarks, and shows the capability of generalizing to new tasks such as ImageNet object localization.
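One common way to cross the text/box format boundary, and a reasonable reading of this line of work, is to discretize box coordinates into special tokens so a single text decoder can emit words and boxes in one output sequence. The sketch below illustrates that serialization; the bin count and token naming are illustrative assumptions, not the paper's exact scheme.

```python
# Hedged sketch: quantize box coordinates into special tokens so a text
# decoder can emit words and boxes in a single sequence.

NUM_BINS = 1000  # assumed coordinate vocabulary size

def box_to_tokens(box, img_w, img_h, num_bins=NUM_BINS):
    """Map (x1, y1, x2, y2) in pixels to discrete coordinate tokens."""
    x1, y1, x2, y2 = box
    norm = [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h]
    return [f"<coord_{min(int(v * num_bins), num_bins - 1)}>" for v in norm]

def tokens_to_box(tokens, img_w, img_h, num_bins=NUM_BINS):
    """Invert the quantization (up to bin resolution)."""
    vals = [int(t.strip("<>").split("_")[1]) / num_bins for t in tokens]
    return (vals[0] * img_w, vals[1] * img_h, vals[2] * img_w, vals[3] * img_h)

# A grounded caption then becomes one flat target sequence, e.g.
# "a dog <coord_120> <coord_433> <coord_310> <coord_620> chasing a ball"
caption_with_box = ["a", "dog"] + box_to_tokens((77, 277, 198, 397), 640, 640) + ["chasing", "a", "ball"]
print(caption_with_box)
```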
Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering
TLDR
This work proposes a novel and simple recipe to pre-train a vision-language joint model that is also multi-task, and observes that the proposed approach is able to generalize to unseen tasks and that more diverse mixtures lead to higher accuracy on both known and novel tasks.
Conditional Object-Centric Learning from Video
TLDR
Using the temporal dynamics of video data, in the form of optical flow, and conditioning the model on simple object location cues enables segmenting and tracking objects in significantly more realistic synthetic data, which could pave the way for a range of weakly-supervised approaches and allow more effective interaction with trained models.
...
...

References

Learning Transferable Visual Models From Natural Language Supervision
TLDR
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
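The pre-training task of predicting which caption goes with which image reduces to a symmetric contrastive loss over a batch of image and text embeddings. Below is a minimal sketch of that objective, loosely following the pseudocode published with CLIP; the encoders themselves are assumed to exist and only the loss is shown.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss: the i-th image should match the i-th text.

    image_emb, text_emb: (batch, dim) outputs of separate encoders.
    The encoders are assumed; only the objective is sketched here.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature      # (batch, batch) similarities
    targets = torch.arange(logits.size(0))               # matching pairs sit on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)          # image -> correct caption
    loss_t2i = F.cross_entropy(logits.t(), targets)      # caption -> correct image
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random embeddings standing in for encoder outputs.
loss = clip_style_loss(torch.randn(8, 512), torch.randn(8, 512))
```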
12-in-1: Multi-Task Vision and Language Representation Learning
TLDR
This work develops a large-scale multi-task model trained on 12 datasets from four broad categories of tasks, including visual question answering, caption-based image retrieval, grounding referring expressions, and multimodal verification, and shows that finetuning task-specific models from this single model can lead to further improvements, achieving performance at or above the state of the art.
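A hedged sketch of the multi-task training pattern behind such models: a shared trunk, a lightweight head per dataset, and batches drawn from the different datasets in turn. The round-robin schedule below is a simplification; the actual recipe also balances dataset sizes and uses dynamic stop-and-go task scheduling, which is omitted here.

```python
import itertools

def multitask_training(shared_model, task_heads, task_loaders, optimizer, steps=10000):
    """Round-robin multi-task training over heterogeneous datasets.

    task_heads / task_loaders: dicts keyed by task name (e.g. 'vqa', 'retrieval').
    shared_model is the common vision-and-language trunk; each head returns a loss.
    """
    iterators = {name: itertools.cycle(loader) for name, loader in task_loaders.items()}
    task_names = list(task_loaders)
    for step in range(steps):
        task = task_names[step % len(task_names)]            # simple round-robin schedule
        batch = next(iterators[task])
        features = shared_model(batch["inputs"])              # shared trunk
        loss = task_heads[task](features, batch["targets"])   # task-specific head + loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```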
UberNet: Training a Universal Convolutional Neural Network for Low-, Mid-, and High-Level Vision Using Diverse Datasets and Limited Memory
  • Iasonas Kokkinos
  • Computer Science
    2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2017
In this work we train in an end-to-end manner a convolutional neural network (CNN) that jointly handles low-, mid-, and high-level vision tasks in a unified architecture. Such a network can act like…
E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning
TLDR
This paper proposes the first end-to-end vision-language pre-trained model for both V+L understanding and generation, namely E2E-VLP, where a unified Transformer framework is built to jointly learn visual representations and semantic alignments between image and text.
Unifying Vision-and-Language Tasks via Text Generation
TLDR
This work proposes a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where the models learn to generate labels in text based on the visual and textual inputs.
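The unification here is largely an input/output formatting choice: every task becomes a text prefix (plus visual features), and the label is always generated as text and trained with the same language-modeling loss. A hedged sketch of that task-to-text casting follows; the prompt wording is illustrative, not the paper's exact templates.

```python
# Hedged sketch of casting heterogeneous V+L tasks into a single
# text-generation format: prefix + inputs -> free-form text label.

def to_text2text(task: str, question: str = "", region_tag: str = "") -> dict:
    if task == "vqa":
        return {"source": f"vqa: question: {question}", "target_example": "blue"}
    if task == "captioning":
        return {"source": "caption:", "target_example": "a dog chasing a ball"}
    if task == "grounding":
        # The answer is the tag of a visual region, still emitted as text.
        return {"source": f"visual grounding: {question}", "target_example": region_tag}
    raise ValueError(f"unknown task: {task}")

# Every example, regardless of task, is trained with the same
# language-modeling loss on the target text.
print(to_text2text("vqa", question="what color is the sky?"))
print(to_text2text("grounding", question="the man in the red hat", region_tag="<vis_3>"))
```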
Unified Vision-Language Pre-Training for Image Captioning and VQA
TLDR
VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions and VQA 2.0.
VinVL: Revisiting Visual Representations in Vision-Language Models
TLDR
This paper develops an improved object detection model to provide object-centric representations of images, feeds the generated visual features into a Transformer-based VL fusion model OSCAR, and utilizes an improved approach OSCAR+ to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks.
VinVL: Making Visual Representations Matter in Vision-Language Models
TLDR
This paper develops an improved object detection model to provide object-centric representations of images, feeds the generated visual features into a Transformer-based VL fusion model OSCAR, and utilizes an improved approach OSCAR+ to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks.
VisualBERT: A Simple and Performant Baseline for Vision and Language
TLDR
Analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.
Aligned Image-Word Representations Improve Inductive Transfer Across Vision-Language Tasks
TLDR
This paper investigates a vision-language embedding as a core representation and shows that it leads to better cross-task transfer than standard multitask learning and improves visual recognition, especially for categories that have relatively few recognition training labels but appear often in the VQA setting.
...
...