• Corpus ID: 233004347

Towards General Purpose Vision Systems

@article{Gupta2021TowardsGP,
  title={Towards General Purpose Vision Systems},
  author={Tanmay Gupta and Amita Kamath and Aniruddha Kembhavi and Derek Hoiem},
  journal={ArXiv},
  year={2021},
  volume={abs/2104.00743}
}
Computer vision systems today are primarily N-purpose systems, designed and trained for a predefined set of tasks. Adapting such systems to new tasks is challenging and often requires non-trivial modifications to the network architecture (e.g. adding new output heads) or training process (e.g. adding new losses). To reduce the time and expertise required to develop new applications, we would like to create general purpose vision systems that can learn and perform a range of tasks without any…
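
To make the idea concrete, below is a minimal Python sketch of the kind of single task-agnostic interface the abstract argues for: every task is posed as an image plus a natural-language task description, and every task returns the same output structure (text and/or boxes), so supporting a new task needs no new output heads or losses. The class and field names are illustrative, not the paper's actual API.

    from dataclasses import dataclass, field
    from typing import Any, List, Tuple

    @dataclass
    class TaskRequest:
        image: Any        # e.g. an H x W x 3 array; the image type is left abstract here
        prompt: str       # natural-language description of the task to perform

    @dataclass
    class TaskOutput:
        text: str = ""    # answer, caption, or other generated text
        boxes: List[Tuple[float, float, float, float]] = field(default_factory=list)  # normalized xyxy boxes
        scores: List[float] = field(default_factory=list)                             # per-box confidences

    def run_task(model, request: TaskRequest) -> TaskOutput:
        """One entry point for every task: the model reads the prompt and
        decides whether to emit text, boxes, or both."""
        return model(request)

    # The same call covers VQA, captioning, and localization, for example:
    #   run_task(model, TaskRequest(image=img, prompt="What color is the car?"))
    #   run_task(model, TaskRequest(image=img, prompt="Describe the image."))
    #   run_task(model, TaskRequest(image=img, prompt="Locate every dog."))
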
Vision-Language Intelligence: Tasks, Representation Learning, and Large Models
TLDR
The development in this field is summarized into three time periods, namely task-specific methods, vision-language pre-training (VLP) methods, and larger models empowered by large-scale weakly-labeled data.
Multi-modal Transformers Excel at Class-agnostic Object Detection
TLDR
This paper argues that existing methods lack a top-down supervision signal governed by human-understandable semantics, and develops an efficient and flexible MViT architecture that uses multi-scale feature processing and deformable self-attention to adaptively generate proposals given a specific language query.
FindIt: Generalized Localization with Natural Language Queries
TLDR
This work proposes FindIt, a simple and versatile framework that unifies a variety of visual grounding and localization tasks including referring expression comprehension, text-based localization, and object detection, and discovers that a standard object detector is surprisingly effective in unifying these tasks without a need for task-specific design, losses, or pre-computed detections.
Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling
TLDR
A vision-language (VL) model that unifies text generation and bounding box prediction in a single architecture, achieves performance comparable to task-specific state of the art on 7 VL benchmarks, and shows the capability of generalizing to new tasks such as ImageNet object localization.
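
As a hedged illustration of how text generation and box prediction can share a single decoder, roughly in the spirit of the summary above (the paper's exact scheme may differ): each normalized box coordinate is quantized into one of a fixed number of bins and emitted as an ordinary vocabulary token, so boxes and words come from the same output space. The token format and bin count below are my own choices.

    NUM_BINS = 1000  # resolution of the coordinate vocabulary (illustrative choice)

    def box_to_tokens(box, num_bins=NUM_BINS):
        """box = (x1, y1, x2, y2) with coordinates normalized to [0, 1];
        returns four discrete coordinate tokens the text decoder can emit."""
        return ["<bin_%d>" % min(int(c * num_bins), num_bins - 1) for c in box]

    def tokens_to_box(tokens, num_bins=NUM_BINS):
        """Invert the quantization (exact up to bin resolution)."""
        return tuple((int(t[len("<bin_"):-1]) + 0.5) / num_bins for t in tokens)

    # Example: a grounding target becomes decoder output like
    # ['<bin_120>', '<bin_305>', '<bin_640>', '<bin_880>'], interleaved with words.
    print(box_to_tokens((0.12, 0.305, 0.64, 0.88)))
    print(tokens_to_box(box_to_tokens((0.12, 0.305, 0.64, 0.88))))
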
Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering
TLDR
This work proposes a novel and simple recipe to pre-train a multi-task vision-language joint model, and observes that the proposed approach is able to generalize to unseen tasks and that more diverse training mixtures lead to higher accuracy on both known and novel tasks.
Conditional Object-Centric Learning from Video
TLDR
Using the temporal dynamics of video data in the form of optical flow and conditioning the model on simple object location cues enables segmenting and tracking objects in significantly more realistic synthetic data, which could pave the way for a range of weakly-supervised approaches and allow more effective interaction with trained models.
Cross-Task Generalization via Natural Language Crowdsourcing Instructions
TLDR
This work introduces NATURALINSTRUCTIONS, a dataset of 61 distinct tasks, their human-authored instructions and 193k task instances, and adopts generative pre-trained language models to encode task-specific instructions along with input and generate task output.
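
A minimal sketch of the instruction-conditioning recipe that summary describes: the task's natural-language instruction is concatenated with each instance's input and handed to a generative model, which is asked to produce the output. The field layout below is illustrative, not the dataset's exact schema.

    def build_prompt(instruction: str, instance_input: str) -> str:
        """Pack a task instruction and one instance into a single input string
        for a pre-trained sequence-to-sequence model."""
        return f"Instruction: {instruction}\nInput: {instance_input}\nOutput:"

    prompt = build_prompt(
        "Answer the question using only the given passage.",
        "Passage: ...\nQuestion: ...",
    )
    # Because the task is fully specified in text, the same model can be
    # evaluated on tasks it never saw during training, purely via their instructions.
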
VUT: Versatile UI Transformer for Multi-Modal Multi-Task User Interface Modeling
TLDR
VUT is a Versatile UI Transformer that takes multimodal input and simultaneously accomplishes 5 distinct tasks with the same model, reducing the number of models and footprints needed for performing multiple tasks, while achieving accuracy exceeding or on par with baseline models trained for each individual task.
Learning to Solve Complex Tasks by Talking to Agents
TLDR
This work proposes a new benchmark called COMMAQA that contains three kinds of complex reasoning tasks designed to be solved by “talking” to four agents with different capabilities; the authors hope it serves as a novel benchmark to enable the development of “green” AI systems that build upon existing agents.

References

SHOWING 1-10 OF 90 REFERENCES
Learning Transferable Visual Models From Natural Language Supervision
TLDR
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
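
As a hedged illustration of the pre-training task that summary describes (predicting which caption goes with which image), here is a minimal PyTorch sketch of a symmetric contrastive loss over a batch of paired image and text embeddings. It assumes the two encoders already exist; it is an illustration, not CLIP's released implementation.

    import torch
    import torch.nn.functional as F

    def contrastive_pairing_loss(image_emb, text_emb, temperature=0.07):
        """image_emb, text_emb: (N, D) embeddings of N matched image/caption pairs.
        Each image's positive is its own caption; every other caption in the
        batch is a negative, and symmetrically for captions."""
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = image_emb @ text_emb.t() / temperature          # (N, N) similarity matrix
        targets = torch.arange(logits.size(0), device=logits.device)
        loss_i2t = F.cross_entropy(logits, targets)              # image -> correct caption
        loss_t2i = F.cross_entropy(logits.t(), targets)          # caption -> correct image
        return (loss_i2t + loss_t2i) / 2

    # Smoke test with random stand-ins for encoder outputs:
    # loss = contrastive_pairing_loss(torch.randn(8, 512), torch.randn(8, 512))
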
12-in-1: Multi-Task Vision and Language Representation Learning
TLDR
This work develops a large-scale multi-task model trained jointly on 12 datasets from four broad categories of tasks, including visual question answering, caption-based image retrieval, grounding referring expressions, and multimodal verification, and shows that finetuning task-specific models from this multi-task model can lead to further improvements, achieving performance at or above the state-of-the-art.
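
A minimal sketch of the multi-task training recipe that summary describes, under my own assumptions about the interfaces: one shared model, several task datasets, and each optimization step draws a batch from one task so all tasks are learned jointly. The model.loss(task, batch) call is a hypothetical API, not the 12-in-1 code.

    import random

    def multitask_train(model, task_loaders, optimizer, num_steps, seed=0):
        """task_loaders: dict mapping task name -> iterator yielding batches."""
        rng = random.Random(seed)
        tasks = list(task_loaders)
        for _ in range(num_steps):
            task = rng.choice(tasks)             # could also sample proportionally to dataset size
            batch = next(task_loaders[task])
            loss = model.loss(task, batch)       # hypothetical: shared trunk, task-aware head/loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
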
UberNet: Training a Universal Convolutional Neural Network for Low-, Mid-, and High-Level Vision Using Diverse Datasets and Limited Memory
  • I. Kokkinos
  • Computer Science
  • 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2017
In this work we train in an end-to-end manner a convolutional neural network (CNN) that jointly handles low-, mid-, and high-level vision tasks in a unified architecture. Such a network can act like…
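
Below is a minimal PyTorch sketch of the shared-trunk idea this entry describes: one backbone feeding several task heads, so low-, mid-, and high-level predictions come from a single network. The layer sizes and the choice of heads are illustrative, not UberNet's actual architecture.

    import torch
    import torch.nn as nn

    class SharedTrunkMultiTask(nn.Module):
        def __init__(self, num_classes=20):
            super().__init__()
            self.trunk = nn.Sequential(                       # features shared by every task
                nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            )
            self.boundary_head = nn.Conv2d(64, 1, 1)          # low-level: boundary detection
            self.saliency_head = nn.Conv2d(64, 1, 1)          # mid-level: saliency
            self.seg_head = nn.Conv2d(64, num_classes, 1)     # high-level: semantic segmentation

        def forward(self, x):
            feats = self.trunk(x)
            return {
                "boundaries": self.boundary_head(feats),
                "saliency": self.saliency_head(feats),
                "segmentation": self.seg_head(feats),
            }

    # outputs = SharedTrunkMultiTask()(torch.randn(1, 3, 64, 64))
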
E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning
TLDR
This paper proposes the first end-to-end vision-language pre-trained model for both V+L understanding and generation, namely E2E-VLP, where a unified Transformer framework is built to jointly learn visual representations and semantic alignments between image and text.
Unified Vision-Language Pre-Training for Image Captioning and VQA
TLDR
VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions and VQA 2.0.
VinVL: Revisiting Visual Representations in Vision-Language Models
TLDR
This paper develops an improved object detection model to provide object-centric representations of images, feeds the generated visual features into a Transformer-based VL fusion model, OSCAR, and uses an improved approach, OSCAR+, to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks.
VisualBERT: A Simple and Performant Baseline for Vision and Language
TLDR
Analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.
Aligned Image-Word Representations Improve Inductive Transfer Across Vision-Language Tasks
TLDR
This paper investigates a vision-language embedding as a core representation and shows that it leads to better cross-task transfer than standard multitask learning and improves visual recognition, especially for categories that have relatively few recognition training labels but appear often in the VQA setting.
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a…