Images Speak in Images: A Generalist Painter for In-Context Visual Learning
@article{Wang2022ImagesSI,
  title={Images Speak in Images: A Generalist Painter for In-Context Visual Learning},
  author={Xinlong Wang and Wen Wang and Yue Cao and Chunhua Shen and Tiejun Huang},
  journal={ArXiv},
  year={2022},
  volume={abs/2212.02499}
}
In-context learning, as a new paradigm in NLP, allows the model to rapidly adapt to various tasks with only a handful of prompts and examples. But in computer vision, the difficulty of in-context learning lies in the fact that tasks vary significantly in their output representations, so it is unclear how to define general-purpose task prompts that the vision model can understand and transfer to out-of-domain tasks.
One Citation
Offsite-Tuning: Transfer Learning without Full Model
- Computer Science · ArXiv · 2023
Offsite-tuning can achieve accuracy comparable to full model fine-tuning while being privacy-preserving and efficient, achieving a 6.5× speedup and 5.6× memory reduction.
References
Showing 1–10 of 55 references
Perceiver IO: A General Architecture for Structured Inputs & Outputs
- Computer Science · ICLR · 2022
The primary focus of this work is generality rather than speed on images; Perceiver IO uses FLOPs comparable to attention-based image classification models, especially for the more compact configuration B pretrained on JFT.
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
- Computer Science · NeurIPS · 2019
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers.
Visual Prompting via Image Inpainting
- Computer Science · ArXiv · 2022
This paper investigates visual prompting: given input-output image example(s) of a new task at test time and a new input image, the goal is to automatically produce the output image, consistent with the given examples. It shows that posing this problem as simple image inpainting turns out to be surprisingly effective.
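The formulation above is concrete enough to sketch in a few lines. Below is a minimal, hypothetical illustration of the grid-prompt construction the abstract describes: the example pair and the query are stitched into one canvas, the answer quadrant is masked, and a pretrained inpainting model fills it in. The `inpaint_model` call and the 2×2 layout details here are illustrative placeholders, not an API or exact layout from the paper.

```python
import numpy as np

def make_visual_prompt(example_input, example_output, query_input):
    """Stitch a 2x2 grid canvas: top row holds the in-context example,
    bottom row holds the query with its answer region masked out.
    Assumes all three images share the same (h, w, c) shape."""
    h, w, c = query_input.shape
    canvas = np.zeros((2 * h, 2 * w, c), dtype=query_input.dtype)
    canvas[:h, :w] = example_input   # top-left: example input
    canvas[:h, w:] = example_output  # top-right: example output
    canvas[h:, :w] = query_input     # bottom-left: query input
    mask = np.zeros((2 * h, 2 * w), dtype=bool)
    mask[h:, w:] = True              # bottom-right: region to inpaint
    return canvas, mask

# `inpaint_model` stands in for any pretrained image inpainting model:
#   canvas, mask = make_visual_prompt(x_ex, y_ex, x_query)
#   filled = inpaint_model(canvas, mask)
#   prediction = filled[h:, w:]      # read the answer off the masked quadrant
```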
Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
- Computer Science · ICML · 2022
Despite its simplicity and relatively small-scale training data, OFA achieves new state-of-the-art results on a series of cross-modal tasks while attaining highly competitive performance on unimodal tasks.
A Unified Sequence Interface for Vision Tasks
- Computer Science · ArXiv · 2022
This work shows that a diverse set of "core" computer vision tasks can also be unified if formulated in terms of a shared pixel-to-sequence interface, and that one can train a neural network with a single model architecture and loss function on all these tasks, with no task-specific customization.
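The shared interface rests on casting structured outputs as discrete token sequences. As a hedged illustration of that idea (following the Pix2Seq-style coordinate quantization this line of work builds on; the bin count and token layout below are illustrative, not the paper's exact scheme):

```python
def box_to_tokens(box, image_size, num_bins=1000):
    """Quantize a bounding box (xmin, ymin, xmax, ymax) in pixels into
    discrete tokens by binning each coordinate over the image extent."""
    w, h = image_size
    xmin, ymin, xmax, ymax = box

    def quantize(value, extent):
        return min(int(value / extent * num_bins), num_bins - 1)

    return [quantize(xmin, w), quantize(ymin, h),
            quantize(xmax, w), quantize(ymax, h)]

# Example on a 640x480 image: every task's output becomes such a token
# list, so one sequence model with one loss can handle detection
# alongside other tasks expressed in the same vocabulary.
tokens = box_to_tokens((32, 24, 320, 240), (640, 480))  # -> [50, 50, 500, 500]
```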
Perceiver: General Perception with Iterative Attention
- Computer Science · ICML · 2021
This paper introduces the Perceiver, a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, yet also scales to hundreds of thousands of inputs, like ConvNets.
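The scaling property the abstract alludes to comes from a fixed-size latent array that cross-attends to the (possibly huge) input, so attention cost grows linearly in input length rather than quadratically. A minimal PyTorch sketch of that bottleneck, with illustrative dimensions rather than the paper's configuration:

```python
import torch
import torch.nn as nn

class LatentCrossAttention(nn.Module):
    """A fixed set of learned latents attends to an arbitrarily long
    input array, decoupling compute from input length (the Perceiver
    bottleneck idea)."""
    def __init__(self, num_latents=64, dim=128, num_heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, inputs):  # inputs: (B, N, dim), N can be huge
        b = inputs.shape[0]
        queries = self.latents.unsqueeze(0).expand(b, -1, -1)  # (B, L, dim)
        out, _ = self.attn(queries, inputs, inputs)  # cost O(L * N), not O(N^2)
        return out  # (B, num_latents, dim): fixed-size summary of the input

# x = torch.randn(2, 50_000, 128)  # e.g., flattened pixels
# z = LatentCrossAttention()(x)    # (2, 64, 128)
```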
Flamingo: a Visual Language Model for Few-Shot Learning
- Computer Science · ArXiv · 2022
It is demonstrated that a single Flamingo model can achieve a new state of the art for few-shot learning, simply by prompting the model with task-specific examples.
Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks
- Computer Science · IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) · 2022
Results show that the pre-trained model without any tuning can achieve reasonable performance even on novel tasks, and the performance can be improved to a level close to state-of-the-art methods by conducting prompt tuning on 1% of downstream task data.
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
- Computer Science · ICLR · 2020
This paper presents a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT), which adopts the simple yet powerful Transformer model as the backbone and extends it to take both visual and linguistic embedded features as input.
UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes
- Computer Science · ArXiv · 2022
UViM is a promising candidate for a unified modeling approach in computer vision, and experimental results suggest that it is near state-of-the-art on three diverse and challenging vision tasks.