Images Speak in Images: A Generalist Painter for In-Context Visual Learning

Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, Tiejun Huang
In-context learning, a new paradigm in NLP, allows a model to rapidly adapt to various tasks with only a handful of prompts and examples. In computer vision, however, in-context learning is difficult because tasks vary significantly in their output representations, so it is unclear how to define general-purpose task prompts that the vision…


Offsite-Tuning: Transfer Learning without Full Model

Offsite-tuning achieves accuracy comparable to full-model fine-tuning while being privacy-preserving and efficient, with a 6.5x speedup and 5.6x memory reduction.



Perceiver IO: A General Architecture for Structured Inputs & Outputs

The primary focus of this work is generality rather than speed on images; Perceiver IO uses FLOPs comparable to attention-based image classification models, especially for the more compact configuration B pretrained on JFT.

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model.

Visual Prompting via Image Inpainting

This paper investigates visual prompting: given input-output image example(s) of a new task at test time and a new input image, the goal is to automatically produce the output image consistent with the given examples. The paper shows that posing this problem as simple image inpainting turns out to be surprisingly effective.
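The core construction behind inpainting-based visual prompting can be sketched as follows: stitch the example pair and the query into a single grid image, mask the unknown quadrant, and hand the canvas to an inpainting model. This is a minimal illustrative sketch, not the paper's exact pipeline; the function name, the 2x2 layout, and the shape conventions are assumptions for illustration.

```python
import numpy as np

def make_visual_prompt(example_in, example_out, query_in):
    """Assemble a 2x2 grid canvas: the task example pair on the top row,
    the query on the bottom row, with the answer cell masked for inpainting.
    All inputs are assumed to be HxWxC arrays of identical shape."""
    h, w, c = example_in.shape
    canvas = np.zeros((2 * h, 2 * w, c), dtype=example_in.dtype)
    canvas[:h, :w] = example_in    # top-left: example input
    canvas[:h, w:] = example_out   # top-right: example output
    canvas[h:, :w] = query_in      # bottom-left: query input
    # Bottom-right cell stays zero; the mask marks it as the region
    # an inpainting model should fill in with the query's output.
    mask = np.zeros((2 * h, 2 * w), dtype=bool)
    mask[h:, w:] = True
    return canvas, mask
```

The inpainted bottom-right quadrant is then read out as the predicted output image for the query, consistent with the example pair above it.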

Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

Despite its simplicity and relatively small-scale training data, OFA achieves new state-of-the-art results on a series of cross-modal tasks while attaining highly competitive performance on unimodal tasks.

A Unified Sequence Interface for Vision Tasks

This work shows that a diverse set of "core" computer vision tasks can also be unified if formulated in terms of a shared pixel-to-sequence interface, and that one can train a neural network with a single model architecture and loss function on all these tasks, with no task-specific customization.

Perceiver: General Perception with Iterative Attention

This paper introduces the Perceiver – a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets.

Flamingo: a Visual Language Model for Few-Shot Learning

It is demonstrated that a single Flamingo model can achieve a new state of the art for few-shot learning, simply by prompting the model with task-specific examples.

Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks

Results show that the pre-trained model without any tuning can achieve reasonable performance even on novel tasks, and the performance can be improved to a level close to state-of-the-art methods by conducting prompt tuning on 1% of downstream task data.

VL-BERT: Pre-training of Generic Visual-Linguistic Representations

A new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT), which adopts the simple yet powerful Transformer model as the backbone, and extends it to take both visual and linguistic embedded features as input.

UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes

UViM is a promising candidate for a unified modeling approach in computer vision and the experimental results suggest that it is near state-of-the-art on three diverse and challenging vision tasks.