Corpus ID: 244799261

Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks

Xizhou Zhu, Jinguo Zhu, Hao Li, Xiaoshi Wu, Xiaogang Wang, Hongsheng Li, Xiaohua Wang, Jifeng Dai
Biological intelligence systems of animals perceive the world by integrating information in different modalities and processing simultaneously for various tasks. In contrast, current machine learning research follows a task-specific paradigm, leading to inefficient collaboration between tasks and high marginal costs of developing perception models for new tasks. In this paper, we present a generic perception architecture named Uni-Perceiver, which processes a variety of modalities and tasks… 
Flamingo: a Visual Language Model for Few-Shot Learning
It is demonstrated that a single Flamingo model can achieve a new state of the art for few-shot learning, simply by prompting the model with task-specific examples.
Backbone is All Your Need: A Simplified Architecture for Visual Object Tracking
This paper proposes a Simplified Tracking architecture (SimTrack) that leverages a transformer backbone for joint feature extraction and interaction, along with a foveal window strategy that provides more diverse input patches at acceptable computational cost.
A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning
This paper evaluates different bias measures, proposes applying retrieval metrics to image-text representations through a bias-measuring framework, and investigates debiasing methods, showing that optimizing an adversarial loss via learnable token embeddings minimizes various bias measures without substantially degrading feature representations.
Vision Transformer Adapter for Dense Predictions
This work investigates a simple yet powerful adapter for Vision Transformer (ViT). Unlike recent visual transformers that introduce vision-specific inductive biases into their architectures, ViT…


UniT: Multimodal Multitask Learning with a Unified Transformer
UniT, a Unified Transformer model that simultaneously learns the most prominent tasks across different domains, ranging from object detection to natural language understanding and multimodal reasoning, achieves strong performance on each task with significantly fewer parameters.
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
After pretraining on large-scale image-caption pairs, Unicoder-VL is transferred to caption-based image-text retrieval and visual commonsense reasoning, with just one additional output layer, and shows the powerful ability of the cross-modal pre-training.
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a…
Unified Vision-Language Pre-Training for Image Captioning and VQA
VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions and VQA 2.0.
Learning to Prompt for Vision-Language Models
Context Optimization (CoOp) is a simple approach for adapting CLIP-like vision-language models to downstream image recognition; it requires as few as one or two shots to beat hand-crafted prompts by a decent margin and gains significant improvements when more shots are used.
Learning Transferable Visual Models From Natural Language Supervision
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
The LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework, a large-scale Transformer model consisting of three encoders, achieves state-of-the-art results on two visual question answering datasets and shows the generalizability of the pre-trained cross-modality model.
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
The convolution-free VATT outperforms state-of-the-art ConvNet-based architectures in the downstream tasks and sets a new record on waveform-based audio event recognition, showing the generalizability of the model despite the domain gap between videos and images.
CLIP-Adapter: Better Vision-Language Models with Feature Adapters
This paper shows that there is an alternative path to achieve better vision-language models other than prompt tuning, and proposes CLIP-Adapter to conduct fine-tuning with feature adapters on either visual or language branch.
VisualBERT: A Simple and Performant Baseline for Vision and Language
Analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.