Corpus ID: 244799261

Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks

Authors: Xizhou Zhu, Jinguo Zhu, Hao Li, Xiaoshi Wu, Xiaogang Wang, Hongsheng Li, Xiaohua Wang, Jifeng Dai
Biological intelligence systems of animals perceive the world by integrating information from different modalities and processing it simultaneously for various tasks. In contrast, current machine learning research follows a task-specific paradigm, leading to inefficient collaboration between tasks and high marginal costs of developing perception models for new tasks. In this paper, we present a generic perception architecture named Uni-Perceiver, which processes a variety of modalities and tasks… 
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs
By incorporating the proposed Conditional MoEs, the generalist model Uni-Perceiver can effectively mitigate interference across tasks and modalities, and achieves state-of-the-art results on a series of downstream tasks via prompt tuning on 1% of downstream data.
Vision Transformer Adapter for Dense Predictions
This work proposes a Vision Transformer Adapter (ViT-Adapter), which can remedy the defects of ViT and achieve comparable performance to vision-specific models by introducing inductive biases via an additional architecture.
Flamingo: a Visual Language Model for Few-Shot Learning
It is demonstrated that a single Flamingo model can achieve a new state of the art for few-shot learning, simply by prompting the model with task-specific examples.
Backbone is All Your Need: A Simplified Architecture for Visual Object Tracking
This work presents SimTrack, a simplified tracking architecture that leverages a transformer backbone for joint feature extraction and interaction, and proposes a foveal window strategy that provides more diverse input patches at acceptable computational cost.
Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
The goal of this work is to build flexible video-language models that can generalize to various video-to-text tasks from few examples, such as domain-specific captioning, question answering, and future event prediction; the resulting models outperform state-of-the-art supervised models trained on any video dataset.
Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning
This work argues that by using visual clues to bridge large pretrained vision foundation models and language models, image paragraph captioning can be performed without any extra cross-modal training.
Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval
A novel Entity-Graph Enhanced Cross-Modal Pretraining (EGE-CMP) model is proposed for instance-level commodity retrieval, that explicitly injects entity knowledge in both node-based and subgraph-based ways into the multi-modal networks via a self-supervised hybrid-stream transformer, which could reduce the confusion between different object contents.
A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning
This paper evaluates different bias measures and proposes applying retrieval metrics to image-text representations via a bias measurement framework, and investigates debiasing methods, showing that optimizing an adversarial loss via learnable token embeddings minimizes various bias measures without substantially degrading feature representations.
UniT: Multimodal Multitask Learning with a Unified Transformer
UniT, a Unified Transformer model to simultaneously learn the most prominent tasks across different domains, ranging from object detection to natural language understanding and multimodal reasoning, achieves strong performance on each task with significantly fewer parameters.
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
After pre-training on large-scale image-caption pairs, Unicoder-VL is transferred to caption-based image-text retrieval and visual commonsense reasoning with just one additional output layer, demonstrating the power of cross-modal pre-training.
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers.
Unified Vision-Language Pre-Training for Image Captioning and VQA
VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions and VQA 2.0.
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
ViLT is a minimal VLP model, monolithic in the sense that the processing of visual inputs is drastically simplified to the same convolution-free manner in which textual inputs are processed; ViLT is up to 60 times faster than previous VLP models, yet with competitive or better downstream task performance.
Learning to Prompt for Vision-Language Models
Context Optimization (CoOp) is proposed, a simple approach for adapting CLIP-like vision-language models to downstream image recognition that requires as few as one or two shots to beat hand-crafted prompts by a decent margin, and gains significant improvements when using more shots.
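As a rough illustration of the CoOp idea, a set of shared learnable context vectors is prepended to each class-name embedding before it enters the text encoder; only the context vectors are trained. The NumPy sketch below shows just the tensor bookkeeping, not the training loop, and the function name and shapes are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def prepend_learnable_context(ctx, class_embeddings):
    """Prepend shared learnable context vectors to every class-name embedding.

    ctx:              (n_ctx, dim) learnable context vectors
    class_embeddings: (n_cls, n_tok, dim) token embeddings, one row per class name
    returns:          (n_cls, n_ctx + n_tok, dim) prompts fed to the text encoder
    """
    n_cls = class_embeddings.shape[0]
    # Broadcast the shared context across all classes without copying weights.
    ctx_b = np.broadcast_to(ctx, (n_cls,) + ctx.shape)
    return np.concatenate([ctx_b, class_embeddings], axis=1)
```

During training, gradients flow only into `ctx` while the CLIP encoders stay frozen, which is what makes the approach cheap in the few-shot regime.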
Learning Transferable Visual Models From Natural Language Supervision
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
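The pre-training task described here, predicting which caption goes with which image, is a symmetric contrastive objective over a batch of matched (image, text) pairs. A minimal NumPy sketch of such an InfoNCE-style loss follows; the function name and temperature value are illustrative, not CLIP's exact implementation.

```python
import numpy as np

def clip_style_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss: row i of each input is a matched pair.

    image_feats, text_feats: (N, D) arrays of batch features.
    """
    # L2-normalize so dot products are cosine similarities.
    img = image_feats / np.linalg.norm(image_feats, axis=-1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)

    # (N, N) similarity matrix; the diagonal holds the correct pairings.
    logits = img @ txt.T / temperature
    n = logits.shape[0]

    def cross_entropy_diag(l):
        # Numerically stable softmax cross-entropy with targets 0..N-1.
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_prob[np.arange(n), np.arange(n)].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

The loss approaches zero when each image embedding is far more similar to its own caption than to any other caption in the batch.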
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
The convolution-free VATT outperforms state-of-the-art ConvNet-based architectures in the downstream tasks and shows the generalizability of the model despite the domain gap between videos and images.
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
The LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework, a large-scale Transformer model consisting of three encoders, achieves state-of-the-art results on two visual question answering datasets and shows the generalizability of the pre-trained cross-modality model.
CLIP-Adapter: Better Vision-Language Models with Feature Adapters
This paper shows that there is an alternative path to better vision-language models besides prompt tuning, and proposes CLIP-Adapter, which performs fine-tuning with feature adapters on either the visual or the language branch.
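A feature adapter of the kind described here is essentially a small bottleneck MLP whose output is residually blended with the frozen backbone features. The following NumPy sketch shows the forward pass only; the weights are randomly initialized for illustration (they would be trained in practice), and the blend ratio `alpha` is an assumed hyperparameter.

```python
import numpy as np

def make_adapter(dim=512, reduction=4, seed=0):
    """Randomly initialized bottleneck weights (illustrative; trained in practice)."""
    rng = np.random.default_rng(seed)
    w1 = rng.standard_normal((dim, dim // reduction)) * 0.02
    w2 = rng.standard_normal((dim // reduction, dim)) * 0.02
    return w1, w2

def adapt(feats, w1, w2, alpha=0.2):
    """Residual blend of adapted features with frozen-backbone features."""
    hidden = np.maximum(feats @ w1, 0.0)    # ReLU bottleneck
    adapted = np.maximum(hidden @ w2, 0.0)  # ReLU output of the small MLP
    return alpha * adapted + (1.0 - alpha) * feats
```

Because only the tiny adapter weights are updated, this keeps fine-tuning cheap while leaving the pre-trained backbone untouched; setting `alpha` to 0 recovers the original features exactly.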