Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs

Jinguo Zhu, Xizhou Zhu, Wenhai Wang, Xiaohua Wang, Hongsheng Li, Xiaogang Wang, Jifeng Dai
To build an artificial neural network like the biological intelligence system, recent works have unified numerous tasks into a generalist model, which can process various tasks with shared parameters and does not have any task-specific modules. While generalist models achieve promising results on various benchmarks, they suffer performance degradation on some tasks compared with task-specialized models. In this work, we find that interference among different tasks and modalities is the main factor to…

Goal-oriented Autonomous Driving

UniAD is introduced, the first comprehensive framework to date that incorporates full-stack driving tasks in one network; it is designed to leverage the advantages of each module and to provide complementary feature abstractions for agent interaction from a global perspective.

Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks

Extensive experiments on benchmark datasets show that X-FM can significantly outperform existing general foundation models and perform better than or comparable to existing foundation models specifically for language, vision, or vision-language understanding.

Vision Transformer Adapter for Dense Predictions

This work proposes a Vision Transformer Adapter (ViT-Adapter), which can remedy the defects of ViT and achieve comparable performance to vision-specific models by introducing inductive biases via an additional architecture.

ViTPose+: Vision Transformer Foundation Model for Generic Body Pose Estimation

This paper shows the surprisingly good properties of plain vision transformers for body pose estimation from various aspects, namely simplicity in model structure, scalability in model size, flexibility in training paradigm, and transferability of knowledge between models, through a simple baseline model dubbed ViTPose.

Delving into the Devils of Bird's-eye-view Perception: A Review, Evaluation and Recipe

A practical guidebook for improving the performance of BEV perception tasks, covering camera, LiDAR and fusion inputs, is introduced, and future research directions in this area are pointed out.

MoE-Fusion: Instance Embedded Mixture-of-Experts for Infrared and Visible Image Fusion

This work proposes a novel framework with instance embedded Mixture-of-Experts for infrared and visible image fusion, termed MoE-Fusion, which contains an instance embedded MoE group (IE-MoE), an MoE-Decoder, two encoders, and two auxiliary detection networks.



Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks

Results show that the pre-trained model without any tuning can achieve reasonable performance even on novel tasks, and the performance can be improved to a level close to state-of-the-art methods by conducting prompt tuning on 1% of downstream task data.

Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

Despite its simplicity and relatively small-scale training data, OFA achieves new SOTAs in a series of cross-modal tasks while attaining highly competitive performances on uni-modal tasks.

Reparameterizing Convolutions for Incremental Multi-Task Learning without Task Interference

The reparameterization enables the model to learn new tasks without adversely affecting the performance of existing ones and achieves state-of-the-art on two challenging multi-task learning benchmarks, PASCAL-Context and NYUD, and also demonstrates superior incremental learning capability as compared to its close competitors.

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

This work introduces a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks, and applies the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora.

Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference

This work investigates routing strategies at different granularity (token, sentence, task) in MoE models to bypass distillation and suggests that task-level routing (task-MoE) enables us to extract smaller, ready-to-deploy sub-networks from large sparse models.

12-in-1: Multi-Task Vision and Language Representation Learning

This work develops a large-scale, multi-task model that culminates in a single model on 12 datasets from four broad categories of task including visual question answering, caption-based image retrieval, grounding referring expressions, and multimodal verification and shows that finetuning task-specific models from this model can lead to further improvements, achieving performance at or above the state-of-the-art.

Scaling Vision with Sparse Mixture of Experts

This work presents a Vision MoE, a sparse version of the Vision Transformer that is scalable and competitive with the largest dense networks, and proposes an extension to the routing algorithm that can prioritize subsets of each input across the entire batch, leading to adaptive per-image compute.

Perceiver: General Perception with Iterative Attention

This paper introduces the Perceiver – a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets.

Which Tasks Should Be Learned Together in Multi-task Learning?

This framework offers a time-accuracy trade-off and can produce better accuracy using less inference time than not only a single large multi-task neural network but also many single-task networks.

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.