UniT: Multimodal Multitask Learning with a Unified Transformer

  title={UniT: Multimodal Multitask Learning with a Unified Transformer},
  author={Ronghang Hu and Amanpreet Singh},
  journal={2021 IEEE/CVF International Conference on Computer Vision (ICCV)},
  • Published 22 February 2021
  • Computer Science
We propose UniT, a Unified Transformer model to simultaneously learn the most prominent tasks across different domains, ranging from object detection to natural language understanding and multimodal reasoning. Based on the transformer encoder-decoder architecture, our UniT model encodes each input modality with an encoder and makes predictions on each task with a shared decoder over the encoded input representations, followed by task-specific output heads. The entire model is jointly trained… 
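The abstract's architecture can be summarized structurally: one encoder per input modality, a single decoder shared across tasks, and a task-specific head on top. The sketch below is only an illustration of that routing, not the authors' implementation; the class name, the toy stand-in "layers" (simple scaled transforms in place of real transformer blocks), and all parameter choices are assumptions made for clarity.

```python
# Structural sketch of UniT-style routing: per-modality encoders,
# a shared decoder over the concatenated encoded sequences, and
# task-specific output heads. Toy transforms stand in for transformer layers.

class UniTSketch:
    def __init__(self, tasks, modalities, hidden_dim=4):
        self.hidden_dim = hidden_dim
        # One encoder per input modality (e.g. "image", "text"), reused by all tasks.
        self.encoders = {m: self._make_layer(seed=i)
                         for i, m in enumerate(modalities)}
        # A single decoder shared by every task.
        self.decoder = self._make_layer(seed=99)
        # One small output head per task.
        self.heads = {t: self._make_layer(seed=100 + i)
                      for i, t in enumerate(tasks)}

    def _make_layer(self, seed):
        # Stand-in for a transformer block: scales every token vector.
        scale = 0.1 * (seed % 7 + 1)
        return lambda seq: [[scale * x for x in tok] for tok in seq]

    def forward(self, task, inputs):
        # inputs: {modality_name: list of token vectors}
        encoded = []
        for modality, seq in inputs.items():
            encoded.extend(self.encoders[modality](seq))  # encode each modality
        decoded = self.decoder(encoded)   # shared decoder over the joint sequence
        return self.heads[task](decoded)  # task-specific output head


model = UniTSketch(tasks=["vqa", "detection"], modalities=["image", "text"])
out = model.forward("vqa", {"image": [[1.0] * 4] * 3, "text": [[1.0] * 4] * 2})
```

The key point the sketch captures is the sharing pattern: swapping `task` changes only the head applied at the end, while the encoders and decoder (the bulk of the parameters in the real model) are reused across all tasks.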


MulT: An End-to-End Multitask Learning Transformer

We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks, including depth estimation, semantic segmentation, and reshading.

MTFormer: Multi-task Learning via Transformer and Cross-Task Reasoning

It is demonstrated that models with transformer structures are more appropriate for MTL than convolutional neural networks (CNNs), and a novel transformer-based architecture named MTFormer is proposed, which achieves state-of-the-art results with limited network parameters and computations.

Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

UNIFIED-IO is the first model capable of performing all 7 tasks on the GRIT benchmark and produces strong results across 16 diverse benchmarks like NYUv2-Depth, ImageNet, VQA2.0, OK-VQA, Swig, VizWizGround, BoolQ, and SciTail, with no task-specific fine-tuning.

Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text

The experiments show that the resultant unified foundation transformer works surprisingly well on both the vision-only and text-only tasks, and the proposed knowledge distillation and gradient masking strategy can effectively lift the performance to approach the level of separately-trained models.

Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks

Results show that the pre-trained model without any tuning can achieve reasonable performance even on novel tasks, and the performance can be improved to a level close to state-of-the-art methods by conducting prompt tuning on 1% of downstream task data.

Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training

It is discovered that sharing parameters leads to semantic concepts from different modalities being encoded more closely in the embedding space, facilitating the transferring of common semantic structure from language to vision.

M3ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design

A model-accelerator co-design framework to enable efficient on-device MTL, which tackles both training and inference bottlenecks and achieves higher accuracies than encoder-focused MTL methods while reducing inference FLOPs by 88%.

GPPF: A General Perception Pre-training Framework via Sparsely Activated Multi-Task Learning

These solid experimental results fully prove the effective knowledge learning, storing, sharing, and transfer provided by the novel GPPF framework.

Multi-Task Learning with Multi-query Transformer for Dense Prediction

This work proposes a simpler pipeline named Multi-Query Transformer (MQTransformer) that is equipped with multiple queries from different tasks to facilitate the reasoning among multiple tasks and simplify the cross-task pipeline.

Bridge-Tower: Building Bridges Between Encoders in Vision-Language Representation Learning

This paper introduces multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the cross-modal encoder, which enables comprehensive bottom-up interactions between visual and textual representations at different semantic levels, resulting in more effective cross-modal alignment and fusion.



OmniNet: A unified architecture for multi-modal multi-task learning

An extended and unified architecture that can be used for tasks involving a variety of modalities, such as images, text, and video, is introduced, and a spatio-temporal cache mechanism is proposed that enables learning the spatial dimension of the input in addition to the hidden states corresponding to the temporal input sequence.

LXMERT: Learning Cross-Modality Encoder Representations from Transformers

The LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework, a large-scale Transformer model that consists of three encoders, achieves the state-of-the-art results on two visual question answering datasets and shows the generalizability of the pre-trained cross-modality model.

Multi-Task Learning with Deep Neural Networks: A Survey

An overview of multi-task learning methods for deep neural networks is given, with the aim of summarizing both the well-established and most recent directions within the field.

Multi-Task Deep Neural Networks for Natural Language Understanding

A Multi-Task Deep Neural Network (MT-DNN) for learning representations across multiple natural language understanding (NLU) tasks that allows domain adaptation with substantially fewer in-domain labels than the pre-trained BERT representations.

Many Task Learning With Task Routing

This paper introduces Many Task Learning (MaTL) as a special case of MTL where more than 20 tasks are performed by a single model and applies a conditional feature-wise transformation over the convolutional activations that enables a model to successfully perform a large number of tasks.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

12-in-1: Multi-Task Vision and Language Representation Learning

This work develops a large-scale, multi-task model that culminates in a single model on 12 datasets from four broad categories of task including visual question answering, caption-based image retrieval, grounding referring expressions, and multimodal verification and shows that finetuning task-specific models from this model can lead to further improvements, achieving performance at or above the state-of-the-art.

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.

Deep multi-task learning with low level tasks supervised at lower layers

It is consistently better to have POS supervision at the innermost rather than the outermost layer, and it is argued that "low-level" tasks are better kept at the lower layers, enabling the higher-level tasks to make use of the shared representation of the lower-level tasks.

A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks

A joint many-task model together with a strategy for successively growing its depth to solve increasingly complex tasks and uses a simple regularization term to allow for optimizing all model weights to improve one task’s loss without exhibiting catastrophic interference of the other tasks.