Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts

Basil Mustafa, Carlos Riquelme, Joan Puigcerver, Rodolphe Jenatton and Neil Houlsby
Large sparsely activated models have obtained excellent performance in multiple domains. However, such models are typically trained on a single modality at a time. We present the Language-Image MoE, LIMoE, a sparse mixture-of-experts model capable of multimodal learning. LIMoE accepts both images and text simultaneously, while being trained using a contrastive loss. MoEs are a natural fit for a multimodal backbone, since expert layers can learn an appropriate partitioning of modalities. However…
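The contrastive objective pairs each image in a batch with its own caption: matching image-text pairs are pulled together while all other pairings in the batch act as negatives. A minimal NumPy sketch of this symmetric image-text loss (the function name and temperature value are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def image_text_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-text contrastive loss over a batch.

    Matching image/text pairs (same row index) are positives;
    every other pairing in the batch is a negative.
    """
    # L2-normalize so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (batch, batch) similarities
    labels = np.arange(len(logits))           # positives on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image->text and text->image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Lowering the temperature sharpens the softmax, so correctly aligned pairs drive the loss toward zero faster but training becomes more sensitive to hard negatives.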

A Review of Sparse Expert Models in Deep Learning

The concept of sparse expert models is reviewed, a basic description of the common algorithms is provided, the advances in the deep learning era are contextualized, and areas for future work are highlighted.

Higher Cognition: A Mechanical Perspective

Cognition is the acquisition of knowledge by the mechanical process of information flow in a system. In cognition, input is received by the sensory modalities and the output may occur as a motor or…

A Simple Framework for Contrastive Learning of Visual Representations

It is shown that the composition of data augmentations plays a critical role in defining effective predictive tasks, that introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and that contrastive learning benefits from larger batch sizes and more training steps than supervised learning.

Scaling Vision with Sparse Mixture of Experts

This work presents a Vision MoE, a sparse version of the Vision Transformer that is scalable and competitive with the largest dense networks, and proposes an extension to the routing algorithm that can prioritize subsets of each input across the entire batch, leading to adaptive per-image compute.

DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning

DSelect-k is developed: the first continuously differentiable and sparse gate for MoE, based on a novel binary encoding formulation, that can be trained using first-order methods such as stochastic gradient descent and offers explicit control over the number of experts to select.

ST-MoE: Designing Stable and Transferable Sparse Expert Models

This work concludes by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts, or ST-MoE-32B), and achieves state-of-the-art performance in transfer learning.

PolyViT: Co-training Vision Transformers on Images, Videos and Audio

Co-training PolyViT on multiple modalities and tasks leads to a model that is even more parameter-efficient, learns representations that generalize across multiple domains, and is simple and practical to implement.

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

The convolution-free VATT outperforms state-of-the-art ConvNet-based architectures in the downstream tasks and shows the generalizability of the model despite the domain gap between videos and images.

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

This work introduces a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks, and applies the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora.
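The core mechanism is a trainable gating network that, per input, selects only a small top-k subset of experts to evaluate, so model capacity grows with the number of experts while per-example compute stays roughly constant. A rough NumPy sketch of such a layer (shapes, names, and the softmax-over-selected-logits gating are illustrative assumptions, not the paper's exact implementation, which also adds noise and load-balancing terms):

```python
import numpy as np

def moe_layer(x, w_gate, experts, k=2):
    """Sparsely-gated MoE forward pass.

    x:       (batch, d_model) token representations
    w_gate:  (d_model, n_experts) gating network weights
    experts: list of callables, each mapping (n, d_model) -> (n, d_model)
    Only the top-k experts per token are evaluated (sparse activation).
    """
    logits = x @ w_gate                           # (batch, n_experts)
    topk = np.argsort(logits, axis=1)[:, -k:]     # indices of the k best experts
    # Softmax over the selected logits only; unselected experts get zero weight.
    sel = np.take_along_axis(logits, topk, axis=1)
    sel = np.exp(sel - sel.max(axis=1, keepdims=True))
    gates = sel / sel.sum(axis=1, keepdims=True)  # (batch, k) mixing weights
    out = np.zeros_like(x)
    for e_idx, expert in enumerate(experts):
        rows, slots = np.where(topk == e_idx)     # tokens routed to this expert
        if rows.size:
            out[rows] += gates[rows, slots][:, None] * expert(x[rows])
    return out
```

Because each expert only sees the tokens routed to it, adding experts increases parameters without increasing the number of expert evaluations per token beyond k.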

UFO: A UniFied TransfOrmer for Vision-Language Representation Learning

A single UniFied transfOrmer (UFO) is proposed that can process either unimodal or multimodal inputs for vision-language (VL) representation learning, achieving new state of the art on visual question answering, COCO image captioning, and nocaps.

Contrastive Learning of Medical Visual Representations from Paired Images and Text

This work proposes an alternative unsupervised strategy to learn medical visual representations directly from the naturally occurring pairing of images and textual data, and shows that this method leads to image representations that considerably outperform strong baselines in most settings.

Supervised Contrastive Learning

A novel training methodology is proposed that consistently outperforms cross entropy on supervised learning tasks across different architectures and data augmentations; it modifies the batch contrastive loss, which has recently been shown to be very effective at learning powerful representations in the self-supervised setting.
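In the supervised variant, every sample sharing an anchor's class label counts as a positive, rather than only an augmented view of the anchor. A hedged NumPy sketch of such a loss (assumes L2-normalized features and at least one positive per anchor; names and the temperature are illustrative, not the paper's code):

```python
import numpy as np

def supcon_loss(features, labels, temperature=0.1):
    """Supervised contrastive loss: same-label samples are positives.

    features: (n, d) L2-normalized embeddings (at least one positive
              per anchor is assumed)
    labels:   (n,) integer class labels
    """
    n = len(labels)
    sim = features @ features.T / temperature
    logits = sim - sim.max(axis=1, keepdims=True)   # numerical stability
    self_mask = np.eye(n, dtype=bool)
    # Exclude self-similarity from the softmax denominator.
    exp = np.exp(logits) * ~self_mask
    log_prob = logits - np.log(exp.sum(axis=1, keepdims=True))
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    # Average log-probability over each anchor's positives.
    per_anchor = -(log_prob * pos_mask).sum(axis=1) / pos_mask.sum(axis=1)
    return per_anchor.mean()
```

Compared with the self-supervised batch contrastive loss, the only structural change is the positive mask: it is built from labels instead of from augmented-view pairings.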