Corpus ID: 244488430

Sparse Fusion for Multimodal Transformers

Yi Ding, Alex Rich, Mason Wang, Noah Stier, Matthew A. Turk, Pradeep Sen, Tobias Höllerer
Multimodal classification is a core task in human-centric machine learning. We observe that information is highly complementary across modalities, so unimodal information can be drastically sparsified prior to multimodal fusion without loss of accuracy. To this end, we present Sparse Fusion Transformers (SFT), a novel multimodal fusion method for transformers that performs comparably to existing state-of-the-art methods while greatly reducing memory footprint and computation cost. Key… 
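The abstract's core idea, sparsifying each modality's tokens before fusing them, can be illustrated with a minimal numpy sketch. This is not the paper's exact method; the top-k-by-norm scoring rule and the function names are hypothetical, chosen only to show how per-modality sparsification shrinks the fused sequence:

```python
import numpy as np

def sparsify_tokens(tokens, k):
    """Keep the k tokens with the largest L2 norm (a hypothetical scoring rule)."""
    scores = np.linalg.norm(tokens, axis=1)     # one score per token
    keep = np.sort(np.argsort(scores)[-k:])     # top-k indices, original order kept
    return tokens[keep]

def sparse_fuse(modality_a, modality_b, k):
    """Sparsify each modality independently, then concatenate for joint fusion."""
    return np.concatenate([sparsify_tokens(modality_a, k),
                           sparsify_tokens(modality_b, k)], axis=0)

rng = np.random.default_rng(0)
audio = rng.normal(size=(50, 16))    # 50 audio tokens of dimension 16
video = rng.normal(size=(200, 16))   # 200 video tokens of dimension 16
fused = sparse_fuse(audio, video, k=8)
print(fused.shape)   # (16, 16): 8 tokens per modality instead of 250 in total
```

Because self-attention cost grows quadratically with sequence length, fusing 16 tokens instead of 250 is what yields the claimed memory and compute savings.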

Multimodal Transformer for Unaligned Multimodal Language Sequences
Comprehensive experiments on both aligned and non-aligned multimodal time-series show that the MulT model outperforms state-of-the-art methods by a large margin, and empirical analysis suggests that correlated crossmodal signals can be captured by the proposed crossmodal attention mechanism in MulT.
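The crossmodal attention that MulT builds on can be sketched in a few lines: queries come from one modality while keys and values come from another, so no explicit alignment between the two sequences is needed. This is a single-head sketch with hypothetical names, omitting the learned projections of the full mechanism:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def crossmodal_attention(target, source):
    """Queries from `target`; keys and values from the `source` modality."""
    d = target.shape[-1]
    weights = softmax(target @ source.T / np.sqrt(d))  # (len_target, len_source)
    return weights @ source                            # output aligned with target

rng = np.random.default_rng(1)
language = rng.normal(size=(12, 32))   # 12 language tokens
audio = rng.normal(size=(40, 32))      # 40 unaligned audio tokens
out = crossmodal_attention(language, audio)
print(out.shape)   # (12, 32): one audio-informed vector per language token
```

Note the output length follows the target sequence, which is why the two modalities' sequences may have different lengths and sampling rates.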
Training data-efficient image transformers & distillation through attention
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token, ensuring that the student learns from the teacher through attention.
Scalable Vision Transformers with Hierarchical Pooling
A Hierarchical Visual Transformer (HVT) is proposed which progressively pools visual tokens to shrink the sequence length and hence reduce the computational cost, analogous to feature-map downsampling in Convolutional Neural Networks (CNNs).
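The pooling idea described above can be sketched with plain numpy, assuming a simple max-pool over the token axis between transformer stages (the stride and pooling rule here are illustrative, not HVT's exact design):

```python
import numpy as np

def pool_sequence(tokens, stride=2):
    """Max-pool along the token axis, shrinking sequence length by `stride`."""
    n, d = tokens.shape
    n -= n % stride                     # drop a trailing remainder token if any
    return tokens[:n].reshape(n // stride, stride, d).max(axis=1)

tokens = np.random.default_rng(2).normal(size=(196, 64))  # 14x14 patch tokens
for stage in range(3):                  # pool between transformer stages
    tokens = pool_sequence(tokens)
print(tokens.shape)   # (24, 64): 196 -> 98 -> 49 -> 24 tokens
```

Since attention cost is quadratic in sequence length, halving the token count at each stage cuts the per-layer cost roughly fourfold, mirroring how CNN feature maps shrink with depth.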
Multimodal Machine Learning: A Survey and Taxonomy
This paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy to enable researchers to better understand the state of the field and identify directions for future research.
Convolutional Two-Stream Network Fusion for Video Action Recognition
A new ConvNet architecture for spatiotemporal fusion of video snippets is proposed, and its performance is evaluated on standard benchmarks, where it achieves state-of-the-art results.
Attention is All you Need
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, applying successfully to English constituency parsing with both large and limited training data.
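The attention mechanism at the heart of the Transformer is scaled dot-product attention, softmax(QK^T/√d)V. A minimal single-head numpy version (learned projection matrices stand in for a trained model's weights):

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention: softmax(QK^T / sqrt(d)) V."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                     # token-to-token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # each row sums to 1
    return weights @ v                                # weighted mix of values

rng = np.random.default_rng(3)
x = rng.normal(size=(10, 8))                          # 10 tokens, model dim 8
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)   # (10, 8)
```

The full architecture stacks many such heads in parallel plus feed-forward layers, but every block reduces to this same weighted-average-of-values computation.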
Memory Fusion Network for Multi-view Sequential Learning
A new neural architecture for multi-view sequential learning, the Memory Fusion Network (MFN), explicitly accounts for both view-specific and cross-view interactions and continuously models them through time.
Compressive Transformers for Long-Range Sequence Modelling
The Compressive Transformer is presented, an attentive sequence model which compresses past memories for long-range sequence learning and can model high-frequency speech effectively and can be used as a memory mechanism for RL, demonstrated on an object matching task.
Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping
This work proposes a method based on progressive layer dropping that speeds up the training of Transformer-based language models, achieving the gains not through additional hardware resources but through changes to the model architecture and training technique.
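The gist of layer dropping can be sketched as follows: each residual layer is stochastically skipped during training, with a keep-probability that shrinks with depth and with training progress. The schedule below is a simplified illustration, not the paper's exact formula:

```python
import random

def keep_probability(layer, num_layers, step, total_steps, theta_min=0.5):
    """Simplified schedule: the keep-probability decays with training progress
    and with depth, so deeper layers are skipped more often late in training."""
    theta_t = 1.0 - (1.0 - theta_min) * (step / total_steps)  # 1.0 -> theta_min
    return 1.0 - (layer / num_layers) * (1.0 - theta_t)

def forward(x, layers, step, total_steps):
    """Run a stack of residual layers, stochastically skipping some of them."""
    for i, layer in enumerate(layers, start=1):
        if random.random() < keep_probability(i, len(layers), step, total_steps):
            x = x + layer(x)   # layer executed as a residual update
        # otherwise the identity path is taken and the layer's cost is saved
    return x

layers = [lambda v: 0.1 * v] * 12    # toy stand-ins for transformer layers
print(forward(1.0, layers, step=0, total_steps=1000))  # all layers kept at step 0
```

Because skipped layers cost nothing in either the forward or backward pass, the expected per-step compute drops while the residual connections keep the network trainable.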
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
I3D models considerably improve upon the state of the art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101 after pre-training on Kinetics, and a new Two-Stream Inflated 3D ConvNet based on 2D ConvNet inflation is introduced.