Corpus ID: 244488430

Sparse Fusion for Multimodal Transformers

Authors: Yi Ding, Alex Rich, Mason Wang, Noah Stier, Matthew A. Turk, Pradeep Sen, Tobias Höllerer
Multimodal classification is a core task in human-centric machine learning. We observe that information is highly complementary across modalities; thus, unimodal information can be drastically sparsified prior to multimodal fusion without loss of accuracy. To this end, we present Sparse Fusion Transformers (SFT), a novel multimodal fusion method for transformers that performs comparably to existing state-of-the-art methods while having a greatly reduced memory footprint and computation cost. Key… 
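The sparsify-then-fuse idea from the abstract can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's actual method: it uses L2 token norm as a stand-in saliency score and top-k selection per modality, then fuses by simple concatenation.

```python
import numpy as np

def sparsify_tokens(tokens: np.ndarray, k: int) -> np.ndarray:
    """Keep the k tokens with the largest L2 norm.

    The norm is a hypothetical saliency score for illustration only;
    SFT's actual sparsification criterion may differ.
    """
    scores = np.linalg.norm(tokens, axis=-1)   # one score per token
    keep = np.sort(np.argsort(scores)[-k:])    # top-k indices, original order
    return tokens[keep]

def sparse_fuse(modalities, k: int) -> np.ndarray:
    """Sparsify each modality independently, then concatenate along the token axis."""
    return np.concatenate([sparsify_tokens(m, k) for m in modalities], axis=0)

rng = np.random.default_rng(0)
audio = rng.normal(size=(128, 64))   # 128 audio tokens, dim 64
video = rng.normal(size=(256, 64))   # 256 video tokens, dim 64
fused = sparse_fuse([audio, video], k=8)
print(fused.shape)  # (16, 64)
```

Note how the fused sequence (16 tokens) is far shorter than the raw concatenation (384 tokens); since self-attention cost is quadratic in sequence length, this is where the claimed memory and compute savings would come from.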

Attention Bottlenecks for Multimodal Fusion
This work introduces a novel transformer-based architecture that uses ‘fusion bottlenecks’ for modality fusion at multiple layers, and shows that such a strategy improves fusion performance while reducing computational cost.
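A rough sketch of the bottleneck-fusion idea described above, under stated assumptions: single-head unscaled-parameter attention in NumPy, with a small set of bottleneck tokens that first gather information from each modality and are then read back by each modality, so no modality ever attends to another directly.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    """Single-head scaled dot-product attention."""
    d = queries.shape[-1]
    return softmax(queries @ keys.T / np.sqrt(d)) @ values

def bottleneck_fusion_step(modalities, bottleneck):
    """One fusion step: the bottleneck reads from each modality, then each
    modality reads the updated bottleneck back.

    With N modality tokens and B bottleneck tokens, the cross-modal cost is
    O(N*B) rather than the O(N^2) of full pairwise attention.
    """
    for tokens in modalities:
        bottleneck = attend(bottleneck, tokens, tokens)
    fused = [attend(tokens, bottleneck, bottleneck) for tokens in modalities]
    return fused, bottleneck

rng = np.random.default_rng(0)
audio = rng.normal(size=(128, 64))
video = rng.normal(size=(64, 64))
bn = rng.normal(size=(4, 64))        # 4 bottleneck tokens
fused, bn_out = bottleneck_fusion_step([audio, video], bn)
print(fused[0].shape, bn_out.shape)  # (128, 64) (4, 64)
```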
Integrating Multimodal Information in Large Pretrained Transformers
Fine-tuning MAG-BERT and MAG-XLNet significantly boosts sentiment analysis performance over previous baselines as well as over language-only fine-tuning of BERT and XLNet.
Multimodal Transformer for Unaligned Multimodal Language Sequences
Comprehensive experiments on both aligned and unaligned multimodal time series show that the MulT model outperforms state-of-the-art methods by a large margin, and empirical analysis suggests that the proposed crossmodal attention mechanism in MulT captures correlated crossmodal signals.
Training data-efficient image transformers & distillation through attention
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token, ensuring that the student learns from the teacher through attention.
Scalable Vision Transformers with Hierarchical Pooling
A Hierarchical Visual Transformer (HVT) is proposed which progressively pools visual tokens to shrink the sequence length and hence reduces the computational cost, analogous to the feature maps downsampling in Convolutional Neural Networks (CNNs).
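The progressive token pooling described above can be sketched in a few lines. This is an illustrative analogue (simple average pooling over a window of adjacent tokens), not HVT's actual pooling operator:

```python
import numpy as np

def pool_tokens(tokens: np.ndarray, window: int = 2) -> np.ndarray:
    """Average-pool along the token axis, shrinking sequence length by `window`.

    Analogous to feature-map downsampling in CNNs: with window=2 the
    sequence halves, so subsequent attention layers cost roughly 4x less.
    """
    n, d = tokens.shape
    n_trim = n - n % window                      # drop a ragged tail, if any
    groups = tokens[:n_trim].reshape(n_trim // window, window, d)
    return groups.mean(axis=1)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 64))   # e.g. a 14x14 patch grid, dim 64
pooled = pool_tokens(tokens, window=2)
print(pooled.shape)  # (98, 64)
```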
Multimodal Machine Learning: A Survey and Taxonomy
This paper surveys the recent advances in multimodal machine learning and presents them in a common taxonomy to enable researchers to better understand the state of the field and identify directions for future research.
Augmentation Strategies for Learning with Noisy Labels
This paper proposes and examines multiple augmentation strategies for algorithms tackling the "learning with noisy labels" problem, improving accuracy on the CIFAR-10 benchmark at 90% symmetric noise by more than 15 absolute percentage points and improving performance on the Clothing1M dataset.
Convolutional Two-Stream Network Fusion for Video Action Recognition
A new ConvNet architecture for spatiotemporal fusion of video snippets is proposed, and its performance on standard benchmarks where this architecture achieves state-of-the-art results is evaluated.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
Attention is All you Need
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.