Corpus ID: 239024736

SSAST: Self-Supervised Audio Spectrogram Transformer

@article{Gong2021SSASTSA,
  title={SSAST: Self-Supervised Audio Spectrogram Transformer},
  author={Yuan Gong and Cheng-I Jeff Lai and Yu-An Chung and James R. Glass},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.09784}
}
Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs) on various vision tasks, thus extending the success of Transformers, which were originally developed for language processing, to the vision domain. A recent study (Gong, Chung, and Glass 2021) showed that a similar methodology can also be applied to the audio domain. Specifically, the Audio… 
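The abstract is truncated above, but the approach it refers to treats an audio spectrogram like an image: the spectrogram is split into patches that are linearly embedded and fed to a Transformer encoder. The sketch below is a minimal, hypothetical PyTorch illustration of that idea (it is not the authors' released code), with a per-patch reconstruction head of the kind a masked-patch self-supervised objective could use. All hyperparameters such as patch size, embedding dimension, and depth are illustrative assumptions, not values from the paper.

```python
# Minimal sketch (assumption-based, not the authors' implementation) of a
# patch-based spectrogram Transformer with a per-patch reconstruction head.
import torch
import torch.nn as nn


class SpectrogramPatchTransformer(nn.Module):
    def __init__(self, patch=16, dim=192, depth=4, heads=3):
        super().__init__()
        # Patch embedding: a strided conv splits the spectrogram into
        # non-overlapping patch x patch tiles and projects each to `dim`.
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Hypothetical head: predict the pixels of each patch, as a
        # masked-patch self-supervised objective could require.
        self.reconstruct = nn.Linear(dim, patch * patch)

    def forward(self, spec):
        # spec: (batch, 1, n_mels, time_frames)
        tokens = self.patch_embed(spec)              # (B, dim, H', W')
        tokens = tokens.flatten(2).transpose(1, 2)   # (B, num_patches, dim)
        encoded = self.encoder(tokens)               # (B, num_patches, dim)
        return self.reconstruct(encoded)             # per-patch reconstruction


# Example: a 128-mel spectrogram with 1008 frames -> 8 x 63 = 504 patches.
model = SpectrogramPatchTransformer()
dummy = torch.randn(2, 1, 128, 1008)  # time axis padded to a multiple of 16
out = model(dummy)
print(out.shape)  # torch.Size([2, 504, 256]): one 16x16 reconstruction per patch
```

The strided convolution plays the same role as ViT-style patch embedding; positional embeddings, patch masking, and the discriminative part of the pretraining objective are omitted to keep the sketch short.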


References

Showing 1-10 of 32 references
Training data-efficient image transformers & distillation through attention
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student distillation strategy specific to transformers that relies on a distillation token to ensure the student learns from the teacher through attention.
Emerging Properties in Self-Supervised Vision Transformers
This paper asks whether self-supervised learning provides Vision Transformers (ViT) with properties that stand out compared to convolutional networks (convnets), and implements DINO, a simple self-supervised method based on self-distillation with no labels.
Contrastive Learning of General-Purpose Audio Representations
This work builds on recent advances in contrastive learning for computer vision and reinforcement learning to design a lightweight, easy-to-implement self-supervised model of audio, and shows that despite its simplicity, the method significantly outperforms previous self-supervised systems.
BEiT: BERT Pre-Training of Image Transformers
A self-supervised vision representation model, BEiT (Bidirectional Encoder representation from Image Transformers), is introduced; experimental results on image classification and semantic segmentation show that the model achieves competitive results compared with previous pre-training methods.
Selfie: Self-supervised Pretraining for Image Embedding
The pretraining technique Selfie (SELF-supervised Image Embedding) generalizes BERT's masked language modeling to continuous data such as images by making use of the Contrastive Predictive Coding loss.
Voxceleb: Large-scale speaker verification in the wild
A very large-scale audio-visual dataset collected from open-source media with a fully automated pipeline is introduced, and different CNN architectures with various aggregation methods and training loss functions are developed and compared, showing that identities can be effectively recognised from voice under various conditions.
wav2vec: Unsupervised Pre-training for Speech Recognition
wav2vec is trained on large amounts of unlabeled audio data, and the resulting representations are then used to improve acoustic model training; it outperforms Deep Speech 2, the best reported character-based system in the literature, while using two orders of magnitude less labeled training data.
Attention is All you Need
A new, simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as demonstrated by its successful application to English constituency parsing with both large and limited training data.
Bag of Tricks for Image Classification with Convolutional Neural Networks
This paper examines a collection of training procedure refinements and empirically evaluates their impact on final model accuracy through ablation studies, showing that combining these refinements significantly improves various CNN models.
Learning from Between-class Examples for Deep Sound Recognition
Experimental results show that BC learning improves performance across various sound recognition networks, datasets, and data augmentation schemes, proving consistently beneficial.