SepTr: Separable Transformer for Audio Spectrogram Processing

Nicolae-Catalin Ristea, Radu Tudor Ionescu, Fahad Shahbaz Khan
Following the successful application of vision transformers to multiple computer vision tasks, these models have drawn the attention of the signal processing community. This is because signals are often represented as spectrograms (e.g., through the Discrete Fourier Transform), which can be provided directly as input to vision transformers. However, naively applying transformers to spectrograms is suboptimal. Since the axes represent distinct dimensions, i.e., frequency and time, we argue that a… 
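The truncated abstract argues that the two spectrogram axes (frequency and time) should not be treated as interchangeable image dimensions. A minimal NumPy sketch of the general idea of factorizing self-attention over the two axes is shown below; it is an illustrative separable-attention scheme under stated assumptions (single head, no learned projections, a hypothetical `attend` helper), not the authors' exact SepTr architecture:

```python
import numpy as np

def softmax(s):
    # numerically stable softmax over the last axis
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(x):
    # single-head scaled dot-product self-attention over the rows of x
    # x: (n, d) -> (n, d); query/key/value projections omitted for brevity
    d = x.shape[-1]
    w = softmax(x @ x.T / np.sqrt(d))
    return w @ x

def separable_attention(spec):
    # spec: (T, F, d) tokenized spectrogram (time frames x frequency bands x dim).
    # Pass 1: tokens sharing a time index attend to each other (along frequency).
    # Pass 2: tokens sharing a frequency index attend to each other (along time).
    T, F, d = spec.shape
    out = np.stack([attend(spec[t]) for t in range(T)])            # (T, F, d)
    out = np.stack([attend(out[:, f]) for f in range(F)], axis=1)  # (T, F, d)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 16))  # 8 time frames, 4 frequency bands, dim 16
y = separable_attention(x)
print(y.shape)  # (8, 4, 16)
```

The motivation for such a factorization is cost: joint attention over all T*F tokens scales quadratically in T*F, whereas two axis-wise passes scale as T*F^2 + F*T^2, which is much cheaper for long spectrograms.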


Sound Classification and Processing of Urban Environments: A Systematic Literature Review

This review finds that deep learning architectures, attention mechanisms, data augmentation techniques, and pretraining are the most important factors to consider when building an efficient sound classification model.

AHD ConvNet for Speech Emotion Classification

This work proposes a novel mel-spectrogram learning approach in which the model learns emotions from waveform voice recordings in the popular CREMA-D dataset, requiring less training time than other approaches to speech emotion recognition.

Transformers for Urban Sound Classification—A Comprehensive Performance Evaluation

Many relevant sound events occur in urban scenarios, and robust classification models are required to correctly identify abnormal and relevant events among them.

LeRaC: Learning Rate Curriculum

This work proposes a novel curriculum learning approach, termed LeRaC, which uses a different learning rate for each layer of a neural network to create a data-free curriculum during the initial training epochs, generally outperforming CBS by significant margins.



AST: Audio Spectrogram Transformer

The Audio Spectrogram Transformer (AST) is introduced, the first convolution-free, purely attention-based model for audio classification, and an approach is proposed to transfer knowledge from an ImageNet-pretrained ViT to AST.

Transformers in Vision: A Survey

This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline with an introduction to fundamental concepts behind the success of Transformers, i.e., self-attention, large-scale pre-training, and bidirectional feature encoding.

CvT: Introducing Convolutions to Vision Transformers

A new architecture is presented that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs, and the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in this model.

Training data-efficient image transformers & distillation through attention

This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.

Transformer-Based Acoustic Modeling for Streaming Speech Synthesis

A transformer-based acoustic model is proposed whose inference speed is constant regardless of input sequence length, making it ideal for streaming speech synthesis applications and improving speech naturalness for long utterances.

Unsupervised Filterbank Learning Using Convolutional Restricted Boltzmann Machine for Environmental Sound Classification

Using a CNN classifier, the ConvRBM filterbank and its score-level fusion with the Mel filterbank energies (FBEs) gave absolute improvements of 10.65% and 18.70% in classification accuracy, respectively, over FBEs alone on the ESC-50 database, showing that the proposed ConvRBM filterbank contains highly complementary information over the Mel filterbank, which is helpful in the ESC task.

Online Compressive Transformer for End-to-End Speech Recognition

An online transformer for real-time speech recognition is proposed, in which transcriptions are generated chunk by chunk; this OCT not only obtains performance comparable to an offline transformer, but also works faster than the baseline model.

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

Conformer: Convolution-augmented Transformer for Speech Recognition

This work proposes the convolution-augmented transformer for speech recognition, named Conformer, which significantly outperforms the previous Transformer and CNN based models achieving state-of-the-art accuracies.

End-to-End Speaker-Attributed ASR with Transformer

This paper thoroughly updates a model architecture previously designed around a long short-term memory (LSTM)-based attention encoder-decoder by applying transformer architectures, and proposes a speaker deduplication mechanism to reduce speaker identification errors in highly overlapped regions.