• Corpus ID: 238583163

Large Scale Audio Understanding without Transformers/ Convolutions/ BERTs/ Mixers/ Attention/ RNNs or

@article{Verma2021LargeSA,
  title={Large Scale Audio Understanding without Transformers/ Convolutions/ BERTs/ Mixers/ Attention/ RNNs or},
  author={Prateek Verma},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.03183}
}
This paper presents a way of doing large-scale audio understanding without traditional state-of-the-art neural architectures. Ever since the introduction of deep learning for understanding audio signals in the past decade, convolutional architectures have been able to achieve state-of-the-art results, surpassing traditional hand-crafted features. In the recent past, there has been a similar shift away from traditional convolutional and recurrent neural networks towards purely end-to-end… 


References

SHOWING 1-10 OF 55 REFERENCES

Audio Transformers: Transformer Architectures For Large Scale Audio Understanding. Adieu Convolutions

This work proposes applying Transformer-based architectures without convolutional layers to raw audio signals, and shows that the model learns a non-linear, non-constant-bandwidth filter-bank, yielding an adaptable time-frequency front-end representation for the task of audio understanding.

AST: Audio Spectrogram Transformer

The Audio Spectrogram Transformer (AST) is introduced, the first convolution-free, purely attention-based model for audio classification, along with an approach to transfer knowledge from an ImageNet-pretrained ViT to AST.

A Generative Model for Raw Audio Using Transformer Architectures

  • Prateek Verma, C. Chafe
  • Computer Science
    2021 24th International Conference on Digital Audio Effects (DAFx)
  • 2021
This paper proposes a novel way of doing audio synthesis at the waveform level using Transformer architectures, and shows how causal transformer generative models can be used for raw waveform synthesis.

A Framework for Contrastive and Generative Learning of Audio Representations

This paper presents a framework for contrastive learning of audio representations in a self-supervised setting, without access to any ground-truth labels, and explores generative models based on state-of-the-art Transformer-based architectures for learning latent spaces for audio signals.

Large-Scale Weakly Supervised Audio Classification Using Gated Convolutional Neural Network

In this paper, we present a gated convolutional neural network and a temporal attention-based localization method for audio classification, which won 1st place in the large-scale weakly supervised sound event detection task of the DCASE 2017 challenge.

MLP-Mixer: An all-MLP Architecture for Vision

It is shown that while convolutions and attention are both sufficient for good performance, neither of them is necessary, and MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs), attains competitive scores on image classification benchmarks.
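The Mixer idea the snippet describes — alternating MLPs over the patch axis and the channel axis, with no convolution or attention — can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the sizes, the ReLU nonlinearity (the paper uses GELU), and the omission of LayerNorm are simplifying assumptions.

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    """Two-layer perceptron; ReLU stands in for the paper's GELU."""
    h = np.maximum(x @ w1 + b1, 0.0)
    return h @ w2 + b2

def mixer_block(x, token_params, channel_params):
    """One Mixer block on x of shape (num_patches, channels).

    Token mixing applies an MLP across the patch axis (weights shared
    over channels); channel mixing applies an MLP across the channel
    axis (weights shared over patches). Both use residual connections.
    """
    y = x + mlp(x.T, *token_params).T   # token mixing
    return y + mlp(y, *channel_params)  # channel mixing

# Tiny demo with illustrative sizes: 8 patches, 16 channels, hidden width 32.
rng = np.random.default_rng(0)
P, C, H = 8, 16, 32
token_params = (rng.normal(size=(P, H)) * 0.1, np.zeros(H),
                rng.normal(size=(H, P)) * 0.1, np.zeros(P))
channel_params = (rng.normal(size=(C, H)) * 0.1, np.zeros(H),
                  rng.normal(size=(H, C)) * 0.1, np.zeros(C))
out = mixer_block(rng.normal(size=(P, C)), token_params, channel_params)
print(out.shape)  # a Mixer block preserves the input shape
```

Note that a Mixer block maps a (patches, channels) array to an array of the same shape, so blocks can be stacked depth-wise like Transformer layers.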

Conditional End-to-End Audio Transforms

An end-to-end method for transforming audio from one style to another, based on convolutional and hierarchical recurrent neural networks, is designed to capture long-term acoustic dependencies, requires minimal post-processing, and produces realistic audio transforms.

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
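The attention mechanism at the heart of the Transformer is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. A minimal NumPy sketch of a single (un-batched, single-head) attention call, with illustrative shapes, might look like this:

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (num_queries, d_k), K: (num_keys, d_k), V: (num_keys, d_v).
    Returns the attended values and the attention weights.
    """
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # each row sums to 1
    return weights @ V, weights

# Demo: 4 queries attending over 6 key/value positions (sizes are illustrative).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 5))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (num_queries, d_v)
```

The √d_k scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients.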

Audio-Based Music Classification with DenseNet And Data Augmentation

This is the first work to apply Densely Connected Convolutional Networks (DenseNet) to music audio tagging; DenseNet is demonstrated to perform better than the Residual neural network (ResNet), and the proposed combination of DenseNet's strong representations and data augmentation can be adapted to other audio processing tasks.

CNN architectures for large-scale audio classification

This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on the authors' audio classification task, and larger training and label sets help up to a point.
...