Corpus ID: 238419336

ATTENTION IS ALL YOU NEED? GOOD EMBEDDINGS WITH STATISTICS ARE ENOUGH AUDIO UNDERSTANDING WITHOUT CONVOLUTIONS/TRANSFORMERS/BERTS/MIXERS/ATTENTION/RNNS

@inproceedings{Verma2021ATTENTIONIA,
  title={ATTENTION IS ALL YOU NEED? GOOD EMBEDDINGS WITH STATISTICS ARE ENOUGH AUDIO UNDERSTANDING WITHOUT CONVOLUTIONS/TRANSFORMERS/BERTS/MIXERS/ATTENTION/RNNS},
  author={Prateek Verma},
  year={2021}
}
This paper presents a way of doing large-scale audio understanding without traditional state-of-the-art neural architectures. Ever since the introduction of deep learning for understanding audio signals in the past decade, convolutional architectures have achieved state-of-the-art results, surpassing traditional hand-crafted features. In the recent past, there has been a similar shift away from traditional convolutional and recurrent neural networks towards purely end-to-end…
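The title suggests the paper's core idea: summary statistics pooled over good frame embeddings can stand in for attention-based aggregation. A minimal sketch of mean-plus-standard-deviation pooling, turning a variable-length sequence of embeddings into one fixed-size clip vector (the function name is illustrative; the paper's actual embedding front end and classifier are omitted):

```python
import math

def stats_pool(frames):
    """Pool a variable-length list of equal-dimension frame embeddings
    into a single fixed-size vector by concatenating the per-dimension
    mean and standard deviation. No attention, recurrence, or
    convolution is involved."""
    dim = len(frames[0])
    n = len(frames)
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
            for d in range(dim)]
    return means + stds  # length 2*dim, regardless of n

# Two 2-d "frames" collapse to one 4-d clip-level vector
clip_vector = stats_pool([[1.0, 2.0], [3.0, 4.0]])
print(clip_vector)  # [2.0, 3.0, 1.0, 1.0]
```

The pooled vector could then feed any lightweight classifier; the point of the sketch is only that the aggregation step itself needs no learned sequence model.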

Figures and Tables from this paper

Improved Vit via knowledge distallation on small datasets

  • Jun Wang, Weifeng Liu, Weishan Zhang, Baodi Liu
  • Computer Science
    2022 16th IEEE International Conference on Signal Processing (ICSP)
  • 2022
This work introduces a teacher-student strategy for transformers that relies on a distillation token to ensure the student learns from the teacher through attention, obtaining results comparable to convnets.

Research on Satellite Network Traffic Prediction Based on Improved GRU Neural Network

A satellite network traffic forecasting method based on an improved gated recurrent unit (GRU) that combines an attention mechanism with the GRU network: it mines the self-similarity and long-range correlation among traffic data sequences, weights the importance of traffic data and hidden states, and learns the time-dependent characteristics of the input sequences.
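The combination described above, weighting recurrent hidden states by attention, can be sketched in isolation. This is a generic dot-product attention pooling over a sequence of hidden states, not the paper's model: the GRU recurrence that would produce the states is omitted, and all names are illustrative:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(hidden_states, query):
    """Score each hidden state against a query vector by dot product,
    normalize the scores with softmax, and return the weighted sum.
    This is how attention lets a forecaster emphasize the most
    relevant time steps instead of only the last hidden state."""
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query))
              for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden_states))
            for d in range(dim)]

# Identical states get equal weight, so the context is that state
context = attend([[1.0, 0.0], [1.0, 0.0]], [1.0, 0.0])
print(context)  # [1.0, 0.0]
```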

References

SHOWING 1-10 OF 21 REFERENCES

Audio Transformers: Transformer Architectures For Large Scale Audio Understanding. Adieu Convolutions

This work proposes applying Transformer-based architectures without convolutional layers to raw audio signals, and shows how the model learns a non-linear, non-constant-bandwidth filter-bank, an adaptable time-frequency front-end representation for the task of audio understanding.

A Generative Model for Raw Audio Using Transformer Architectures

  • Prateek Verma, C. Chafe
  • Computer Science
    2021 24th International Conference on Digital Audio Effects (DAFx)
  • 2021
This paper proposes a novel way of doing audio synthesis at the waveform level using Transformer architectures, and shows how causal transformer generative models can be used for raw waveform synthesis.

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
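The core operation of the Transformer is scaled dot-product attention, softmax(QKᵀ/√d_k)V. A didactic single-head sketch without learned projections, masking, or batching (list-of-rows matrices stand in for tensors):

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for matrices given as
    lists of row vectors. Each query row attends over all key rows
    and returns a weighted mix of the value rows."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, V))
                    for d in range(len(V[0]))])
    return out

# One query, two identical keys: values are averaged 50/50
print(scaled_dot_product_attention([[1.0, 0.0]],
                                   [[1.0, 0.0], [1.0, 0.0]],
                                   [[1.0], [3.0]]))  # [[2.0]]
```

The 1/√d_k scaling keeps dot products from saturating the softmax as the key dimension grows, which is the design choice the original paper highlights.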

A Framework for Contrastive and Generative Learning of Audio Representations

This paper presents a framework for contrastive learning of audio representations in a self-supervised setting, without access to any ground-truth labels, and explores generative models based on state-of-the-art transformer architectures for learning latent spaces for audio signals.

MLP-Mixer: An all-MLP Architecture for Vision

It is shown that while convolutions and attention are both sufficient for good performance, neither is necessary: MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs), attains competitive scores on image classification benchmarks.
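Mixer alternates an MLP applied across tokens (patches) with an MLP applied across channels, switching between the two views by transposition. A heavily simplified sketch, assuming purely linear mixing layers and omitting the LayerNorm, GELU, and skip connections the real architecture uses:

```python
def transpose(M):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*M)]

def linear(M, W):
    """Apply a linear layer (matrix multiply, no bias or activation
    for brevity) to each row of M."""
    return [[sum(r[k] * W[k][j] for k in range(len(W)))
             for j in range(len(W[0]))] for r in M]

def mixer_block(X, W_tok, W_ch):
    """One simplified Mixer block over X with shape tokens x channels:
    token-mixing acts down the columns (across patches), channel-mixing
    acts along the rows (across features)."""
    X = transpose(linear(transpose(X), W_tok))  # mix across tokens
    X = linear(X, W_ch)                         # mix across channels
    return X

# With identity weights both mixing steps are no-ops
I = [[1.0, 0.0], [0.0, 1.0]]
print(mixer_block([[1.0, 2.0], [3.0, 4.0]], I, I))  # [[1.0, 2.0], [3.0, 4.0]]
```

The transpose trick is the whole architectural idea: the same cheap per-row MLP machinery mixes spatial information in one orientation and channel information in the other.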

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Conditional End-to-End Audio Transforms

An end-to-end method for transforming audio from one style to another, based on convolutional and hierarchical recurrent neural networks; it is designed to capture long-term acoustic dependencies, requires minimal post-processing, and produces realistic audio transforms.

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

We show for the first time that learning powerful representations from speech audio alone, followed by fine-tuning on transcribed speech, can outperform the best semi-supervised methods while being conceptually simpler.

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
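ViT's tokenization step, cutting an image into fixed-size flattened patches that are then treated like words, can be sketched directly. This toy version assumes a single-channel image whose sides divide evenly by the patch size, and skips the learned linear projection and position embeddings:

```python
def image_to_patches(img, P):
    """Split an H x W grayscale image (list of pixel rows) into
    non-overlapping P x P patches, each flattened row-major into a
    vector. The resulting patch sequence is what a Vision
    Transformer consumes as its "words"."""
    H, W = len(img), len(img[0])
    patches = []
    for i in range(0, H, P):           # top-left corner of each patch
        for j in range(0, W, P):
            patch = [img[i + di][j + dj]
                     for di in range(P) for dj in range(P)]
            patches.append(patch)
    return patches

# A 2x2 image with P=1 yields four one-pixel tokens
print(image_to_patches([[1, 2], [3, 4]], 1))  # [[1], [2], [3], [4]]
```

With a 224x224 image and P=16 this yields 196 tokens, which is where the "16x16 words" framing in the title comes from.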

Unsupervised Learning of Audio Perception for Robotics Applications: Learning to Project Data to T-SNE/UMAP space

This paper builds on key ideas to develop perception of touch sounds without access to any ground-truth data, and shows how to leverage ideas from classical signal processing to obtain large amounts of data for any sound of interest with high precision.