• Corpus ID: 233481116

Audio Transformers: Transformer Architectures For Large Scale Audio Understanding. Adieu Convolutions

Prateek Verma, Jonathan Berger
Over the past two decades, CNN architectures have produced compelling models of sound perception and cognition, learning hierarchical organizations of features. Analogous to successes in computer vision, audio feature classification can be optimized for a particular task of interest, over a wide variety of datasets and labels. In fact, similar architectures designed for image understanding have proven effective for acoustic scene analysis. Here we propose applying Transformer-based architectures…

GoodBye WaveNet - A Language Model for Raw Audio with Context of 1/2 Million Samples

This work proposes a generative auto-regressive architecture that can model audio waveforms over quite a large context, greater than 500,000 samples, on a standard dataset, with the same number of parameters/context to show improvements.


Large Scale Audio Understanding without Transformers/ Convolutions/ BERTs/ Mixers/ Attention/ RNNs or

This work shows how to surpass traditional convolutional neural network architectures and come strikingly close to outperforming powerful Transformer architectures, which would pave the way for exciting advancements in the field of representation learning without massive, end-to-end neural architectures.

Adversarial Audio Detection Method Based on Transformer

  • Yunchen Li, D. Luo
  • Computer Science
    2022 International Conference on Machine Learning and Intelligent Systems Engineering (MLISE)
  • 2022
An adversarial detection framework that detects adversarial audio examples using the transformer self-attention mechanism, achieving a detection accuracy above 96.5% under white-box attacks, black-box attacks, and noisy conditions.

One-Shot Acoustic Matching Of Audio Signals -- Learning to Hear Music In Any Room/ Concert Hall

The acoustic space in which a sound is created and heard plays an essential role in how that sound is perceived by affording a unique sense of presence. Every sound we hear results from successive

H4VDM: H.264 Video Device Matching

This paper proposes a technique that can determine if two given video sequences are captured by the same device, even if the method has never encountered the device in training, and denotes it as H.264 Video Device Matching (H4VDM).

Generating Coherent Drum Accompaniment With Fills And Improvisations

This work uses the transformer sequence to sequence model to generate a basic drum pattern conditioned on the melodic accompaniment, and proposes a novelty function to capture the extent of improvisation in a bar relative to its neighbors.

Enhancing Audio Perception of Music By AI Picked Room Acoustics

Every sound that we hear is the result of successive convolutional operations (e.g. room acoustics, microphone characteristics, resonant properties of the instrument itself, not to mention

It's Time for Artistic Correspondence in Music and Video

A self-supervised approach that learns this correspondence directly from data, without any need of human annotations, using Transformer networks for each modality to model the long-term temporal context of both the video and the music signals.

CMKD: CNN/Transformer-Based Cross-Model Knowledge Distillation for Audio Classification

An intriguing interaction is found between two very different models: CNN and AST models are good teachers for each other. When either of them is used as the teacher and the other is trained as the student via knowledge distillation, the performance of the student model noticeably improves and, in many cases, is better than that of the teacher model.



FSD50K: An Open Dataset of Human-Labeled Sound Events

FSD50K is introduced, an open dataset containing over 51k audio clips totalling over 100 h of audio, manually labeled using 200 classes drawn from the AudioSet Ontology, to provide an alternative benchmark dataset and thus foster sound event recognition (SER) research.

Attention is All you Need

A new simple network architecture, the Transformer, is proposed. It is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.

Frequency Estimation from Waveforms Using Multi-Layered Neural Networks

It is found that learning representations from raw time-domain signals can achieve performance on par with the current state-of-the-art algorithms for frequency estimation in noisy and polyphonic settings.

Audio Set: An ontology and human-labeled dataset for audio events

The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

This work simplifies the MoE routing algorithm and designs intuitive improved models with reduced communication and computational costs. It advances the current scale of language models by pre-training up to trillion-parameter models on the “Colossal Clean Crawled Corpus” and achieves a 4x speedup over the T5-XXL model.

Language Through a Prism: A Spectral Approach for Multiscale Language Representations

This work applies spectral filters to the activations of a neuron across an input, producing filtered embeddings that perform well on part-of-speech tagging, dialog speech act classification, or topic classification, while performing poorly on the other tasks.

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

A Framework for Contrastive and Generative Learning of Audio Representations

This paper presents a framework for contrastive learning of audio representations in a self-supervised setting, without access to any ground truth labels, and explores generative models based on state-of-the-art transformer-based architectures for learning latent spaces for audio signals.

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being

Language Models are Few-Shot Learners

GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.