A Generative Model for Raw Audio Using Transformer Architectures

@article{Verma2021AGM,
  title={A Generative Model for Raw Audio Using Transformer Architectures},
  author={Prateek Verma and Chris Chafe},
  journal={2021 24th International Conference on Digital Audio Effects (DAFx)},
  year={2021},
  pages={230-237}
}
  • Published 30 June 2021
This paper proposes a novel way of doing audio synthesis at the waveform level using Transformer architectures. We propose a deep neural network for generating waveforms, similar to WaveNet [1]. The model is fully probabilistic, auto-regressive, and causal, i.e., each generated sample depends only on previously observed samples. Our approach outperforms a widely used WaveNet architecture by up to 9% on a similar dataset for predicting the next step. Using the attention mechanism, we enable the…
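
For intuition, here is a minimal PyTorch sketch of the kind of causal, auto-regressive next-sample Transformer the abstract describes. The class name, layer sizes, and the assumed 8-bit mu-law quantization (256 classes) are illustrative choices, not the authors' actual configuration.

import torch
import torch.nn as nn

class CausalAudioTransformer(nn.Module):
    # Sketch: predict a distribution over the next quantized sample
    # (assumed 8-bit mu-law -> 256 classes) from all previous samples.
    def __init__(self, n_classes=256, d_model=128, n_heads=4,
                 n_layers=4, max_len=2048):
        super().__init__()
        self.embed = nn.Embedding(n_classes, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                       # x: (batch, time) ints
        t = x.size(1)
        pos = torch.arange(t, device=x.device)
        # Causal mask: position i attends only to positions <= i.
        mask = torch.triu(torch.full((t, t), float('-inf'),
                                     device=x.device), diagonal=1)
        h = self.encoder(self.embed(x) + self.pos(pos), mask=mask)
        return self.head(h)                     # (batch, time, 256) logits

model = CausalAudioTransformer()
clip = torch.randint(0, 256, (2, 1024))         # two 1024-sample clips
logits = model(clip)
# Next-step prediction: cross-entropy against the one-step-shifted input.
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 256), clip[:, 1:].reshape(-1))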

Citations

Large Scale Audio Understanding without Transformers/Convolutions/BERTs/Mixers/Attention/RNNs or …

TLDR
This work shows how to surpass traditional convolutional neural network architectures, and come strikingly close to outperforming powerful Transformer architectures, which would pave the way for exciting advances in representation learning without massive, end-to-end neural architectures.

Enhancing Audio Perception of Music By AI Picked Room Acoustics

Every sound that we hear is the result of successive convolutional operations (e.g. room acoustics, microphone characteristics, resonant properties of the instrument itself, not to mention…)
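
A toy illustration of that point: applying a room's impulse response to a dry signal is a single convolution. The sample rate and the impulse response below are made up for the sketch.

import numpy as np
from scipy.signal import fftconvolve

sr = 16000
dry = np.random.randn(sr)                       # 1 s of a "dry" source signal
rir = np.zeros(sr // 4)                         # toy room impulse response
rir[0], rir[2000], rir[5000] = 1.0, 0.4, 0.15   # direct path + 2 reflections
wet = fftconvolve(dry, rir)                     # the signal "as heard in the room"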

MT3: Multi-Task Multitrack Music Transcription

TLDR
This work demonstrates that a general-purpose Transformer model can perform multi-task AMT, jointly transcribing arbitrary combinations of musical instruments across several transcription datasets, dramatically improving performance for low-resource instruments while preserving strong performance for abundant instruments.

Generating Coherent Drum Accompaniment With Fills And Improvisations

TLDR
This work uses a Transformer sequence-to-sequence model to generate a basic drum pattern conditioned on the melodic accompaniment, and proposes a novelty function to capture the extent of improvisation in a bar relative to its neighbors.

GoodBye WaveNet - A Language Model for Raw Audio with Context of 1/2 Million Samples

TLDR
This work proposes a generative auto-regressive architecture that can model audio waveforms over a very large context, greater than 500,000 samples, and shows improvements on a standard dataset while using the same number of parameters and context as baselines.

References

Showing 1-10 of 55 references

WaveNet: A Generative Model for Raw Audio

TLDR
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
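
The mechanism behind WaveNet's large receptive field is the dilated causal convolution; a stripped-down PyTorch sketch with assumed channel counts and depth (not the paper's exact configuration) looks like this.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalStack(nn.Module):
    # WaveNet-style sketch: dilation doubles each layer, so the receptive
    # field grows exponentially with depth (here 2**8 = 256 samples).
    def __init__(self, channels=32, n_layers=8, kernel_size=2):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size, dilation=2 ** i)
            for i in range(n_layers))

    def forward(self, x):                       # x: (batch, channels, time)
        for conv in self.convs:
            pad = (conv.kernel_size[0] - 1) * conv.dilation[0]
            x = torch.relu(conv(F.pad(x, (pad, 0))))   # left pad => causal
        return x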

Audio Transformers: Transformer Architectures For Large Scale Audio Understanding. Adieu Convolutions

TLDR
This work proposes applying Transformer-based architectures without convolutional layers to raw audio signals, and shows how the model learns a non-linear, non-constant-bandwidth filter bank: an adaptable time-frequency front-end representation for the task of audio understanding.

A Framework for Contrastive and Generative Learning of Audio Representations

TLDR
This paper presents a framework for contrastive learning of audio representations in a self-supervised setting, without access to any ground-truth labels, and explores generative models based on state-of-the-art Transformer architectures for learning latent spaces for audio signals.
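
The paper's exact objective is not reproduced here; a generic InfoNCE-style contrastive loss of the kind such frameworks typically use looks like this (a sketch, with an assumed temperature of 0.1).

import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    # Generic contrastive objective: row i of z1 and row i of z2 are two
    # "views" of the same clip (positives); all other rows are negatives.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature          # (batch, batch) cosine sims
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)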

Efficient Neural Audio Synthesis

TLDR
The WaveRNN, a single-layer recurrent neural network with a dual softmax layer, matches the quality of the state-of-the-art WaveNet model; a new generation scheme based on subscaling folds a long sequence into a batch of shorter sequences and allows multiple samples to be generated at once.
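
The subscaling idea can be pictured as a tensor fold; this toy sketch shows only the fold/unfold bookkeeping, not the paper's full conditioning scheme.

import torch

x = torch.arange(16)                  # a length-16 "waveform" 0..15
B = 4                                 # subscale factor
# Fold: sub-sequence b holds samples b, b+B, b+2B, ... so the B shorter
# sequences can be generated largely in parallel, then interleaved back.
folded = x.view(-1, B).t()            # shape (B, 16 // B)
restored = folded.t().reshape(-1)     # interleave back to original order
assert torch.equal(restored, x)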

The challenge of realistic music generation: modelling raw audio at scale

TLDR
Autoregressive discrete autoencoders (ADAs) are explored as a means of enabling autoregressive models to capture long-range correlations in waveforms, and are found to unconditionally generate piano music directly in the raw audio domain with stylistic consistency across tens of seconds.

Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset

TLDR
By using notes as an intermediate representation, a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure on timescales spanning six orders of magnitude are trained, a process the authors call Wave2Midi2Wave.

Conditional End-to-End Audio Transforms

TLDR
An end-to-end method for transforming audio from one style to another is presented, based on convolutional and hierarchical recurrent neural networks; it is designed to capture long-term acoustic dependencies, requires minimal post-processing, and produces realistic audio transforms.

Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions

This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms.

Jukebox: A Generative Model for Music

TLDR
It is shown that the combined model at scale can generate high-fidelity and diverse songs with coherence up to multiple minutes, and can condition on artist and genre to steer the musical and vocal style, and on unaligned lyrics to make the singing more controllable.

Generating Long Sequences with Sparse Transformers

TLDR
This paper introduces sparse factorizations of the attention matrix which reduce the cost of self-attention from quadratic to $O(n \sqrt{n})$, generates unconditional samples that demonstrate global coherence and great diversity, and shows it is possible in principle to use self-attention to model sequences of length one million or more.
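
A toy construction of a strided sparsity pattern, assuming a stride of $\sqrt{n}$ as in the paper's strided variant, shows why the cost drops to $O(n \sqrt{n})$; the mask below is illustrative, not the paper's exact factorization.

import numpy as np

n = 64
stride = int(np.sqrt(n))                            # sqrt(n) = 8
mask = np.zeros((n, n), dtype=bool)
for i in range(n):
    mask[i, max(0, i - stride + 1):i + 1] = True    # local band, width sqrt(n)
    cols = np.arange(i + 1)
    mask[i, cols[cols % stride == stride - 1]] = True  # every sqrt(n)-th key
# Retained entries grow as O(n * sqrt(n)), versus O(n^2) for dense attention.
print(int(mask.sum()), "of", n * n, "entries kept")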
...