Corpus ID: 247011489

It's Raw! Audio Generation with State-Space Models

  title={It's Raw! Audio Generation with State-Space Models},
  author={Karan Goel and Albert Gu and Chris Donahue and Christopher Ré},
Developing architectures suitable for modeling raw audio is a challenging problem due to the high sampling rates of audio waveforms. Standard sequence modeling approaches like RNNs and CNNs have previously been tailored to fit the demands of audio, but the resultant architectures make undesirable computational tradeoffs and struggle to model waveforms effectively. We propose SaShiMi, a new multi-scale architecture for waveform modeling built around the recently introduced S4 model for long… 
GoodBye WaveNet -- A Language Model for Raw Audio with Context of 1/2 Million Samples
This work proposes a generative autoregressive architecture that can model audio waveforms over a large context of more than 500,000 samples, on a standard dataset for modeling long-term structure.
Diagonal State Spaces are as Effective as Structured State Spaces
The Diagonal State Space (DSS) model matches the performance of S4 on Long Range Arena tasks and on speech classification on the Speech Commands dataset, while being conceptually simpler and more straightforward to implement.
Multi-instrument Music Synthesis with Spectrogram Diffusion
This work compares training the decoder as an autoregressive model and as a Denoising Diffusion Probabilistic Model (DDPM) and finds that the DDPM approach is superior both qualitatively and as measured by audio reconstruction and Fréchet distance metrics.
Adversarial Audio Synthesis with Complex-valued Polynomial Networks
This work introduces complex-valued polynomial networks, called APOLLO, that integrate such complex-valued representations in a natural way and capture high-order correlations of the input elements using high-order tensors as scaling parameters.
On the Parameterization and Initialization of Diagonal State Space Models
A simple diagonal version of S4 whose kernel computation requires just 2 lines of code and performs comparably to S4 in almost all settings, with state-of-the-art results for image, audio, and medical time-series domains, and averaging 85% on the Long Range Arena benchmark.
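The summary above notes that the diagonal kernel takes roughly two lines to compute. A minimal numpy sketch of a diagonal SSM convolution kernel is below; the shapes, the zero-order-hold-style discretization, and the real-part projection are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def diagonal_ssm_kernel(A, B, C, step, L):
    """Length-L convolution kernel of a diagonal state-space model.

    With a diagonal (complex) state matrix A of shape (N,), the kernel
    reduces to a Vandermonde-style matrix-vector product -- essentially
    the "two lines" the S4D summary refers to.
    """
    dA = np.exp(step * A)  # discretized diagonal state matrix, shape (N,)
    # K[l] = sum_n C_n * B_n * dA_n^l, for l = 0..L-1
    K = ((C * B)[None, :] * dA[None, :] ** np.arange(L)[:, None]).sum(-1)
    return K.real
```

The naive recurrence would cost O(L·N) sequential steps; collapsing it into this closed form is what makes the kernel cheap to materialize.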
Efficiently Modeling Long Sequences with Structured State Spaces
The Structured State Space (S4) sequence model is proposed based on a new parameterization for the SSM, and it is shown that it can be computed much more efficiently than prior approaches while preserving their theoretical strengths.
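The S4 entry above rests on the dual view of a linear state-space model: it can be unrolled as a recurrence or applied as one long convolution. A small numpy sketch of that equivalence follows, using dense matrices for clarity; S4's contribution is computing the kernel efficiently for a structured A, which this naive version deliberately does not attempt:

```python
import numpy as np

def ssm_recurrent(A, B, C, u):
    # Linear state-space recurrence: x_k = A x_{k-1} + B u_k, y_k = C x_k.
    x = np.zeros(A.shape[0])
    ys = []
    for u_k in u:
        x = A @ x + B * u_k
        ys.append(C @ x)
    return np.array(ys)

def ssm_convolution(A, B, C, u):
    # Equivalent convolutional view: y = K * u with kernel K_l = C A^l B.
    L = len(u)
    K = np.array([C @ np.linalg.matrix_power(A, l) @ B for l in range(L)])
    return np.array([sum(K[l] * u[k - l] for l in range(k + 1)) for k in range(L)])
```

The recurrent form gives fast stateful generation; the convolutional form gives parallel training. Both produce identical outputs, which is the property the S4 parameterization exploits.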
SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping
Experimental results showed that SpecGrad generates higher-fidelity speech waveforms than conventional DDPM-based neural vocoders in both analysis-synthesis and speech enhancement scenarios.
Improving the Diagnosis of Psychiatric Disorders with Self-Supervised Graph State Space Models
This work presents a two-stage framework to improve the diagnosis of heterogeneous psychiatric disorders from resting-state functional magnetic resonance imaging (rs-fMRI), and proposes a self-supervised mask prediction task on data from healthy individuals that can exploit differences between healthy controls and patients in clinical datasets.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
This work proposes FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM, and is optimal for a range of SRAM sizes.
On the link between conscious function and general intelligence in humans and machines
This work examines the cognitive abilities associated with three contemporary theories of conscious function: Global Workspace Theory (GWT), Information Generation Theory (IGT), and Attention Schema Theory (AST) to propose ways in which insights from each of the three theories may be combined into a unified model.
WaveFlow: A Compact Flow-based Model for Raw Audio
WaveFlow provides a unified view of likelihood-based models for 1-D data, including WaveNet and WaveGlow as special cases, while synthesizing several orders of magnitude faster as it only requires a few sequential steps to generate very long waveforms with hundreds of thousands of time-steps.
FloWaveNet: A Generative Flow for Raw Audio
FloWaveNet is proposed, a flow-based generative model for raw audio synthesis that requires only a single-stage training procedure and a single maximum likelihood loss, without any additional auxiliary terms, and it is inherently parallel due to the characteristics of generative flow.
WaveNet: A Generative Model for Raw Audio
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
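WaveNet's core operation is a stack of dilated causal convolutions. A minimal sketch of one such layer (kernel size 2, numpy, with placeholder weights rather than trained parameters) might look like:

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """One causal 1-D convolution with dilation (kernel size 2).

    y[t] = w[0] * x[t - dilation] + w[1] * x[t], with zero-padding on
    the left so the output never depends on future samples.
    """
    pad = np.concatenate([np.zeros(dilation), x])
    return w[0] * pad[:-dilation] + w[1] * pad[dilation:]
```

Stacking such layers with dilations 1, 2, 4, …, 512 grows the receptive field exponentially (to 1024 samples per stack), which is what makes modeling audio at tens of thousands of samples per second tractable.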
Efficient Neural Audio Synthesis
A single-layer recurrent neural network with a dual softmax layer, the WaveRNN, that matches the quality of the state-of-the-art WaveNet model, and a new generation scheme based on subscaling that folds a long sequence into a batch of shorter sequences and allows one to generate multiple samples at once.
The challenge of realistic music generation: modelling raw audio at scale
Autoregressive discrete autoencoders (ADAs) are explored as a means to enable autoregressive models to capture long-range correlations in waveforms and are found to unconditionally generate piano music directly in the raw audio domain, which shows stylistic consistency across tens of seconds.
MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis
The model is non-autoregressive, fully convolutional, with significantly fewer parameters than competing models and generalizes to unseen speakers for mel-spectrogram inversion, and suggests a set of guidelines to design general purpose discriminators and generators for conditional sequence synthesis tasks.
Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders
A powerful new WaveNet-style autoencoder model is detailed that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform, and NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets, is introduced.
High Fidelity Speech Synthesis with Adversarial Networks
GAN-TTS is capable of generating high-fidelity speech with naturalness comparable to the state-of-the-art models, and unlike autoregressive models, it is highly parallelisable thanks to an efficient feed-forward generator.
SampleRNN: An Unconditional End-to-End Neural Audio Generation Model
It is shown that the model, which profits from combining memory-less modules, namely autoregressive multilayer perceptrons, and stateful recurrent neural networks in a hierarchical structure, is able to capture underlying sources of variation in the temporal sequences over very long time spans, on three datasets of different nature.
DiffWave: A Versatile Diffusion Model for Audio Synthesis
DiffWave significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity from various automatic and human evaluations.
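DiffWave belongs to the denoising-diffusion family, in which training corrupts a clean waveform x_0 into a noisy x_t and a network learns to reverse the corruption. A sketch of the standard forward noising step follows; the noise schedule here is a placeholder assumption, not DiffWave's published schedule:

```python
import numpy as np

def ddpm_forward_sample(x0, t, alpha_bar, rng):
    # q(x_t | x_0): x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I).
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# Placeholder schedule: alpha_bar decays from near 1 toward 0 over T steps.
T = 50
betas = np.linspace(1e-4, 0.05, T)
alpha_bar = np.cumprod(1.0 - betas)
```

Generation then runs the learned reverse process from pure noise, which is what lets such models trade the strict sample-by-sample ordering of autoregressive vocoders for a fixed number of denoising steps.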