• Corpus ID: 220919894

Diet deep generative audio models with structured lottery

Philippe Esling, Ninon Devis, Adrien Bitton, Antoine Caillon, Axel Chemla-Romeu-Santos, Constance Douwes
Deep learning models have provided extremely successful solutions in most audio application fields. However, the high accuracy of these models comes at the expense of a tremendous computation cost. This aspect is almost always overlooked in evaluating the quality of proposed models. However, models should not be evaluated without taking into account their complexity. This aspect is especially critical in audio applications, which heavily rely on specialized embedded hardware with real-time…

EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models

It is shown that pruning before or during training can achieve similar performance to pruning after training and can be trained much faster, while removing entire neurons degrades performance much more than removing parameters.



Rethinking the Value of Network Pruning

It is found that with optimal learning rate, the "winning ticket" initialization as used in Frankle & Carbin (2019) does not bring improvement over random initialization, and the need for more careful baseline evaluations in future research on structured pruning methods is suggested.

GANSynth: Adversarial Neural Audio Synthesis

Through extensive empirical investigations on the NSynth dataset, it is demonstrated that GANs are able to outperform strong WaveNet baselines on automated and human evaluation metrics, and efficiently generate audio several orders of magnitude faster than their autoregressive counterparts.

SNIP: Single-shot Network Pruning based on Connection Sensitivity

This work presents a new approach that prunes a given network once at initialization prior to training, and introduces a saliency criterion based on connection sensitivity that identifies structurally important connections in the network for the given task.

Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders

A powerful new WaveNet-style autoencoder model is detailed that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform, and NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets, is introduced.

Stabilizing the Lottery Ticket Hypothesis

This paper modifies IMP to search for subnetworks that could have been obtained by pruning early in training rather than at iteration 0, and studies subnetwork "stability," finding that, as accuracy improves in this fashion, IMP subnets train to parameters closer to those of the full network and do so with improved consistency in the face of gradient noise.

On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization

This paper suggests that, sometimes, increasing depth can speed up optimization and proves that it is mathematically impossible to obtain the acceleration effect of overparametrization via gradients of any regularizer.

WaveNet: A Generative Model for Raw Audio

WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.

The Early Phase of Neural Network Training

It is found that deep networks are not robust to reinitializing with random weights while maintaining signs, and that weight distributions are highly non-independent even after only a few hundred iterations.

SING: Symbol-to-Instrument Neural Generator

This work presents a lightweight neural audio synthesizer trained end-to-end to generate notes from nearly 1000 instruments with a single decoder, thanks to a new loss function that minimizes the distances between the log spectrograms of the generated and target waveforms.

SampleRNN: An Unconditional End-to-End Neural Audio Generation Model

It is shown that the model, which profits from combining memory-less modules, namely autoregressive multilayer perceptrons, and stateful recurrent neural networks in a hierarchical structure, is able to capture underlying sources of variations in the temporal sequences over very long time spans, on three datasets of different nature.