Corpus ID: 47015908

Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation

@article{Stoller2018WaveUNetAM,
  title={Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation},
  author={Daniel Stoller and Sebastian Ewert and Simon Dixon},
  journal={ArXiv},
  year={2018},
  volume={abs/1806.03185}
}
Models for audio source separation usually operate on the magnitude spectrum, which ignores phase information and makes separation performance dependent on hyper-parameters for the spectral front-end. Key Method: We introduce further architectural improvements, including an output layer that enforces source additivity, an upsampling technique and a context-aware prediction framework to reduce output artifacts. Experiments for singing voice separation indicate that our architecture yields a performance…
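
Two of the abstract's architectural points lend themselves to a quick illustration: the additivity-enforcing output layer (predict K-1 sources, derive the last as a residual so the estimates always sum to the mixture) and interpolation-based upsampling. A minimal PyTorch sketch under those assumptions; the function names are illustrative, and the plain linear interpolation stands in for the paper's learned interpolation weights:

```python
import torch
import torch.nn.functional as F

def difference_output(mixture: torch.Tensor, est_sources: torch.Tensor) -> torch.Tensor:
    """Source-additivity output layer: given the mixture (B, 1, T) and K-1
    estimated sources (B, K-1, T), derive the K-th source as the residual,
    so all K estimates sum exactly to the input mixture."""
    residual = mixture - est_sources.sum(dim=1, keepdim=True)
    return torch.cat([est_sources, residual], dim=1)  # (B, K, T)

def upsample(x: torch.Tensor) -> torch.Tensor:
    """Artifact-reducing upsampling of (B, C, T) features: interpolation
    instead of transposed convolution (linear here, as a stand-in for the
    paper's learned variant)."""
    return F.interpolate(x, scale_factor=2, mode="linear", align_corners=True)

mix = torch.randn(2, 1, 16384)
vocals = torch.randn(2, 1, 16384)            # K-1 = 1 estimated source
sources = difference_output(mix, vocals)     # vocals + accompaniment
assert torch.allclose(sources.sum(dim=1, keepdim=True), mix, atol=1e-6)
```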

Citations

End-to-end Networks for Supervised Single-channel Speech Separation

TLDR
An end-to-end source separation network that estimates the separated speech waveform by operating directly on the raw waveform of the mixture, investigating the use of composite cost functions derived from objective evaluation metrics as measured on waveforms.
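
As a rough illustration of a cost derived from an objective waveform metric, the sketch below blends an L1 sample loss with a negative SI-SDR term; this is a hedged stand-in, not the specific composite costs evaluated in that paper:

```python
import torch

def si_sdr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SDR in dB for (B, T) waveforms."""
    ref_energy = (ref ** 2).sum(dim=-1, keepdim=True) + eps
    proj = ((est * ref).sum(dim=-1, keepdim=True) / ref_energy) * ref
    noise = est - proj
    ratio = (proj ** 2).sum(dim=-1) / ((noise ** 2).sum(dim=-1) + eps)
    return 10 * torch.log10(ratio + eps)

def composite_loss(est: torch.Tensor, ref: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Blend a sample-level L1 error with a metric-derived term (negative SI-SDR)."""
    l1 = (est - ref).abs().mean()
    return alpha * l1 - (1 - alpha) * si_sdr(est, ref).mean()
```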

Towards end-to-end speech enhancement with a variational U-Net architecture

TLDR
Experiments show that the residual (skip) connections in the proposed system are required for successful end-to-end signal enhancement, i.e., without filter mask estimation, and indicate a slight advantage of the variational U-Net architecture over its non-variational version in terms of signal enhancement performance under reverberant conditions.

Time-Domain Audio Source Separation Based on Wave-U-Net Combined with Discrete Wavelet Transform

  • Tomohiko Nakamura, H. Saruwatari
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
TLDR
It is found that the Wave-U-Net architecture resembles that of multiresolution analysis, and it is revealed that the downsampling (DS) layers of Wave-U-Net cause aliasing and may discard information useful for the separation.
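
The aliasing point can be made concrete: plain decimation in the DS layers drops every other sample, whereas a discrete wavelet transform halves the rate while keeping all information and remains perfectly invertible. A minimal Haar-DWT sketch (illustrative only; the paper builds this idea into the network rather than using these exact functions):

```python
import torch

def haar_dwt(x: torch.Tensor):
    """One-level Haar DWT of (B, C, T) with even T: returns low- and
    high-pass subbands at half rate; together they retain all information."""
    even, odd = x[..., 0::2], x[..., 1::2]
    lo = (even + odd) / 2 ** 0.5   # anti-aliased approximation
    hi = (even - odd) / 2 ** 0.5   # detail kept instead of discarded
    return lo, hi

def haar_idwt(lo: torch.Tensor, hi: torch.Tensor) -> torch.Tensor:
    """Inverse Haar DWT: perfect reconstruction of the original signal."""
    even = (lo + hi) / 2 ** 0.5
    odd = (lo - hi) / 2 ** 0.5
    return torch.stack([even, odd], dim=-1).flatten(-2)  # re-interleave samples

x = torch.randn(1, 2, 16)
lo, hi = haar_dwt(x)
assert torch.allclose(haar_idwt(lo, hi), x, atol=1e-6)
```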

Improved Speech Enhancement with the Wave-U-Net

TLDR
The Wave-U-Net architecture, a model introduced by Stoller et al. for the separation of music vocals and accompaniment, is studied, finding that a reduced number of hidden layers is sufficient for speech enhancement in comparison to the original system designed for singing voice separation in music.

End-to-end music source separation: is it possible in the waveform domain?

TLDR
A Wavenet-based model is proposed, and Wave-U-Net is shown to outperform DeepConvSep, a recent spectrogram-based deep learning model; the results confirm that waveform-based models can perform similarly to (if not better than) spectrogram-based ones.

Speech Enhancement using the Wave-U-Net with Spectral Losses

Speech enhancement and source separation are related tasks that aim to extract and/or improve a signal of interest from a recording that may involve sounds from various sources, reverberation, and/or…
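
As a sketch of what a spectral loss on a time-domain model can look like, the function below computes an L1 distance between STFT magnitudes of the estimate and the reference; the exact losses and parameters in the paper may differ (n_fft and hop below are illustrative):

```python
import torch

def stft_mag_loss(est: torch.Tensor, ref: torch.Tensor,
                  n_fft: int = 1024, hop: int = 256) -> torch.Tensor:
    """L1 distance between STFT magnitudes of (B, T) waveforms, so a
    time-domain model is penalized in the frequency domain."""
    win = torch.hann_window(n_fft, device=est.device)
    spec = lambda x: torch.stft(x, n_fft, hop, window=win, return_complex=True).abs()
    return (spec(est) - spec(ref)).abs().mean()
```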

Time-Domain Audio Source Separation With Neural Networks Based on Multiresolution Analysis

TLDR
Through music source separation experiments including subjective evaluations, the efficacy of the proposed methods and the importance of simultaneously considering both the anti-aliasing filters and the perfect reconstruction property are shown.

Time-Domain Mapping with Convolution Networks for End-to-End Monaural Speech Separation

TLDR
A time-domain mapping-based algorithm which directly estimates clean speech features in an end-to-end system and makes use of an optimal scale-invariant signal-to-distortion ratio (OSI-SDR) loss function.

End-to-end Sound Source Separation Conditioned on Instrument Labels

TLDR
This paper presents an extension of the Wave-U-Net model which allows end-to-end monaural source separation with a non-fixed number of sources, and proposes multiplicative conditioning with instrument labels at the bottleneck of the Wave-U-Net.
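
Multiplicative conditioning at the bottleneck can be sketched as gating the bottleneck feature maps with a learned projection of the instrument-label vector; class and parameter names below are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

class MultiplicativeConditioning(nn.Module):
    """Scale bottleneck features (B, C, T) by a gate derived from an
    instrument label vector (B, num_labels); illustrative sketch."""
    def __init__(self, num_labels: int, channels: int):
        super().__init__()
        self.proj = nn.Linear(num_labels, channels)

    def forward(self, features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.proj(labels)).unsqueeze(-1)  # (B, C, 1)
        return features * gate  # broadcast over the time axis
```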

A Dual-Staged Context Aggregation Method towards Efficient End-to-End Speech Enhancement

  • Kai Zhen, Mi Suk Lee, Minje Kim
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
TLDR
A densely connected convolutional and recurrent network (DCCRN) is proposed: a hybrid architecture that enables dual-staged temporal context aggregation through dense connectivity and a cross-component identical shortcut.
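
One plausible reading of dense connectivity plus a recurrent stage, sketched below: each convolution sees the concatenation of all earlier feature maps, a GRU then aggregates longer-range temporal context, and an identical shortcut wraps the block. This is an illustrative approximation, not the paper's exact DCCRN:

```python
import torch
import torch.nn as nn

class DenseContextBlock(nn.Module):
    """Dense connectivity (each conv consumes all earlier feature maps)
    followed by a GRU for longer-range context, with an identical shortcut."""
    def __init__(self, channels: int, layers: int = 3):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(channels * (i + 1), channels, kernel_size=3, padding=1)
            for i in range(layers)
        )
        self.gru = nn.GRU(channels, channels, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (B, C, T)
        feats = [x]
        for conv in self.convs:
            feats.append(torch.relu(conv(torch.cat(feats, dim=1))))
        h, _ = self.gru(feats[-1].transpose(1, 2))             # (B, T, C)
        return h.transpose(1, 2) + x                           # identical shortcut
```
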
...

References

Showing 1-10 of 27 references

End-To-End Source Separation With Adaptive Front-Ends

TLDR
An auto-encoder neural network is developed that can act as an equivalent to short-time front-end transforms, demonstrating the ability of the network to learn optimal, real-valued basis functions directly from the raw waveform of a signal.
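
The adaptive front-end idea, a learned analysis transform standing in for the STFT with a matching synthesis transform, can be sketched with a strided convolution and a transposed convolution; the hyper-parameters below are illustrative:

```python
import torch
import torch.nn as nn

class AdaptiveFrontEnd(nn.Module):
    """Learned analysis/synthesis pair replacing a fixed STFT: a strided
    Conv1d encodes the waveform into a real-valued basis, and a
    ConvTranspose1d resynthesizes audio from (possibly modified) codes."""
    def __init__(self, basis: int = 256, win: int = 512, hop: int = 128):
        super().__init__()
        self.analysis = nn.Conv1d(1, basis, win, stride=hop, bias=False)
        self.synthesis = nn.ConvTranspose1d(basis, 1, win, stride=hop, bias=False)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:      # (B, 1, T)
        codes = torch.relu(self.analysis(wav))  # non-negative, spectrogram-like
        return self.synthesis(codes)            # reconstructed waveform
```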

TaSNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation

  • Yi Luo, N. Mesgarani
  • Computer Science
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
TLDR
Time-domain Audio Separation Network (TasNet) is proposed, which outperforms the current state-of-the-art causal and noncausal speech separation algorithms, reduces the computational cost of speech separation, and significantly reduces the minimum required latency of the output.

Multichannel Audio Source Separation With Deep Neural Networks

TLDR
This article proposes a framework where deep neural networks are used to model the source spectra and combined with the classical multichannel Gaussian model to exploit the spatial information and presents its application to a speech enhancement problem.

A Wavenet for Speech Denoising

TLDR
The proposed model adaptation retains Wavenet's powerful acoustic modeling capabilities, while significantly reducing its time complexity by eliminating its autoregressive nature.
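
The non-autoregressive adaptation can be sketched as a stack of exponentially dilated, non-causal 1-D convolutions that maps the noisy waveform to a clean estimate in a single forward pass; channel counts and depth below are illustrative:

```python
import torch
import torch.nn as nn

class DilatedDenoiser(nn.Module):
    """Non-autoregressive WaveNet-style denoiser sketch: exponentially
    dilated convolutions grow the receptive field, and one forward pass
    predicts the whole clean waveform instead of one sample at a time."""
    def __init__(self, channels: int = 32, blocks: int = 8):
        super().__init__()
        layers = [nn.Conv1d(1, channels, 3, padding=1)]
        for i in range(blocks):
            d = 2 ** i
            layers += [nn.ReLU(), nn.Conv1d(channels, channels, 3, padding=d, dilation=d)]
        layers += [nn.ReLU(), nn.Conv1d(channels, 1, 1)]
        self.net = nn.Sequential(*layers)

    def forward(self, noisy: torch.Tensor) -> torch.Tensor:   # (B, 1, T)
        return self.net(noisy)  # same-length clean estimate
```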

Raw Multi-Channel Audio Source Separation using Multi-Resolution Convolutional Auto-Encoders

TLDR
This work introduces a novel multi-channel, multi-resolution convolutional auto-encoder neural network that works on raw time-domain signals to determine appropriate multi-resolution features for separating the singing voice from stereo music.

Scalable audio separation with light Kernel Additive Modelling

TLDR
It is shown how KAM can be combined with a fast compression algorithm of its parameters to address the scalability issue, thus enabling its use on small platforms or mobile devices.

Deep clustering and conventional networks for music separation: Stronger together

TLDR
It is shown that deep clustering outperforms conventional networks on a singing voice separation task, in both matched and mismatched conditions, even though conventional networks have the advantage of end-to-end training for best signal approximation.
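
Deep clustering learns a unit-norm embedding per time-frequency bin and trains with the affinity objective ||VVᵀ − YYᵀ||²_F, so bins dominated by the same source cluster together; below is a sketch of the standard efficient expansion of that loss (shapes are illustrative):

```python
import torch

def deep_clustering_loss(V: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """V: (B, N, D) unit-norm embeddings for N = T*F bins;
    Y: (B, N, S) one-hot source assignments.
    Efficient expansion of ||V V^T - Y Y^T||_F^2 that avoids the N x N matrix."""
    vv = torch.einsum("bnd,bne->bde", V, V)   # V^T V, (B, D, D)
    vy = torch.einsum("bnd,bns->bds", V, Y)   # V^T Y, (B, D, S)
    yy = torch.einsum("bns,bnt->bst", Y, Y)   # Y^T Y, (B, S, S)
    return (vv ** 2).sum() - 2 * (vy ** 2).sum() + (yy ** 2).sum()
```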

Singing Voice Separation with Deep U-Net Convolutional Networks

TLDR
This work proposes a novel application of the U-Net architecture — initially developed for medical imaging — for the task of source separation, given its proven capacity for recreating the fine, low-level detail required for high-quality audio reproduction.
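
Spectrogram U-Nets in this line of work predict a soft magnitude mask and reuse the mixture phase for resynthesis; the masking-and-reconstruction step can be sketched as follows (STFT parameters are illustrative, and the U-Net producing the mask is omitted):

```python
import torch

def mask_and_reconstruct(mix_wav: torch.Tensor, mask: torch.Tensor,
                         n_fft: int = 1024, hop: int = 256) -> torch.Tensor:
    """Apply a soft mask (B, F, frames) in [0, 1] to the mixture magnitude,
    keep the mixture phase, and invert back to a waveform."""
    win = torch.hann_window(n_fft, device=mix_wav.device)
    spec = torch.stft(mix_wav, n_fft, hop, window=win, return_complex=True)
    masked = torch.polar(mask * spec.abs(), spec.angle())  # masked magnitude, mixture phase
    return torch.istft(masked, n_fft, hop, window=win, length=mix_wav.shape[-1])
```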

WaveNet: A Generative Model for Raw Audio

TLDR
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.

SEGAN: Speech Enhancement Generative Adversarial Network

TLDR
This work proposes the use of generative adversarial networks for speech enhancement, operating at the waveform level, training the model end-to-end, and incorporating 28 speakers and 40 different noise conditions into the same model, such that model parameters are shared across them.
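
The adversarial objective can be sketched with least-squares GAN losses over discriminator scores for real (noisy, clean) and enhanced (noisy, generated) pairs, in the LSGAN style that SEGAN builds on; the generator and discriminator networks themselves are omitted, and this is an illustrative reading rather than the paper's exact training code:

```python
import torch

def lsgan_losses(d_real: torch.Tensor, d_fake: torch.Tensor):
    """Least-squares GAN objectives: the discriminator pushes scores for
    real (noisy, clean) pairs toward 1 and for enhanced pairs toward 0;
    the generator pushes its enhanced pairs toward 1."""
    d_loss = ((d_real - 1) ** 2).mean() + (d_fake ** 2).mean()
    g_loss = ((d_fake - 1) ** 2).mean()
    return d_loss, g_loss
```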