End-to-end music source separation: is it possible in the waveform domain?

@inproceedings{Llus2019EndtoendMS,
  title={End-to-end music source separation: is it possible in the waveform domain?},
  author={Francesc Llu{\'i}s and Jordi Pons and Xavier Serra},
  booktitle={INTERSPEECH},
  year={2019}
}
Most of the currently successful source separation techniques use the magnitude spectrogram as input, and therefore omit part of the signal by default: the phase. To avoid discarding potentially useful information, we study the viability of end-to-end models for music source separation --- models that take into account all the information available in the raw audio signal, including the phase. Although during the last decades end-to-end music source separation has been considered almost…
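As a concrete illustration of what operating "in the waveform domain" means, here is a minimal PyTorch sketch of a waveform-to-waveform separator: the network maps raw mixture samples directly to raw source samples, so phase information is never discarded. This is not the paper's architecture; the model, layer sizes, and loss below are illustrative assumptions.

```python
# Minimal sketch (NOT the paper's architecture) of end-to-end waveform separation:
# a 1-D convolutional encoder/decoder mapping a raw mixture to a source estimate,
# so both magnitude and phase information pass through the model.
import torch
import torch.nn as nn

class TinyWaveformSeparator(nn.Module):
    def __init__(self, channels: int = 32):  # channel count is an assumption
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=15, padding=7), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=15, padding=7), nn.ReLU(),
        )
        self.decoder = nn.Conv1d(channels, 1, kernel_size=15, padding=7)

    def forward(self, mixture: torch.Tensor) -> torch.Tensor:
        # mixture: (batch, 1, samples); output has the same shape.
        return self.decoder(self.encoder(mixture))

model = TinyWaveformSeparator()
mixture = torch.randn(4, 1, 16384)   # four short raw-audio excerpts
estimate = model(mixture)            # trained end-to-end, e.g. with an L1 loss
loss = torch.nn.functional.l1_loss(estimate, torch.randn_like(estimate))
```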

Citations

End-to-end Sound Source Separation Conditioned on Instrument Labels
TLDR
This paper presents an extension of the Wave-U-Net model which allows end-to-end monaural source separation with a non-fixed number of sources, and proposes multiplicative conditioning with instrument labels at the bottleneck of the Wave-U-Net.
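A hedged sketch of what multiplicative conditioning at a bottleneck can look like, in the spirit of the summary above; the embedding-based gain, the label ids, and all dimensions are assumptions, not the paper's exact mechanism.

```python
# Sketch of multiplicative label conditioning at a bottleneck (details assumed).
import torch
import torch.nn as nn

class ConditionedBottleneck(nn.Module):
    def __init__(self, num_instruments: int = 4, channels: int = 64):
        super().__init__()
        # One learned gain vector per instrument label (an assumption).
        self.scale = nn.Embedding(num_instruments, channels)

    def forward(self, bottleneck: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
        # bottleneck: (batch, channels, frames); label: (batch,) integer ids.
        gamma = self.scale(label).unsqueeze(-1)  # (batch, channels, 1)
        return bottleneck * gamma                # multiplicative conditioning

feats = torch.randn(2, 64, 128)
labels = torch.tensor([0, 3])                    # hypothetical instrument ids
conditioned = ConditionedBottleneck()(feats, labels)
```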
Music Source Separation in the Waveform Domain
TLDR
Demucs is proposed, a new waveform-to-waveform model with an architecture closer to models for audio generation and more capacity in the decoder; human evaluations show that Demucs has significantly higher quality than Conv-TasNet, but slightly more contamination from other sources, which explains the difference in SDR.
Decoupling Magnitude and Phase Estimation with Deep ResUNet for Music Source Separation
TLDR
This work proposes to estimate phases by estimating complex ideal ratio masks (cIRMs), decoupling cIRM estimation into separate magnitude and phase estimates, and extends the separation method so that the magnitude of the mask can effectively be larger than 1.
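The decoupling described above can be sketched as follows: one head predicts the mask magnitude with a nonlinearity that is unbounded above (so it may exceed 1), another predicts a unit-normalized phase, and the two are recombined into a complex mask. The parameterization here (softplus magnitude, real/imaginary phase head) is an assumption for illustration.

```python
# Sketch of decoupled complex-ratio-mask application; parameterization assumed.
import torch

def apply_decoupled_cirm(mix_stft: torch.Tensor,
                         mag_logits: torch.Tensor,
                         phase_re: torch.Tensor,
                         phase_im: torch.Tensor) -> torch.Tensor:
    # mix_stft: complex STFT of the mixture, (batch, freq, frames).
    mask_mag = torch.nn.functional.softplus(mag_logits)   # >= 0, may exceed 1
    norm = torch.sqrt(phase_re**2 + phase_im**2) + 1e-8   # unit-normalize phase
    mask_phase = torch.complex(phase_re / norm, phase_im / norm)
    return mix_stft * mask_mag * mask_phase               # complex ratio mask

mix = torch.randn(1, 513, 100, dtype=torch.complex64)
est = apply_decoupled_cirm(mix, torch.randn(1, 513, 100),
                           torch.randn(1, 513, 100), torch.randn(1, 513, 100))
```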
Time-Domain Audio Source Separation Based on Wave-U-Net Combined with Discrete Wavelet Transform
  • Tomohiko Nakamura, H. Saruwatari
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
TLDR
It is found that this architecture resembles that of multi-resolution analysis, and it is revealed that the downsampling (DS) layers of Wave-U-Net cause aliasing and may discard information useful for the separation.
Hybrid Spectrogram and Waveform Source Separation
TLDR
This work shows how to perform end-to-end hybrid source separation, letting the model decide which domain is best suited for each source, and even combining both.
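A rough sketch of the hybrid idea: one branch produces an estimate directly in the waveform domain, another masks the mixture STFT and inverts it, and the two estimates are summed so the model can rely on whichever domain suits a given source. Both branch modules below are trivial stand-ins, not the actual hybrid architecture.

```python
# Sketch of hybrid waveform + spectrogram separation; branches are stand-ins.
import torch
import torch.nn as nn

class HybridSeparator(nn.Module):
    def __init__(self, n_fft: int = 1024):
        super().__init__()
        self.n_fft = n_fft
        self.wave_branch = nn.Conv1d(1, 1, kernel_size=15, padding=7)  # stand-in
        self.spec_mask = nn.Conv2d(2, 2, kernel_size=3, padding=1)     # stand-in

    def forward(self, mix: torch.Tensor) -> torch.Tensor:
        # Waveform-domain estimate.
        wave_est = self.wave_branch(mix)
        # Spectrogram-domain estimate: transform, mask real/imag parts, invert.
        window = torch.hann_window(self.n_fft, device=mix.device)
        spec = torch.stft(mix.squeeze(1), self.n_fft,
                          window=window, return_complex=True)
        ri = torch.stack([spec.real, spec.imag], dim=1)      # (B, 2, F, T)
        ri = self.spec_mask(ri)
        masked = torch.complex(ri[:, 0], ri[:, 1])
        spec_est = torch.istft(masked, self.n_fft,
                               window=window, length=mix.shape[-1])
        return wave_est + spec_est.unsqueeze(1)              # combine both domains

est = HybridSeparator()(torch.randn(2, 1, 16384))
```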
Musical source separation
TLDR
This work studies how to extend the number of instrument categories, concluding that the electric guitar is also feasible to separate, and tries to adapt models trained on studio music to live music separation, finding that models trained on clean data also provide the best performance on live music.
MDCNN-SID: Multi-scale Dilated Convolution Network for Singer Identification
TLDR
This paper proposes an end-to-end architecture that addresses the problem of wave embedding in the waveform domain and achieves comparable performance on the Artist20 benchmark dataset, significantly improving on related work.
Demucs: Deep Extractor for Music Sources with extra unlabeled data remixed
TLDR
A simple convolutional and recurrent model is introduced that outperforms the state-of-the-art model on waveforms, Wave-U-Net, by 1.6 points of SDR (signal-to-distortion ratio), and a new scheme to leverage unlabeled music is proposed.
SpaIn-Net: Spatially-Informed Stereophonic Music Source Separation
TLDR
This work introduces a control method based on the stereophonic location of the sources of interest, expressed as the panning angle, presents various conditioning mechanisms, including the use of the raw angle and feature representations derived from it, and shows that spatial information helps.
Unsupervised Audio Source Separation using Generative Priors
TLDR
This work proposes a novel approach for audio source separation based on generative priors trained on individual sources, which simultaneously searches the source-specific latent spaces to recover the constituent sources using projected gradient descent optimization.
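Because the summary above describes an optimization-time procedure rather than a feed-forward separator, a sketch is easy to state: freeze one pretrained generative prior (decoder) per source and search their latent spaces by gradient descent so that the decoded sources sum to the mixture. The stand-in linear decoders, step counts, and simple box projection below are assumptions; the paper uses projected gradient descent with its own priors and constraints.

```python
# Sketch of source separation via generative priors and projected gradient descent.
import torch
import torch.nn as nn

def separate_with_priors(mixture, decoders, steps=200, lr=1e-2, radius=3.0):
    # One latent code per source; decoders are frozen pretrained priors.
    latents = [torch.zeros(1, 64, requires_grad=True) for _ in decoders]
    opt = torch.optim.Adam(latents, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        estimates = [dec(z) for dec, z in zip(decoders, latents)]
        loss = torch.nn.functional.mse_loss(sum(estimates), mixture)
        loss.backward()
        opt.step()
        with torch.no_grad():            # projection step: a simple box constraint,
            for z in latents:            # a stand-in for the paper's projection
                z.clamp_(-radius, radius)
    return [dec(z).detach() for dec, z in zip(decoders, latents)]

decoders = [nn.Linear(64, 16384), nn.Linear(64, 16384)]  # stand-in priors
sources = separate_with_priors(torch.randn(1, 16384), decoders)
```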
...

References

Showing 1-10 of 35 references
End-To-End Source Separation With Adaptive Front-Ends
TLDR
An auto-encoder neural network is developed that can act as an equivalent to short-time front-end transforms, demonstrating the network's ability to learn optimal, real-valued basis functions directly from the raw waveform of a signal.
Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation
TLDR
The Wave-U-Net is proposed, an adaptation of the U-Net to the one-dimensional time domain, which repeatedly resamples feature maps to compute and combine features at different time scales; results indicate that its architecture yields performance comparable to a state-of-the-art spectrogram-based U-Net architecture, given the same data.
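One down/up level of the Wave-U-Net idea, sketched with assumed layer sizes: decimate the feature maps, process them at the coarser time scale, linearly upsample, and concatenate the full-resolution skip connection before the output convolution.

```python
# Single-level sketch of Wave-U-Net-style multi-scale processing (sizes assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneLevelWaveUNet(nn.Module):
    def __init__(self, ch: int = 24):
        super().__init__()
        self.down = nn.Conv1d(1, ch, kernel_size=15, padding=7)
        self.bottleneck = nn.Conv1d(ch, ch, kernel_size=15, padding=7)
        self.up = nn.Conv1d(2 * ch, 1, kernel_size=5, padding=2)

    def forward(self, x):
        skip = torch.relu(self.down(x))        # features at full resolution
        coarse = skip[:, :, ::2]               # decimate by 2 (Wave-U-Net style)
        coarse = torch.relu(self.bottleneck(coarse))
        up = F.interpolate(coarse, size=skip.shape[-1], mode='linear',
                           align_corners=False)  # linear upsampling
        return self.up(torch.cat([up, skip], dim=1))  # combine scales via skip

est = OneLevelWaveUNet()(torch.randn(2, 1, 16384))
```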
End-to-end Networks for Supervised Single-channel Speech Separation
TLDR
An end-to-end source separation network that estimates the separated speech waveform by operating directly on the raw waveform of the mixture, investigating composite cost functions derived from objective evaluation metrics as measured on waveforms.
TasNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation
  • Yi Luo, N. Mesgarani
  • Computer Science
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
TLDR
Time-domain Audio Separation Network (TasNet) is proposed, which outperforms the current state-of-the-art causal and noncausal speech separation algorithms, reduces the computational cost of speech separation, and significantly reduces the minimum required latency of the output.
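A minimal sketch of the TasNet idea as summarized above: encode the waveform with a learned basis (a strided 1-D convolution), estimate a nonnegative mask over the basis coefficients, apply it, and decode back to audio with a transposed convolution. All sizes and the one-layer mask network are simplifying assumptions.

```python
# Sketch of TasNet-style masking in a learned latent domain (sizes assumed).
import torch
import torch.nn as nn

class TinyTasNet(nn.Module):
    def __init__(self, basis: int = 256, win: int = 32):
        super().__init__()
        stride = win // 2
        self.encoder = nn.Conv1d(1, basis, win, stride=stride, bias=False)
        self.masker = nn.Sequential(nn.Conv1d(basis, basis, 1), nn.Sigmoid())
        self.decoder = nn.ConvTranspose1d(basis, 1, win, stride=stride, bias=False)

    def forward(self, mix):
        w = torch.relu(self.encoder(mix))  # mixture weights in the learned basis
        mask = self.masker(w)              # per-basis separation mask in [0, 1]
        return self.decoder(w * mask)      # masked weights back to the time domain

est = TinyTasNet()(torch.randn(2, 1, 16000))
```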
A Wavenet for Speech Denoising
TLDR
The proposed model adaptation retains Wavenet's powerful acoustic modeling capabilities, while significantly reducing its time-complexity by eliminating its autoregressive nature.
End-to-end Learning for Music Audio Tagging at Scale
TLDR
This work studies music audio tagging when datasets of variable size are available for training, finding that waveform-based models outperform spectrogram-based ones in large-scale data scenarios and suggesting that music domain assumptions are relevant when not enough training data are available.
Does Phase Matter For Monaural Source Separation?
TLDR
The results demonstrate that preserving phase information reduces artifacts in the separated tracks, as quantified by the signal-to-artifact ratio (GSAR), and that the proposed method achieves state-of-the-art performance for source separation.
Raw Multi-Channel Audio Source Separation using Multi- Resolution Convolutional Auto-Encoders
TLDR
This work introduces a novel multi-channel, multi-resolution convolutional auto-encoder neural network that works on raw time-domain signals to determine appropriate multi-resolution features for separating the singing voice from stereo music.
Singing Voice Separation with Deep U-Net Convolutional Networks
TLDR
This work proposes a novel application of the U-Net architecture — initially developed for medical imaging — for the task of source separation, given its proven capacity for recreating the fine, low-level detail required for high-quality audio reproduction.
Singing-Voice Separation from Monaural Recordings using Deep Recurrent Neural Networks
TLDR
This paper explores using deep recurrent neural networks for singing voice separation from monaural recordings in a supervised setting and proposes jointly optimizing the networks for multiple source signals by including the separation step as a nonlinear operation in the last layer.
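The "separation step as a nonlinear operation in the last layer" can be sketched as a soft time-frequency masking layer on top of a recurrent network: the RNN predicts magnitudes for all sources jointly, and the final layer normalizes them so the source estimates partition the mixture magnitude. Dimensions and the GRU are illustrative assumptions.

```python
# Sketch of joint multi-source estimation with a soft-mask output layer.
import torch
import torch.nn as nn

class JointMaskRNN(nn.Module):
    def __init__(self, n_freq: int = 513, n_sources: int = 2, hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRU(n_freq, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_freq * n_sources)
        self.n_freq, self.n_sources = n_freq, n_sources

    def forward(self, mix_mag):
        # mix_mag: (batch, frames, freq) mixture magnitude spectrogram.
        h, _ = self.rnn(mix_mag)
        y = torch.relu(self.head(h)).view(*mix_mag.shape[:2],
                                          self.n_sources, self.n_freq)
        masks = y / (y.sum(dim=2, keepdim=True) + 1e-8)  # soft masks sum to 1
        return masks * mix_mag.unsqueeze(2)              # per-source magnitudes

ests = JointMaskRNN()(torch.rand(2, 100, 513))           # (2, 100, 2, 513)
```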
...