Depthwise Separable Convolutions Versus Recurrent Neural Networks for Monaural Singing Voice Separation

  • Pyry Pyykkönen, Stylianos I. Mimilakis, Konstantinos Drossos, Tuomas Virtanen
  • 2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP)
Recent approaches for music source separation are almost exclusively based on deep neural networks, mostly employing recurrent neural networks (RNNs). Although RNNs are in many cases superior to other types of deep neural networks for sequence processing, they are known to be difficult to train and parallelize, especially for the typically long sequences encountered in music source separation. In this paper we present a use-case of replacing RNNs with depth-wise separable…
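To illustrate why depth-wise separable convolutions are attractive as a lightweight replacement, the sketch below compares the weight count of a standard 2D convolution with that of a depth-wise separable one (a per-channel depth-wise filter followed by a 1x1 point-wise convolution). The function names and the channel and kernel sizes are hypothetical choices for illustration; they are not the configuration used in the paper.

```python
def standard_conv_params(c_in, c_out, k_h, k_w):
    """Weight count of a standard 2D convolution (biases omitted)."""
    return c_in * c_out * k_h * k_w


def depthwise_separable_params(c_in, c_out, k_h, k_w):
    """Weight count of a depth-wise separable 2D convolution (biases omitted).

    Depth-wise step: one k_h x k_w filter per input channel.
    Point-wise step: a 1x1 convolution mixing c_in channels into c_out.
    """
    depthwise = c_in * k_h * k_w
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
c_in, c_out, k = 64, 64, 5

std = standard_conv_params(c_in, c_out, k, k)        # 64 * 64 * 25 = 102400
sep = depthwise_separable_params(c_in, c_out, k, k)  # 64 * 25 + 64 * 64 = 5696

print(f"standard: {std}, separable: {sep}, ratio: {sep / std:.4f}")
```

For these assumed sizes the separable layer needs roughly 1/c_out + 1/(k_h * k_w) of the standard layer's weights, which is where the parameter savings come from.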
A Throughput-Optimized Channel-Oriented Processing Element Array for Convolutional Neural Networks
  • Yu-Xian Chen, S. Ruan
  • Computer Science
  • IEEE Transactions on Circuits and Systems II: Express Briefs
  • 2021
A throughput-optimized PE array for CNNs based on the channel-oriented data pattern is proposed, which improves throughput density on both AlexNet and VGG-16.


A recurrent encoder-decoder approach with skip-filtering connections for monaural singing voice separation
The results from an objective evaluation show that the proposed method provides comparable results to deep learning based methods which operate over complicated signal representations, as compared to previous methods that approximate time-frequency masks.
Examining the Mapping Functions of Denoising Autoencoders in Singing Voice Separation
Experimental findings show that approaches based on the DAE model learn scalar filtering operators, exhibiting a predominant diagonal structure in their corresponding mapping functions, limiting the exploitation of inter-frequency structure of music data.
MaD TwinNet: Masker-Denoiser Architecture with Twin Networks for Monaural Sound Source Separation
This work builds upon the recently proposed Masker-Denoiser (MaD) architecture and enhances it with the Twin Networks, a technique to regularize a recurrent generative network using a backward running copy of the network.
Sound Event Detection with Depthwise Separable and Dilated Convolutions
The proposed method is compared to a baseline convolutional neural network on a SED task, and achieves a reduction of the number of parameters by 85% and of the average training time per epoch by 78%, together with an increase in the average frame-wise F1 score and a reduction of the average error rate by 4.6% and 3.8%, respectively.
Monaural Singing Voice Separation with Skip-Filtering Connections and Recurrent Inference of Time-Frequency Mask
A recurrent inference algorithm, a sparse transformation step to improve the mask generation process, and a learned denoising filter are introduced; the resulting method learns and optimizes a source-dependent mask and does not need a post-processing step.
Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation
  • Yi Luo, N. Mesgarani
  • Computer Science, Medicine
  • IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2019
A fully convolutional time-domain audio separation network (Conv-TasNet) is proposed, a deep learning framework for end-to-end time-domain speech separation that significantly outperforms previous time–frequency masking methods in separating two- and three-speaker mixtures.
Towards Real-Time Single-Channel Singing-Voice Separation with Pruned Multi-Scaled Densenets
The multi-scaled DenseNet is extended in several aspects to facilitate real-time source separation scenarios, significantly reducing the model size and increasing the computational efficiency by factors of 1.6 and 4.3, respectively, while maintaining the separation performance.
Multi-Scale multi-band densenets for audio source separation
A novel network architecture is proposed that extends the recently developed densely connected convolutional network (DenseNet), takes advantage of long contextual information, and outperforms state-of-the-art results on the SiSEC 2016 competition by a large margin in terms of signal-to-distortion ratio.
Music Source Separation in the Waveform Domain
Demucs is proposed, a new waveform-to-waveform model, which has an architecture closer to models for audio generation with more capacity on the decoder, and human evaluations show that Demucs has significantly higher quality than Conv-Tasnet, but slightly more contamination from other sources, which explains the difference in SDR.
Open-Unmix - A Reference Implementation for Music Source Separation
Open-Unmix provides implementations for the most popular deep learning frameworks, giving researchers a flexible way to reproduce results and provides a pre-trained model for end users and even artists to try and use source separation.