Corpus ID: 245005743

CWS-PResUNet: Music Source Separation with Channel-wise Subband Phase-aware ResUNet

@article{Liu2021CWSPResUNetMS,
  title={CWS-PResUNet: Music Source Separation with Channel-wise Subband Phase-aware ResUNet},
  author={Haohe Liu and Qiuqiang Kong and Jiafeng Liu},
  journal={ArXiv},
  year={2021},
  volume={abs/2112.04685}
}
Music source separation (MSS) has shown active progress with deep learning models in recent years. Many MSS models perform separation on spectrograms by estimating bounded ratio masks and reusing the phases of the mixture. When using convolutional neural networks (CNNs), weights are usually shared across the entire spectrogram during convolution, regardless of the different patterns between frequency bands. In this study, we propose a new MSS model, channel-wise subband phase-aware ResUNet (CWS-PResUNet), to…
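
A minimal sketch of the masking recipe this abstract describes, assuming PyTorch: estimate a bounded ratio mask on the magnitude spectrogram and reuse the mixture phase when resynthesizing. The `mask_model` callable, FFT size, and hop length are placeholders, not the CWS-PResUNet architecture.

```python
import torch

def masked_separation(mixture, mask_model, n_fft=2048, hop=512):
    """Separate one source from a mixture by masking the magnitude
    spectrogram and reusing the mixture phase. `mask_model` is a
    placeholder network mapping a magnitude spectrogram to a mask in [0, 1]."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(mixture, n_fft, hop_length=hop, window=window,
                      return_complex=True)               # complex STFT
    mag, phase = spec.abs(), spec.angle()
    mask = mask_model(mag).clamp(0.0, 1.0)               # bounded ratio mask
    est_spec = mask * mag * torch.exp(1j * phase)        # mixture phase reused
    return torch.istft(est_spec, n_fft, hop_length=hop, window=window,
                       length=mixture.shape[-1])
```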

Citations

Music Source Separation with Band-split RNN

BSRNN is proposed, a frequency-domain model that explicitly splits the spectrogram of the mixture into subbands and performs interleaved band-level and sequence-level modeling; a semi-supervised model fine-tuning pipeline that can further improve the performance of the model is also described.
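
A minimal sketch of the band-splitting step, assuming PyTorch; BSRNN's actual band layout and its interleaved band-level and sequence-level RNN stacks are omitted, and the `band_edges` values are illustrative.

```python
import torch

def split_into_subbands(spec, band_edges):
    """Explicitly split a complex spectrogram of shape (freq, time) into
    subbands along the frequency axis, the first step of band-level
    modeling; `band_edges` are bin indices, e.g. [0, 64, 192, spec.shape[0]]."""
    return [spec[lo:hi] for lo, hi in zip(band_edges[:-1], band_edges[1:])]
```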

Non-intrusive Speech Quality Assessment with a Multi-Task Learning based Subband Adaptive Attention Temporal Convolutional Neural Network

A new multi-task learning based model, termed the subband adaptive attention temporal convolutional neural network (SAA-TCN), is proposed to perform non-intrusive speech quality assessment with the help of a MOS value interval detector (VID) auxiliary task.
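
A sketch of the multi-task setup named above, assuming PyTorch: a shared encoder feeds a MOS regression head and the auxiliary VID classification head. Layer types and sizes are placeholders, not the SAA-TCN architecture.

```python
import torch.nn as nn

class MOSWithVID(nn.Module):
    """Shared encoder with two heads: a MOS regressor and a MOS-interval
    detector (VID) classifier trained as an auxiliary task."""
    def __init__(self, feat_dim=128, n_intervals=5):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, 64, batch_first=True)
        self.mos_head = nn.Linear(64, 1)             # predicts the MOS value
        self.vid_head = nn.Linear(64, n_intervals)   # predicts its interval

    def forward(self, x):                            # x: (batch, time, feat_dim)
        h, _ = self.encoder(x)
        pooled = h.mean(dim=1)                       # average over time
        return self.mos_head(pooled), self.vid_head(pooled)
```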

Music Separation Enhancement with Generative Modeling

A post-processing generative model (the Make it Sound Good (MSG) post-processor) is proposed to enhance the output of music source separation systems and it is demonstrated that human listeners prefer source estimates of bass and drums that have been post-processed by MSG.

Audio Visual Multi-Speaker Tracking with Improved GCF and PMBM Filter

An adaptive audio measurement likelihood is proposed for audio-visual multi-speaker tracking with a Poisson multi-Bernoulli mixture (PMBM) filter, together with a phase-aware VoiceFilter and a separation-before-localization method that enables the audio mixture to be separated into individual speech sources while retaining their phases.

VoiceFixer: A Unified Framework for High-Fidelity Speech Restoration

Both objective and subjective evaluations show that VoiceFixer, whose synthesis stage generates waveforms using a neural vocoder, is effective on severely degraded speech such as real-world historical speech recordings.

Separate What You Describe: Language-Queried Audio Source Separation

This paper proposes LASS-Net, an end-to-end neural network trained to jointly process acoustic and linguistic information and to separate from an audio mixture the target source that is consistent with the language query.
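
One way such joint processing can be wired up, assuming PyTorch: the query embedding modulates audio feature maps with a learned scale and shift. This feature-wise modulation is an illustrative stand-in, not necessarily LASS-Net's actual fusion mechanism, and all dimensions are assumptions.

```python
import torch.nn as nn

class QueryConditioning(nn.Module):
    """Fuse a language-query embedding with audio features via a learned
    per-channel scale and shift (a hypothetical fusion module)."""
    def __init__(self, text_dim=768, audio_channels=64):
        super().__init__()
        self.scale = nn.Linear(text_dim, audio_channels)
        self.shift = nn.Linear(text_dim, audio_channels)

    def forward(self, audio_feats, text_emb):
        # audio_feats: (B, C, F, T); text_emb: (B, text_dim)
        g = self.scale(text_emb)[:, :, None, None]
        b = self.shift(text_emb)[:, :, None, None]
        return g * audio_feats + b
```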

An Efficient Short-Time Discrete Cosine Transform and Attentive MultiResUNet Framework for Music Source Separation

A novel Attentive MultiResUNet architecture is proposed that uses real-valued short-time discrete cosine transform (STDCT) data as input; the STDCT is used for the first time in source separation, and the network is more computationally efficient than state-of-the-art separation networks.
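
A sketch of a short-time DCT front end using NumPy and SciPy; the frame and hop sizes are illustrative, not the paper's settings.

```python
import numpy as np
from scipy.fft import dct

def stdct(x, frame=1024, hop=512):
    """Real-valued short-time DCT: window each frame with a Hann window and
    take a type-II DCT, yielding a real spectrogram-like representation."""
    win = np.hanning(frame)
    n_frames = 1 + (len(x) - frame) // hop
    frames = np.stack([x[i * hop:i * hop + frame] * win
                       for i in range(n_frames)])
    return dct(frames, type=2, axis=-1, norm="ortho").T   # (freq, time)
```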

Neural Vocoder is All You Need for Speech Super-resolution

This paper proposes a neural vocoder based speech super-resolution method that can handle a variety of input resolutions and upsampling ratios, and demonstrates that prior knowledge in the pre-trained vocoder is crucial for speech SR by performing mel-bandwidth extension with a simple replication-padding method.
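
A sketch of the replication-padding baseline the summary mentions, assuming PyTorch and a mel spectrogram of shape (bins, time); the function name is illustrative.

```python
import torch

def replicate_pad_mel(mel, n_target_bins):
    """Mel-bandwidth extension by replication padding: copy the highest
    observed mel band into the missing upper bands before vocoding."""
    n_src = mel.shape[0]
    pad = mel[-1:].expand(n_target_bins - n_src, -1)   # repeat top band
    return torch.cat([mel, pad], dim=0)
```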

References


Channel-wise Subband Input for Better Voice and Accompaniment Separation on High Resolution Music

A new input format, channel-wise subband input (CWS), is proposed for convolutional neural network (CNN) based music source separation (MSS) models in the frequency domain; it enables effective weight sharing within each subband and introduces more flexibility between channels.
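
A sketch of the channel layout this input format implies, assuming PyTorch: frequency subbands are folded into the channel axis so convolution weights are shared within a subband rather than across the whole spectrogram. The plain reshape here is a simplification of the paper's subband analysis.

```python
import torch

def channel_wise_subband(spec, n_subbands=4):
    """Fold frequency subbands of a (channels, freq, time) spectrogram
    into the channel axis, giving (channels * n_subbands, freq / n_subbands,
    time) so each subband gets its own convolution weights."""
    c, f, t = spec.shape
    assert f % n_subbands == 0, "frequency bins must divide evenly"
    sub = spec.reshape(c, n_subbands, f // n_subbands, t)
    return sub.reshape(c * n_subbands, f // n_subbands, t)
```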

Decoupling Magnitude and Phase Estimation with Deep ResUNet for Music Source Separation

This work proposes to estimate phases by estimating complex ideal ratio masks (cIRMs), decoupling the estimation of cIRMs into separate magnitude and phase estimations, and extends the separation method to effectively allow the magnitude of the mask to be larger than 1.
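
A sketch of applying such a decoupled mask, assuming PyTorch: the magnitude component is left unbounded, while the real/imaginary pair is normalized to a unit phasor so it contributes only phase. Tensor names are illustrative.

```python
import torch

def apply_decoupled_cirm(mix_spec, mask_mag, mask_re, mask_im, eps=1e-8):
    """Apply a complex ratio mask whose magnitude and phase are estimated
    separately: `mask_mag` may exceed 1, while (mask_re, mask_im) are
    normalized so they carry only the phase."""
    phasor = torch.complex(mask_re, mask_im)
    phasor = phasor / (phasor.abs() + eps)   # keep only the phase
    return mix_spec * mask_mag * phasor
```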

Demucs: Deep Extractor for Music Sources with extra unlabeled data remixed

A simple convolutional and recurrent model is introduced that outperforms the state-of-the-art model on waveforms, that is, Wave-U-Net, by 1.6 points of SDR (signal to distortion ratio) and a new scheme to leverage unlabeled music is proposed.
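
For reference, the SDR metric quoted above is the ratio of reference energy to residual energy in decibels; a NumPy version, with a small epsilon added for numerical safety:

```python
import numpy as np

def sdr(reference, estimate, eps=1e-8):
    """Signal-to-distortion ratio in dB: 10 * log10 of reference energy
    over the energy of the estimation residual."""
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10((num + eps) / (den + eps))
```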

Voice and accompaniment separation in music using self-attention convolutional neural network

This work proposes a novel self-attention network to separate voice and accompaniment in music and shows the proposed method leads to 19.5% relative improvement in vocals separation in terms of SDR.

D3Net: Densely connected multidilated DenseNet for music source separation

D3Net involves a novel multi-dilated convolution that has different dilation factors in a single layer to model different resolutions simultaneously, and avoids the aliasing problem that arises when dilated convolutions are naively incorporated into DenseNet.
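
A simplified sketch of a multi-dilated layer, assuming PyTorch: parallel 3x3 convolutions with different dilation factors are summed, so one layer sees several resolutions. D3Net's actual design additionally ties the dilation pattern to the DenseNet skip connections to avoid aliasing, which this version does not reproduce.

```python
import torch.nn as nn

class MultiDilatedConv2d(nn.Module):
    """Several dilation factors inside a single convolutional layer."""
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=d, dilation=d)   # padding=d keeps spatial size
            for d in dilations)

    def forward(self, x):
        return sum(conv(x) for conv in self.convs)
```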

Investigating U-Nets with various Intermediate Blocks for Spectrogram-based Singing Voice Separation

A variety of intermediate spectrogram-transformation blocks is introduced, and U-Nets built from these blocks are implemented and trained on complex-valued spectrograms to consider both magnitude and phase.

All For One And One For All: Improving Music Separation By Bridging Networks

Experimental results show that the performance of Open-Unmix (UMX), a well-known and state-of-the-art open-source library for music separation, can be improved by utilizing a multi-domain loss (MDL) and two combination schemes.
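
A sketch of a multi-domain loss in the spirit of MDL, assuming PyTorch: a time-domain L1 term plus a magnitude-spectrogram L1 term. The paper's exact domains and weighting may differ.

```python
import torch

def multi_domain_loss(est_wave, ref_wave, n_fft=2048):
    """Combine a waveform-domain L1 loss with a magnitude-spectrogram
    L1 loss so the model is supervised in both domains."""
    win = torch.hann_window(n_fft, device=est_wave.device)
    time_l1 = torch.mean(torch.abs(est_wave - ref_wave))
    est_mag = torch.stft(est_wave, n_fft, window=win, return_complex=True).abs()
    ref_mag = torch.stft(ref_wave, n_fft, window=win, return_complex=True).abs()
    freq_l1 = torch.mean(torch.abs(est_mag - ref_mag))
    return time_l1 + freq_l1
```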

Music Demixing Challenge at ISMIR 2021

The Music Demixing (MDX) Challenge is designed on a crowd-based machine learning competition platform where the task is to separate stereo songs into four instrument stems (Vocals, Drums, Bass, Other).

Singing Voice Separation with Deep U-Net Convolutional Networks

This work proposes a novel application of the U-Net architecture — initially developed for medical imaging — for the task of source separation, given its proven capacity for recreating the fine, low-level detail required for high-quality audio reproduction.
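
A two-level sketch of the encoder/decoder-with-skips structure, assuming PyTorch; depths, channel counts, and the sigmoid mask output are illustrative, not the paper's exact configuration. Input F and T are assumed even so the down/up sampling shapes match.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """A minimal U-Net over magnitude spectrograms: one downsampling step,
    one upsampling step, and a skip connection feeding a mask head."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.down = nn.Conv2d(16, 32, 3, stride=2, padding=1)
        self.up = nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1)
        self.dec = nn.Sequential(nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, x):                 # x: (B, 1, F, T) magnitudes
        e = self.enc(x)
        d = self.up(torch.relu(self.down(e)))
        mask = self.dec(torch.cat([e, d], dim=1))   # skip connection
        return mask * x                   # masked magnitude estimate
```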

Multi-task U-Net for Music Source Separation

A multi-task U-Net trained using a weighted multi-task loss is proposed as an alternative to the C-U-Net, and two weighting strategies are investigated: Dynamic Weight Average (DWA) and Energy Based Weighting (EBW).
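
The DWA strategy named above has a standard formulation: tasks whose loss is shrinking more slowly receive larger weights, computed from the ratio of the previous two epochs' per-task losses. A sketch in plain Python:

```python
import math

def dwa_weights(losses_t1, losses_t2, temperature=2.0):
    """Dynamic Weight Average: `losses_t1` and `losses_t2` are per-task
    losses from the previous two epochs; the returned weights sum to the
    number of tasks."""
    ratios = [a / (b + 1e-8) for a, b in zip(losses_t1, losses_t2)]
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    k = len(exps)
    return [k * e / total for e in exps]
```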