Corpus ID: 243833015

Hybrid Spectrogram and Waveform Source Separation

@article{Defossez2021HybridSA,
  title={Hybrid Spectrogram and Waveform Source Separation},
  author={Alexandre Défossez},
  journal={ArXiv},
  year={2021},
  volume={abs/2111.03600}
}
Source separation models work either in the spectrogram or in the waveform domain. In this work, we show how to perform end-to-end hybrid source separation, letting the model decide which domain is best suited for each source, and even combining both. The proposed hybrid version of the Demucs architecture (Défossez et al., 2019) won the Music Demixing Challenge 2021 organized by Sony. This architecture also comes with additional improvements, such as compressed residual branches, local attention or… 
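The core idea lends itself to a compact sketch. The following is a minimal, hypothetical PyTorch illustration of a two-branch hybrid separator, not the actual Hybrid Demucs code: a temporal branch processes the raw waveform, a spectral branch masks the mixture STFT, and both estimates are summed in the waveform domain so that end-to-end training can decide how much each domain contributes to each source. All module names and hyperparameters here are illustrative assumptions.

```python
import torch
import torch.nn as nn


class HybridSeparatorSketch(nn.Module):
    """Toy two-branch hybrid separator (illustration only, not Hybrid Demucs)."""

    def __init__(self, sources: int = 4, n_fft: int = 2048, hop: int = 512, hidden: int = 64):
        super().__init__()
        self.sources, self.n_fft, self.hop = sources, n_fft, hop
        # Time-domain branch: mono mixture in, one waveform per source out.
        self.time_branch = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=7, padding=3),
            nn.GELU(),
            nn.Conv1d(hidden, sources, kernel_size=7, padding=3),
        )
        # Spectral branch: predicts one mask per source over the frequency bins.
        freq_bins = n_fft // 2 + 1
        self.spec_branch = nn.Sequential(
            nn.Conv1d(freq_bins, hidden, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv1d(hidden, sources * freq_bins, kernel_size=3, padding=1),
        )

    def forward(self, mix: torch.Tensor) -> torch.Tensor:
        # mix: (batch, samples), mono for simplicity.
        batch, samples = mix.shape

        # Waveform path works directly on the raw samples.
        wave_est = self.time_branch(mix.unsqueeze(1))        # (batch, sources, samples)

        # Spectral path masks the complex mixture spectrogram (mixture phase reused).
        window = torch.hann_window(self.n_fft, device=mix.device)
        spec = torch.stft(mix, self.n_fft, self.hop, window=window,
                          return_complex=True)               # (batch, freq, frames)
        masks = self.spec_branch(spec.abs())                 # (batch, sources*freq, frames)
        masks = masks.view(batch, self.sources, -1, spec.shape[-1]).sigmoid()
        masked = masks * spec.unsqueeze(1)                   # (batch, sources, freq, frames)
        spec_est = torch.istft(
            masked.reshape(batch * self.sources, *spec.shape[1:]),
            self.n_fft, self.hop, window=window, length=samples,
        ).view(batch, self.sources, samples)

        # Summing in the waveform domain lets the model apportion work
        # between the two domains for each source.
        return wave_est + spec_est
```

For example, `HybridSeparatorSketch()(torch.randn(2, 44100))` returns a tensor of shape `(2, 4, 44100)`: four estimated stems per mixture, each carrying gradients through both the waveform and the spectrogram path.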

Citations

Towards Low-distortion Multi-channel Speech Enhancement: The ESPNet-SE Submission to The L3DAS22 Challenge
TLDR
The proposed method, which combines Deep Neural Network-driven complex spectral mapping with linear beamformers such as the multi-frame multi-channel Wiener filter, was ranked first in the challenge, achieving a ranking metric of 0.984, versus 0.833 for the challenge baseline.
Removing Distortion Effects in Music Using Deep Neural Networks
TLDR
This paper focuses on removing distortion and clipping applied to guitar tracks for music production, presenting a comparative investigation of different deep neural network (DNN) architectures on this task and achieving exceptionally good results in distortion removal.
Implicit Neural Spatial Filtering for Multichannel Source Separation in the Waveform Domain
TLDR
This work presents a single-stage causal waveform-to-waveform multichannel model that can separate moving sound sources based on their broad spatial locations in a dynamic acoustic scene, and shows that the model matches the performance of an oracle beamformer followed by a state-of-the-art single-channel enhancement network.
Music Source Separation with Generative Flow
TLDR
Experiments show that in singing-voice and music separation tasks the proposed systems achieve results competitive with a fully supervised system, and that one variant of the proposed systems is capable of separating new source tracks effortlessly.
Music Demixing Challenge 2021
TLDR
The Music Demixing Challenge was held on a crowd-based machine learning competition platform, with the task of separating stereo songs into four instrument stems; its dataset provides a wider range of music genres and involved a greater number of mixing engineers.
Towards Transfer Learning of wav2vec 2.0 for Automatic Lyric Transcription
TLDR
This work proposes a transfer-learning-based ALT solution that takes advantage of the similarities between speech and singing by adapting wav2vec 2.0, an SSL ASR model, to the singing domain and enhances the performance by extending the original CTC model to a hybrid CTC/attention model.
Deep Audio Waveform Prior
TLDR
This work shows that existing State-Of-The-Art (SOTA) architectures for audio source separation contain deep priors even when working with the raw waveform.
GAFX: A General Audio Feature eXtractor
TLDR
A General Audio Feature eXtractor (GAFX) is proposed, based on dual U-Net, ResNet, and Attention modules, together with a GAFX-U model which, following the Audio Spectrogram Transformer (AST) classifier, achieves competitive performance.

References

Showing 1-10 of 34 references
Music Source Separation in the Waveform Domain
TLDR
Demucs, a new waveform-to-waveform model with an architecture closer to models for audio generation and more capacity in the decoder, is proposed; human evaluations show that Demucs has significantly higher quality than Conv-TasNet but slightly more contamination from other sources, which explains the difference in SDR.
End-to-end music source separation: is it possible in the waveform domain?
TLDR
A WaveNet-based model is proposed, and both it and Wave-U-Net can outperform DeepConvSep, a recent spectrogram-based deep learning model; the results confirm that waveform-based models can perform similarly to (if not better than) spectrogram-based ones.
Decoupling Magnitude and Phase Estimation with Deep ResUNet for Music Source Separation
TLDR
This work proposes to estimate phases by estimating complex ideal ratio masks (cIRMs), decoupling the estimation of cIRMs into separate magnitude and phase estimates, and extends the separation method to effectively allow the magnitude of the mask to be larger than 1 (a toy sketch of this decoupling appears after the reference list).
Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation
TLDR
The Wave-U-Net is proposed, an adaptation of the U-Net to the one-dimensional time domain, which repeatedly resamples feature maps to compute and combine features at different time scales and indicates that its architecture yields a performance comparable to a state-of-the-art spectrogram-based U-Net architecture, given the same data.
Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation
Yi Luo, N. Mesgarani. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019
TLDR
A fully convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation, which significantly outperforms previous time–frequency masking methods in separating two- and three-speaker mixtures.
All For One And One For All: Improving Music Separation By Bridging Networks
TLDR
Experimental results show that the performance of Open-Unmix (UMX), a well-known and state-of-the-art open-source library for music separation, can be improved by utilizing a multi-domain loss (MDL) and two combination schemes.
MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis
TLDR
The model is non-autoregressive, fully convolutional, with significantly fewer parameters than competing models, and generalizes to unseen speakers for mel-spectrogram inversion; the work also suggests a set of guidelines for designing general-purpose discriminators and generators for conditional sequence synthesis tasks.
LaSAFT: Latent Source Attentive Frequency Transformation For Conditioned Source Separation
TLDR
The LaSAFT block, which captures source-dependent frequency patterns, and GPoCM, an extension of Feature-wise Linear Modulation that modulates internal features, are shown to improve the CUNet's performance, achieving state-of-the-art SDR on several MUSDB18 source separation tasks.
Investigating U-Nets with various Intermediate Blocks for Spectrogram-based Singing Voice Separation
TLDR
A variety of intermediate spectrogram transformation blocks are introduced and a variety of U-nets based on these blocks are implemented and trained on complex-valued spectrograms to consider both magnitude and phase.
Performance measurement in blind audio source separation
TLDR
This paper considers four different sets of allowed distortions in blind audio source separation algorithms, from time-invariant gains to time-varying filters, and derives a global performance measure using an energy ratio, plus a separate performance measure for each error term.
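The energy-ratio idea mentioned in the entry above can be written out compactly. The snippet below is a simplified, assumed formulation of SDR, projecting the estimate onto the reference and treating the residual as distortion; the full BSS_EVAL criteria additionally split that residual into interference and artifact terms and handle the allowed-distortion classes (time-invariant gains up to time-varying filters) described in the paper.

```python
import numpy as np


def simple_sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-9) -> float:
    """Signal-to-distortion ratio as a plain energy ratio (simplified sketch)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    # Project the estimate onto the reference; the projection is the "target"
    # part, and everything left over is counted as distortion.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    distortion = estimate - target
    return 10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(distortion ** 2) + eps))
```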
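Relating back to the "Decoupling Magnitude and Phase Estimation with Deep ResUNet" reference above, here is a toy sketch of the general decoupling idea under assumed parameterisations (not the paper's exact formulation): a complex ratio mask is assembled from a separately estimated, unbounded magnitude and a normalised phase, then applied to the mixture STFT.

```python
import numpy as np


def apply_decoupled_mask(mix_stft, mag_logits, phase_real, phase_imag):
    """Assemble a complex mask from separate magnitude and phase estimates.

    mix_stft              : complex mixture STFT, shape (freq, frames)
    mag_logits            : raw network outputs for the mask magnitude (same shape)
    phase_real, phase_imag: raw network outputs parameterising the mask phase

    The magnitude uses a positive, unbounded activation so it may exceed 1,
    while the phase comes from the normalised (real, imag) pair -- these
    activation choices are illustrative assumptions.
    """
    mask_mag = np.exp(np.clip(mag_logits, -10.0, 3.0))      # positive, can exceed 1
    norm = np.sqrt(phase_real ** 2 + phase_imag ** 2) + 1e-8
    mask_phase = (phase_real + 1j * phase_imag) / norm      # unit-modulus phase term
    return mask_mag * mask_phase * mix_stft                 # separated-source STFT
```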