Multichannel-based Learning for Audio Object Extraction

@article{arteaga2021multichannel,
  title={Multichannel-based Learning for Audio Object Extraction},
  author={Daniel Arteaga and Jordi Pons},
  journal={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021}
}
  • Published 11 February 2021
  • Computer Science
The current paradigm for creating and deploying immersive audio content is based on audio objects, which consist of an audio track plus position metadata. While rendering an object-based production into a multichannel mix is straightforward, the reverse process requires sound source separation and estimation of the spatial trajectories of the extracted sources. Moreover, cinematic object-based productions are often composed of dozens of simultaneous audio objects, which poses a scalability… 
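The "straightforward" forward direction mentioned in the abstract can be illustrated with a minimal sketch: rendering one audio object into a stereo mix via constant-power amplitude panning driven by its position metadata. The function name and the 1-D azimuth convention are illustrative assumptions, not the paper's renderer (cinematic renderers use many more channels and richer panning laws).

```python
import numpy as np

def render_object_stereo(audio, azimuth):
    """Render a mono audio object into a stereo mix with constant-power
    panning. `azimuth` in [-1, 1]: -1 = full left, +1 = full right.
    Illustrative sketch only, not the renderer used in the paper."""
    theta = (azimuth + 1.0) * np.pi / 4.0        # map [-1, 1] -> [0, pi/2]
    gains = np.array([np.cos(theta), np.sin(theta)])  # L/R gains, g_L^2 + g_R^2 = 1
    return gains[:, None] * audio[None, :]       # shape (2, num_samples)

# A full object-based production is then just a sum of rendered objects.
sr = 48000
t = np.arange(sr) / sr
obj1 = np.sin(2 * np.pi * 440 * t)               # one object's audio track
mix = render_object_stereo(obj1, azimuth=-1.0)   # metadata says: hard left
```

The reverse mapping (mix back to objects) is the hard, ill-posed problem the paper addresses: the sum destroys the per-object tracks and trajectories.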

SDR – Half-baked or Well Done?
It is argued here that the signal-to-distortion ratio (SDR) implemented in the BSS_eval toolkit has generally been improperly used and abused, especially in the case of single-channel separation, resulting in misleading results.
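The fix that paper advocates is the scale-invariant SDR (SI-SDR), which projects the estimate onto the reference so that trivially rescaling the output cannot inflate the score. A minimal sketch (function name and the mean-centering convention are assumptions of this sketch):

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB, as advocated in 'SDR - Half-baked or
    Well Done?'. Rescaling `estimate` leaves the score unchanged."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to get the target component.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10(np.sum(target**2) / np.sum(noise**2))
```

A quick sanity check: scaling the estimate by any nonzero factor scales `target` and `noise` identically, so the ratio (and the score) is unchanged.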
FSD50K: An Open Dataset of Human-Labeled Sound Events
FSD50K is introduced: an open dataset containing over 51k audio clips totalling over 100 h of audio, manually labeled using 200 classes drawn from the AudioSet Ontology, to provide an alternative benchmark dataset and thus foster sound event recognition (SER) research.
Unsupervised Sound Separation Using Mixtures of Mixtures
This paper proposes a completely unsupervised method, mixture invariant training (MixIT), that requires only single-channel acoustic mixtures and shows that MixIT can achieve competitive performance compared to supervised methods on speech separation.
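The core of MixIT can be sketched in a few lines: the model is fed a mixture of mixtures and its estimated sources are remixed by a binary matrix A (each source assigned to exactly one reference mixture); the loss is the best score over all such A. This is a simplified sketch with a summed negative-SNR objective; the function name, the brute-force search, and the `1e-8` floor are assumptions of the sketch, not the paper's exact formulation.

```python
import itertools
import numpy as np

def mixit_loss(est_sources, mixtures):
    """Mixture invariant training (MixIT) loss, simplified sketch.
    est_sources: (M, T) model outputs; mixtures: (2, T) reference mixtures.
    Searches all binary mixing matrices A with one-hot columns."""
    M = est_sources.shape[0]
    best = np.inf
    for assign in itertools.product([0, 1], repeat=M):
        A = np.zeros((2, M))
        A[list(assign), np.arange(M)] = 1.0      # source j -> mixture assign[j]
        remix = A @ est_sources                  # remix estimates into 2 mixtures
        err = np.sum((remix - mixtures) ** 2)
        ref = np.sum(mixtures ** 2)
        loss = -10.0 * np.log10(ref / (err + 1e-8))  # negative SNR, lower is better
        best = min(best, loss)
    return best
```

Because only the mixtures (never the isolated sources) appear in the loss, training needs no ground-truth source signals, which is what makes the method unsupervised.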
An Empirical Study of Conv-TasNet
An empirical study of Conv-TasNet is conducted, and an enhancement to the encoder/decoder based on a deep, non-linear variant of it is proposed, which can improve average SI-SNR performance by more than 1 dB.
Multi-Task Self-Supervised Learning for Robust Speech Recognition
PASE+ is proposed, an improved version of PASE that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks and learns transferable representations suitable for highly mismatched acoustic conditions.
DDSP: Differentiable Digital Signal Processing
The Differentiable Digital Signal Processing library is introduced, which enables direct integration of classic signal processing elements with deep learning methods and achieves high-fidelity generation without the need for large autoregressive models or adversarial losses.
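The kind of "classic signal processing element" DDSP makes differentiable can be sketched with a harmonic oscillator: a sum of sinusoids at integer multiples of a fundamental, with per-harmonic amplitudes. In DDSP those controls come from a neural network and gradients flow through the synthesis; this numpy version (function name and shapes are assumptions of the sketch) only shows the synthesis itself.

```python
import numpy as np

def harmonic_synth(f0, amplitudes, sr=16000):
    """Additive harmonic synthesizer, illustrative sketch of a DDSP-style
    module. f0: (T,) per-sample fundamental in Hz;
    amplitudes: (K, T) per-harmonic, per-sample amplitudes."""
    phase = 2.0 * np.pi * np.cumsum(f0) / sr          # instantaneous phase, (T,)
    k = np.arange(1, amplitudes.shape[0] + 1)[:, None]  # harmonic numbers, (K, 1)
    return np.sum(amplitudes * np.sin(k * phase), axis=0)  # (T,)

# Example: a steady 440 Hz tone with three decaying harmonics.
T = 1000
f0 = np.full(T, 440.0)
amps = np.stack([np.full(T, a) for a in (1.0, 0.5, 0.25)])
tone = harmonic_synth(f0, amps)
```

Every operation here (cumsum, sin, multiply, sum) is differentiable, which is the point: written in an autodiff framework, the network controlling `f0` and `amplitudes` can be trained end-to-end through the synthesizer.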
Music Source Separation in the Waveform Domain
Demucs is proposed, a new waveform-to-waveform model with an architecture closer to models for audio generation and more capacity in the decoder. Human evaluations show that Demucs has significantly higher quality than Conv-TasNet but slightly more contamination from other sources, which explains the difference in SDR.
Open-Unmix - A Reference Implementation for Music Source Separation
Open-Unmix provides implementations for the most popular deep learning frameworks, giving researchers a flexible way to reproduce results, and offers a pre-trained model so that end users and even artists can try out source separation.
Universal Sound Separation
A dataset of mixtures containing arbitrary sounds is developed, and the best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation.
Divide and Conquer: A Deep CASA Approach to Talker-Independent Monaural Speaker Separation
The proposed deep CASA approach optimizes frame-level separation and speaker tracking in turn, and produces excellent results for both objectives, with a modest model size.