Multichannel-based Learning for Audio Object Extraction

@inproceedings{Arteaga2021MultichannelbasedLF,
  title={Multichannel-based Learning for Audio Object Extraction},
  author={Daniel Arteaga and Jordi Pons},
  booktitle={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021},
  pages={206--210}
}
  • Published 11 February 2021
The current paradigm for creating and deploying immersive audio content is based on audio objects, each composed of an audio track and position metadata. While rendering an object-based production into a multichannel mix is straightforward, the reverse process involves sound source separation and estimating the spatial trajectories of the extracted sources. Moreover, cinematic object-based productions are often composed of dozens of simultaneous audio objects, which poses a scalability… 
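The forward rendering step the abstract calls straightforward can be sketched, for the simple stereo case, as constant-power panning driven by the position metadata. This is a minimal illustration, not the paper's renderer: a real object renderer uses 3-D position metadata and many output channels, and the function names and 1-D pan parametrization here are assumptions.

```python
import numpy as np

def render_object_stereo(track, pan):
    """Render one mono audio object to stereo with constant-power panning.

    track: 1-D array of samples; pan: position metadata in [-1, 1]
    (-1 = hard left, +1 = hard right). Illustrative stand-in for a full
    object renderer.
    """
    theta = (pan + 1.0) * np.pi / 4.0                 # map [-1, 1] -> [0, pi/2]
    gains = np.array([np.cos(theta), np.sin(theta)])  # (left, right) gains
    return gains[:, None] * track[None, :]            # shape: (2, num_samples)

def render_mix(objects):
    """Sum several (track, pan) objects into one stereo mix."""
    return sum(render_object_stereo(t, p) for t, p in objects)
```

The squared channel gains always sum to one, so perceived loudness stays roughly constant as an object moves; the reverse direction (recovering each `track` and `pan` from the summed mix) is the separation-and-localization problem the paper addresses.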

References

SHOWING 1-10 OF 25 REFERENCES
SDR – Half-baked or Well Done?
TLDR: It is argued here that the signal-to-distortion ratio (SDR) implemented in the BSS_eval toolkit has generally been improperly used and abused, especially in the case of single-channel separation, resulting in misleading results.
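The SDR critique summarized above motivated the scale-invariant SDR (SI-SDR) that is now common in separation work. A minimal numpy sketch of the metric (the function name and variable names are illustrative):

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB.

    Projects the estimate onto the reference to obtain the target
    component, so rescaling the estimate does not change the score.
    """
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference          # part of the estimate aligned with the reference
    noise = estimate - target           # everything else counts as distortion
    return 10.0 * np.log10(np.dot(target, target) / np.dot(noise, noise))
```

Because the projection absorbs any global gain, `si_sdr(3 * est, ref)` equals `si_sdr(est, ref)`, which is exactly the property the plain BSS_eval SDR lacks.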
FSD50K: An Open Dataset of Human-Labeled Sound Events
TLDR: FSD50K is introduced, an open dataset containing over 51k audio clips totalling over 100 h of audio manually labeled using 200 classes drawn from the AudioSet Ontology, to provide an alternative benchmark dataset and thus foster SER research.
An Empirical Study of Conv-Tasnet
TLDR: An empirical study of Conv-TasNet is conducted, and an enhancement to the encoder/decoder based on a (deep) non-linear variant of it is proposed that can improve average SI-SNR performance by more than 1 dB.
DDSP: Differentiable Digital Signal Processing
TLDR: The Differentiable Digital Signal Processing library is introduced, which enables direct integration of classic signal processing elements with deep learning methods and achieves high-fidelity generation without the need for large autoregressive models or adversarial losses.
Multi-Task Self-Supervised Learning for Robust Speech Recognition
TLDR: PASE+ is proposed, an improved version of PASE that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks and learns transferable representations suitable for highly mismatched acoustic conditions.
Unsupervised Sound Separation Using Mixtures of Mixtures
TLDR: This paper proposes a completely unsupervised method, mixture invariant training (MixIT), that requires only single-channel acoustic mixtures and shows that MixIT can achieve competitive performance compared to supervised methods on speech separation.
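The MixIT objective summarized above can be sketched as a brute-force search over source-to-mixture assignments: the model separates a mixture of two mixtures, and the loss keeps whichever assignment of estimated sources back to the two input mixtures reconstructs them best. A toy numpy version with illustrative names, not the paper's implementation:

```python
import itertools
import numpy as np

def mixit_loss(est_sources, mix1, mix2):
    """MixIT-style loss over a mixture of two mixtures.

    est_sources: array of shape (num_sources, num_samples), the model output.
    Tries every way of assigning each estimated source to mix1 or mix2
    and returns the smallest mean squared reconstruction error.
    """
    num_sources = est_sources.shape[0]
    targets = np.stack([mix1, mix2])
    best = np.inf
    for assignment in itertools.product([0, 1], repeat=num_sources):
        remix = np.zeros_like(targets)
        for source, which_mix in zip(est_sources, assignment):
            remix[which_mix] += source      # re-mix sources into two estimates
        best = min(best, ((remix - targets) ** 2).mean())
    return best
```

The exhaustive search is exponential in the number of sources, which is fine for the handful of sources used in practice; no isolated-source references are needed, which is the point of the method.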
Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations
TLDR: This paper theoretically shows that the unsupervised learning of disentangled representations is fundamentally impossible without inductive biases on both the models and the data, and trains more than 12,000 models covering most prominent methods and evaluation metrics on seven different data sets.
Divide and Conquer: A Deep CASA Approach to Talker-Independent Monaural Speaker Separation
TLDR: The proposed deep CASA approach optimizes frame-level separation and speaker tracking in turn, and produces excellent results for both objectives, with a modest model size.
End-to-end music source separation: is it possible in the waveform domain?
TLDR: A Wavenet-based model is proposed, and Wave-U-Net is shown to outperform DeepConvSep, a recent spectrogram-based deep learning model; the results confirm that waveform-based models can perform similarly to (if not better than) spectrogram-based models.
Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks
TLDR: Experiments show that the proposed improved self-supervised method can learn transferable, robust, and problem-agnostic features that carry on relevant information from the speech signal, such as speaker identity, phonemes, and even higher-level features such as emotional cues.