Multichannel-based Learning for Audio Object Extraction
@article{Arteaga2021MultichannelbasedLF,
  title={Multichannel-based Learning for Audio Object Extraction},
  author={Daniel Arteaga and Jordi Pons},
  journal={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021},
  pages={206-210}
}
The current paradigm for creating and deploying immersive audio content is based on audio objects, which consist of an audio track plus position metadata. While rendering an object-based production into a multichannel mix is straightforward, the reverse process involves separating the sound sources and estimating the spatial trajectories of the extracted sources. Moreover, cinematic object-based productions often comprise dozens of simultaneous audio objects, which poses a scalability…
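To make the forward (rendering) direction concrete, below is a minimal sketch of rendering one audio object into a stereo mix with constant-power amplitude panning. The function name render_object, the stereo layout, and the [-1, 1] azimuth convention are illustrative assumptions only; the paper targets cinematic multichannel layouts with more channels and more elaborate panning laws.

```python
import numpy as np

def render_object(audio: np.ndarray, azimuths: np.ndarray) -> np.ndarray:
    """Render a mono object track into a (num_samples, 2) stereo mix.

    `audio` is the object's audio track; `azimuths` is its position
    metadata, one value per sample in [-1, 1] (-1 = hard left,
    0 = center, +1 = hard right). Constant-power panning uses gains
    (cos(theta), sin(theta)) with theta in [0, pi/2].
    """
    theta = (azimuths + 1.0) * np.pi / 4.0  # map [-1, 1] -> [0, pi/2]
    gains = np.stack([np.cos(theta), np.sin(theta)], axis=1)  # (n, 2)
    return audio[:, None] * gains

# A multichannel mix is just the sum of all rendered objects; the
# reverse process studied in the paper must split such a mix back
# into per-object audio tracks and position trajectories.
fs = 48000
t = np.arange(fs) / fs
obj = np.sin(2 * np.pi * 440.0 * t)   # one second of a 440 Hz tone
traj = np.linspace(-1.0, 1.0, fs)     # object sweeps left to right
mix = render_object(obj, traj)        # shape: (48000, 2)
```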
References
Showing 1-10 of 25 references
SDR – Half-baked or Well Done?
- ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
It is argued that the signal-to-distortion ratio (SDR) as implemented in the BSS_eval toolkit has generally been improperly used and abused, especially in the case of single-channel separation, resulting in misleading results; scale-invariant SDR (SI-SDR) is proposed as a more robust alternative (see the SI-SDR sketch after this reference list).
FSD50K: An Open Dataset of Human-Labeled Sound Events
- IEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2022
FSD50K is introduced: an open dataset of over 51k audio clips totalling over 100 hours of audio, manually labeled with 200 classes drawn from the AudioSet Ontology, intended as an alternative benchmark to foster sound event recognition (SER) research.
Unsupervised Sound Separation Using Mixtures of Mixtures
- arXiv
- 2020
This paper proposes a completely unsupervised method, mixture invariant training (MixIT), that requires only single-channel acoustic mixtures and shows that MixIT can achieve competitive performance compared to supervised methods on speech separation.
An Empirical Study of Conv-Tasnet
- ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
An empirical study of Conv-TasNet is conducted, and an enhancement to the encoder/decoder based on a deep non-linear variant of it is proposed, improving average SI-SNR performance by more than 1 dB.
Multi-Task Self-Supervised Learning for Robust Speech Recognition
- ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
PASE+ is proposed, an improved version of PASE that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks and learns transferable representations suitable for highly mismatched acoustic conditions.
DDSP: Differentiable Digital Signal Processing
- ICLR
- 2020
The Differentiable Digital Signal Processing library is introduced, which enables direct integration of classic signal processing elements with deep learning methods and achieves high-fidelity generation without the need for large autoregressive models or adversarial losses.
Music Source Separation in the Waveform Domain
- arXiv
- 2019
Demucs is proposed, a waveform-to-waveform model with an architecture closer to audio-generation models and more capacity in the decoder; human evaluations show that Demucs has significantly higher quality than Conv-TasNet but slightly more contamination from other sources, which explains the difference in SDR.
Open-Unmix - A Reference Implementation for Music Source Separation
- Journal of Open Source Software
- 2019
Open-Unmix provides implementations for the most popular deep learning frameworks, giving researchers a flexible way to reproduce results, and offers a pre-trained model for end users and even artists to try source separation.
Universal Sound Separation
- 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
- 2019
A dataset of mixtures containing arbitrary sounds is developed, and the best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation.
Divide and Conquer: A Deep CASA Approach to Talker-Independent Monaural Speaker Separation
- IEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2019
The proposed deep CASA approach optimizes frame-level separation and speaker tracking in turn, and produces excellent results for both objectives, with a modest model size.
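Several of the references above report gains in SI-SNR / scale-invariant SDR. For readers unfamiliar with the metric, here is a minimal NumPy sketch of SI-SDR as defined in "SDR – Half-baked or Well Done?"; the variable names are illustrative.

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant SDR in dB, following Le Roux et al. (2019).

    Projecting the estimate onto the reference removes any overall
    gain on the estimate, which addresses the main misuse of
    BSS_eval's SDR described above.
    """
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference   # scaled reference component
    noise = estimate - target    # everything else counts as error
    return 10.0 * np.log10(np.sum(target**2) / np.sum(noise**2))

# Example: rescaling the estimate by 6 dB leaves SI-SDR untouched.
rng = np.random.default_rng(0)
s = rng.standard_normal(16000)
est = s + 0.1 * rng.standard_normal(16000)
print(si_sdr(est, s), si_sdr(2.0 * est, s))  # identical values
```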