The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks

  title={The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks},
  author={Darius Petermann and Gordon Wichern and Zhong-Qiu Wang and Jonathan Le Roux},
  journal={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
The cocktail party problem, which aims to isolate any source of interest within a complex acoustic scene, has long inspired audio source separation research. Recent efforts have mainly focused on separating speech from noise, speech from speech, musical instruments from each other, or sound events from each other. However, separating an audio mixture (e.g., a movie soundtrack) into the three broad categories of speech, music, and sound effects (understood to include ambient noise and natural sound…

Figures and Tables from this paper

Disentangling the Impacts of Language and Channel Variability on Speech Separation Networks

A new solution for channel mismatch is provided via a projection-based evaluation, in which channel similarity can be measured and used to select additional training data that improves performance on in-the-wild test data.

Towards End-to-end Speaker Diarization in the Wild

It is shown that an attractor-based end-to-end system can also perform remarkably well in the latter scenario when pre-trained on a carefully-designed simulated dataset that matches the distribution of in-the-wild recordings.

Conformer Space Neural Architecture Search for Multi-Task Audio Separation

This paper quantitatively analyzes the redundancy of the EAD-Conformer network and proposes an efficient K-path search method that finds optimal architectures within the Conformer-based search space, outperforming existing methods in both efficiency and effectiveness.

WA-Transformer: Window Attention-based Transformer with Two-stage Strategy for Multi-task Audio Source Separation

This paper designs a two-dimensional window attention block with dilation and proposes a window attention-based Transformer network (WA-Transformer) for multi-task audio source separation, adopting self-attention and window attention blocks to model global dependencies and local correlations in a parameter-efficient way.

Tiny-Sepformer: A Tiny Time-Domain Transformer Network for Speech Separation

Tiny-Sepformer, a tiny time-domain Transformer network for speech separation, is proposed; it greatly reduces model size while achieving separation performance comparable to the vanilla Sepformer on the WSJ0-2/3Mix datasets.

WHAM!: Extending Speech Separation to Noisy Environments

The WSJ0 Hipster Ambient Mixtures (WHAM!) dataset is created, consisting of two-speaker mixtures from the wsj0-2mix dataset combined with real ambient noise samples, and is used to benchmark various speech separation architectures and objective functions for robustness to noise.
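Building such noisy mixtures typically means scaling the noise to a target signal-to-noise ratio before adding it to the clean speech. A minimal sketch of that scaling step (the function name and exact convention are assumptions for illustration, not the WHAM! recipe):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add noise to speech so the resulting speech-to-noise ratio is snr_db."""
    speech_power = np.mean(speech**2)
    noise_power = np.mean(noise**2)
    # Gain that brings the noise to the desired level below the speech.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise
```

The achieved SNR can be verified by comparing the speech power to the power of the added (scaled) noise component.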

Improving Universal Sound Separation Using Sound Classification

This paper shows that semantic embeddings extracted from a sound classifier can be used to condition a separation network, providing it with useful additional information, and establishes a new state-of-the-art for universal sound separation.
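One common way to inject classifier embeddings into a separation network is feature-wise linear modulation (FiLM), where the embedding is projected to a per-channel scale and shift. The sketch below is illustrative only and not necessarily the conditioning mechanism of the cited paper; the projection matrices would normally be learned:

```python
import numpy as np

def film_condition(features: np.ndarray, embedding: np.ndarray,
                   w_scale: np.ndarray, w_shift: np.ndarray) -> np.ndarray:
    """FiLM conditioning: map the embedding to a per-channel scale (gamma)
    and shift (beta), then apply them across time.

    features: (channels, time); embedding: (embed_dim,)
    w_scale, w_shift: (channels, embed_dim) projection matrices.
    """
    gamma = w_scale @ embedding  # per-channel scale, shape (channels,)
    beta = w_shift @ embedding   # per-channel shift, shape (channels,)
    return gamma[:, None] * features + beta[:, None]
```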

Multi-Task Audio Source Separation

A new multi-task audio source separation (MTASS) challenge is launched to separate speech, music, and noise signals from a monaural mixture, and an MTASS model in the complex domain is proposed to fully exploit the differences in spectral characteristics of the three audio signals.

Open-Unmix - A Reference Implementation for Music Source Separation

Open-Unmix provides implementations for the most popular deep learning frameworks, giving researchers a flexible way to reproduce results, and ships a pre-trained model that lets end users and even artists try out source separation.

Finding Strength in Weakness: Learning to Separate Sounds With Weak Supervision

This work proposes objective functions and network architectures that enable training a source separation system with weak labels and benchmarks the performance of the algorithm using synthetic mixtures of overlapping events created from a database of sounds recorded in urban environments.

Universal Sound Separation

A dataset of mixtures containing arbitrary sounds is developed, and the best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation.
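The scale-invariant signal-to-distortion ratio (SI-SDR) reported above is computed by projecting the estimate onto the reference and comparing the energies of the scaled target and the residual. A minimal NumPy sketch of the standard definition:

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant signal-to-distortion ratio (SI-SDR) in dB.

    Both signals are zero-meaned first; the optimal projection rescales
    the reference, so the metric ignores overall gain differences.
    """
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling of the reference toward the estimate.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10(np.sum(target**2) / np.sum(noise**2))
```

An estimate that matches the reference up to noise that is 20 dB down yields an SI-SDR of 20 dB.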

Artificially Synthesising Data for Audio Classification and Segmentation to Improve Speech and Music Detection in Radio Broadcast

  • S. Venkatesh, D. Moffat, E. Miranda
  • Computer Science
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
The data synthesis procedure is demonstrated to be a highly effective technique for generating large datasets to train deep neural networks for audio segmentation, outperforming state-of-the-art algorithms for music-speech detection.

All-in-One Transformer: Unifying Speech Recognition, Audio Tagging, and Event Detection

This work demonstrates that the AIO Transformer achieves better performance than all baseline systems across various recent DCASE challenge tasks and is suited to total transcription of an acoustic scene, i.e., simultaneously transcribing speech and recognizing the acoustic events occurring in it.

Weakly Informed Audio Source Separation

A separation model is proposed that exploits weak side information for the separation task while aligning it to the mixture as a byproduct, using an attention mechanism; it is demonstrated on a singing voice separation task exploiting artificial side information with different levels of expressiveness.

Continuous Speech Separation: Dataset and Analysis

  • Zhuo Chen, T. Yoshioka, Jinyu Li
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
A new real-recording dataset, called LibriCSS, is derived from LibriSpeech by concatenating corpus utterances to simulate conversations and capturing the audio replays with far-field microphones, helping researchers develop systems that can be readily applied to real scenarios.