Corpus ID: 235457988

A Hands-on Comparison of DNNs for Dialog Separation Using Transfer Learning from Music Source Separation

@article{Strauss2021AHC,
  title={A Hands-on Comparison of DNNs for Dialog Separation Using Transfer Learning from Music Source Separation},
  author={Martin Strauss and Jouni Paulus and Matteo Torcoli and Bernd Edler},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.09093}
}
This paper describes a hands-on comparison of using state-of-the-art music source separation deep neural networks (DNNs), before and after task-specific fine-tuning, for separating speech content from non-speech content in broadcast audio (i.e., dialog separation). The music separation models are selected because they share the number of channels (2) and the sampling rate (44.1 kHz or higher) with the considered broadcast content, and vocals separation in music is considered as a parallel for dialog…


References

Showing 1-10 of 36 references
Open-Unmix - A Reference Implementation for Music Source Separation
Open-Unmix provides implementations for the most popular deep learning frameworks, giving researchers a flexible way to reproduce results, and provides a pre-trained model for end users and even artists to try out source separation.
Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation
  • Yi Luo, N. Mesgarani
  • Computer Science, Medicine
  • IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2019
A fully convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation, which significantly outperforms previous time–frequency masking methods in separating two- and three-speaker mixtures.
Supervised Speech Separation Based on Deep Learning: An Overview
  • Deliang Wang, Jitong Chen
  • Computer Science, Medicine
  • IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2018
This paper provides a comprehensive overview of the research on deep learning based supervised speech separation in the last several years, and provides a historical perspective on how advances have been made.
Adversarial Semi-Supervised Audio Source Separation Applied to Singing Voice Extraction
  • D. Stoller, S. Ewert, S. Dixon
  • Computer Science, Mathematics
  • 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
This work adopts adversarial training for music source separation, with the aim of driving the separator towards outputs deemed realistic by discriminator networks trained to distinguish real samples from separator outputs.
A comprehensive study of speech separation: spectrogram vs waveform separation
The experimental results show that spectrogram separation can achieve competitive performance with better network design, and a solution for directly optimizing the separation criterion in frequency-domain networks is introduced.
Looking to listen at the cocktail party
A deep network-based model that incorporates both visual and auditory signals to isolate a single speech signal from a mixture of sounds such as other speakers and background noise, showing a clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech.
An Efficient Model for Estimating Subjective Quality of Separated Audio Source Signals
  • T. Kastner, J. Herre
  • Computer Science
  • 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
  • 2019
A model for prediction of the perceived audio quality of separated audio source signals is presented, based solely on two timbre features and demanding less computational effort than current perceptual measurement schemes for audio source separation.
Source Separation for Enabling Dialogue Enhancement in Object-based Broadcast with MPEG-H
Dialogue Enhancement (DE) is one of the most promising applications of user interactivity enabled by object-based audio broadcasting. DE allows personalization of the relative level of dialogue for…
MUSDB18-HQ - an uncompressed version of MUSDB18
MUSDB18-HQ is the uncompressed version of the MUSDB18 dataset. It consists of a total of 150 full-track songs of different styles and includes both the stereo mixtures and the original sources, …
Spleeter: a fast and efficient music source separation tool with pre-trained models
The performance of the pre-trained models is very close to the published state of the art, and Spleeter is one of the best performing 4-stem separation models on the common musdb18 benchmark (Rafii, Liutkus, …