Corpus ID: 211817858

Voice Separation with an Unknown Number of Multiple Speakers

@inproceedings{Nachmani2020VoiceSW,
  title={Voice Separation with an Unknown Number of Multiple Speakers},
  author={Eliya Nachmani and Yossi Adi and Lior Wolf},
  booktitle={ICML},
  year={2020}
}
We present a new method for separating a mixed audio sequence, in which multiple voices speak simultaneously. The new method employs gated neural networks that are trained to separate the voices at multiple processing steps, while maintaining the speaker in each output channel fixed. A different model is trained for every number of possible speakers, and the model with the largest number of speakers is employed to select the actual number of speakers in a given sample. Our method greatly…
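For intuition, the selection step described in the abstract can be sketched in a few lines of Python. The sketch below is an illustration of the idea rather than the paper's exact procedure: the `models` mapping, the per-channel energy heuristic, and the silence threshold are all assumptions.

```python
import numpy as np

def select_num_speakers(mixture, models, max_speakers, silence_db=-40.0):
    """Hypothetical sketch: use the largest-C model to estimate the
    number of active speakers, then separate with the matching model.

    `models[c]` is assumed to be a callable mapping a 1-D mixture to
    `c` estimated source waveforms (a stand-in for the trained nets).
    """
    # Separate with the model trained for the largest speaker count.
    candidates = models[max_speakers](mixture)

    # Count output channels whose energy is above a silence threshold.
    mix_power = np.mean(mixture ** 2) + 1e-12
    active = 0
    for est in candidates:
        rel_db = 10.0 * np.log10(np.mean(est ** 2) / mix_power + 1e-12)
        if rel_db > silence_db:
            active += 1
    active = max(active, 1)

    # Re-run the model that was trained for exactly that many speakers.
    return active, models[active](mixture)
```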
Citations

Sequence to Multi-Sequence Learning via Conditional Chain Mapping for Mixture Signals
TLDR
This work extends the standard sequence-to-sequence model to a conditional multi-sequence model, which explicitly models the relevance between multiple output sequences with the probabilistic chain rule, and can conditionally infer output sequences one by one by making use of both the input and previously estimated contextual output sequences.
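The chain-rule factorization lends itself to a simple sketch: estimate one source at a time, feeding previous estimates back in as context. The module below is a minimal, hypothetical PyTorch illustration; the layer types and sizes are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ConditionalChainSeparator(nn.Module):
    """Minimal sketch of seq2multi-seq separation via the chain rule:
    each source estimate is conditioned on the mixture and on all
    previously estimated sources."""

    def __init__(self, feat_dim=128):
        super().__init__()
        # One shared step network: input = mixture features + running context.
        self.step = nn.GRU(2 * feat_dim, feat_dim, batch_first=True)
        self.out = nn.Linear(feat_dim, feat_dim)

    def forward(self, mix_feats, max_sources):
        # mix_feats: (batch, time, feat_dim)
        context = torch.zeros_like(mix_feats)  # no sources estimated yet
        estimates = []
        for _ in range(max_sources):
            h, _ = self.step(torch.cat([mix_feats, context], dim=-1))
            est = self.out(h)            # models p(s_i | mixture, s_1..s_{i-1})
            estimates.append(est)
            context = context + est      # fold the estimate into the context
        return estimates
```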
Multi-Decoder DPRNN: Source Separation for Variable Number of Speakers
  • Junzhe Zhu, Raymond A. Yeh, M. Hasegawa-Johnson
  • Computer Science
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
TLDR
The approach extends the MulCat source separation backbone with additional output heads: a count-head to infer the number of speakers, and decoder-heads for reconstructing the original signals, and addresses how to evaluate separation quality when the ground truth contains more or fewer speakers than the model predicts.
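The head structure can be illustrated as follows. This is a hypothetical sketch of the count-head/decoder-head idea only; the backbone, layer sizes, and pooling are assumptions, not the paper's DPRNN-based architecture.

```python
import torch
import torch.nn as nn

class MultiDecoderSketch(nn.Module):
    """Illustrative sketch: a shared separation backbone, a count-head
    that classifies the number of speakers, and one decoder-head per
    supported speaker count."""

    def __init__(self, feat_dim=64, max_speakers=4):
        super().__init__()
        self.backbone = nn.GRU(feat_dim, feat_dim, batch_first=True)
        # Count-head: predicts a distribution over 1..max_speakers.
        self.count_head = nn.Linear(feat_dim, max_speakers)
        # One decoder-head per possible speaker count.
        self.decoders = nn.ModuleList(
            nn.Linear(feat_dim, c * feat_dim) for c in range(1, max_speakers + 1)
        )

    def forward(self, feats):                       # feats: (B, T, F)
        h, _ = self.backbone(feats)
        count_logits = self.count_head(h.mean(dim=1))  # pool over time
        c = int(count_logits.argmax(dim=-1)[0]) + 1    # predicted #speakers
        masks = self.decoders[c - 1](h)                # (B, T, c*F)
        return count_logits, masks.chunk(c, dim=-1)
```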
Sandglasset: A Light Multi-Granularity Self-Attentive Network for Time-Domain Speech Separation
  • Max W. Y. Lam, J. Wang, Dan Su, Dong Yu
  • Computer Science, Engineering
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
TLDR
This work introduces a self-attentive network with a novel sandglass shape, namely Sandglasset, which advances state-of-the-art (SOTA) speech separation performance at significantly smaller model size and computational cost.
Single Channel Voice Separation for Unknown Number of Speakers Under Reverberant and Noisy Settings
TLDR
A unified network for voice separation of an unknown number of speakers is presented and it is suggested that the proposed approach is superior to the baseline model by a significant margin.
Tune-In: Training Under Negative Environments with Interference for Attention Networks Simulating Cocktail Party Effect
TLDR
It turns out that substantially discriminative and generalizable speaker representations can be learned under severely interfered conditions via self-supervised training, and Tune-In consistently achieves remarkably better speech separation performance in terms of SI-SNRi and SDRi than the state of the art in all test modes, especially at lower memory and computational consumption.
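SI-SNRi and SDRi measure improvement over the unprocessed mixture. For reference, scale-invariant SNR can be computed as below; this is the standard definition, not code from the paper.

```python
import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB.

    Projects the estimate onto the reference so that the metric is
    insensitive to rescaling of the estimated signal."""
    estimate = estimate - np.mean(estimate)
    reference = reference - np.mean(reference)
    # Scaled projection of the estimate onto the reference direction.
    s_target = (np.dot(estimate, reference)
                / (np.dot(reference, reference) + eps)) * reference
    e_noise = estimate - s_target
    return 10.0 * np.log10((np.dot(s_target, s_target) + eps)
                           / (np.dot(e_noise, e_noise) + eps))

# SI-SNRi is si_snr(estimate, reference) minus si_snr(mixture, reference).
```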
Multi-Decoder DPRNN: High Accuracy Source Counting and Separation
TLDR
This work proposes an end-to-end trainable approach to single-channel speech separation with an unknown number of speakers by extending the MulCat source separation backbone with additional output heads: a count-head to infer the number of speakers, and decoder-heads for reconstructing the original signals.
Toward the pre-cocktail party problem with TasTas+
TLDR
A new approach to monaural speech separation in pre-cocktail party problems is proposed, called TasTas+, which takes the mixed utterance of five speakers and maps it to five separated utterances, where each utterance contains only one speaker's voice.
Many-Speakers Single Channel Speech Separation with Optimal Permutation Training
TLDR
This work presents a permutation invariant training that employs the Hungarian algorithm in order to train with an O(C³) time complexity, where C is the number of speakers, in comparison to the O(C!) of PIT-based methods.
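The assignment step can be written directly with SciPy's Hungarian solver, which finds the minimum-cost matching of estimates to references in polynomial time instead of enumerating all C! permutations. A minimal sketch, where `pair_loss` is an assumed placeholder for a per-pair loss such as negative SI-SNR:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_pit_loss(estimates, references, pair_loss):
    """Optimal-permutation training loss via the Hungarian algorithm.

    estimates, references: arrays of shape (C, T) holding C sources.
    pair_loss(est, ref) -> scalar loss for one estimate/reference pair.
    Avoids the O(C!) enumeration used by classic PIT.
    """
    C = len(estimates)
    # Build the C x C pairwise loss matrix.
    cost = np.array([[pair_loss(estimates[i], references[j]) for j in range(C)]
                     for i in range(C)])
    # Hungarian algorithm: minimum-cost perfect matching.
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean(), cols  # loss and the chosen permutation
```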
Online Self-Attentive Gated RNNs for Real-Time Speaker Separation
TLDR
This study converts a non-causal state-of-the-art separation model into a causal and real-time model and evaluates its performance under both online and offline settings, shedding light on the relative difference between causal and non-causal models when performing separation.
SAGRNN: Self-Attentive Gated RNN For Binaural Speaker Separation With Interaural Cue Preservation
TLDR
This study extends a newly developed gated recurrent neural network for monaural separation by additionally incorporating self-attention mechanisms and dense connectivity, and develops an end-to-end multiple-input multiple-output system that directly maps from the binaural waveform of the mixture to those of the speech signals.

References

Showing 1-10 of 58 references
Permutation invariant training of deep models for speaker-independent multi-talker speech separation
TLDR
This work proposes a novel deep learning training criterion, named permutation invariant training (PIT), for speaker-independent multi-talker speech separation, and finds that it compares favorably to non-negative matrix factorization (NMF), computational auditory scene analysis (CASA), and DPCL, and generalizes well over unseen speakers and languages.
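For contrast with the Hungarian-based variant cited above, classic PIT can be sketched as an explicit search over every assignment of model outputs to references, which is exact but factorial in the number of sources. The `pair_loss` argument is again an assumed placeholder.

```python
import itertools
import numpy as np

def pit_loss(estimates, references, pair_loss):
    """Classic permutation invariant training loss: evaluate every
    assignment of estimates to references and keep the best one.
    Exact but O(C!) in the number of sources C."""
    C = len(estimates)
    best = np.inf
    for perm in itertools.permutations(range(C)):
        loss = np.mean([pair_loss(estimates[i], references[perm[i]])
                        for i in range(C)])
        best = min(best, loss)
    return best
```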
Deep clustering: Discriminative embeddings for segmentation and separation
TLDR
Preliminary experiments on single-channel mixtures from multiple speakers show that a speaker-independent model trained on two-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of 6 dB, and the same model does surprisingly well with three-speaker mixtures.
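The deep clustering objective pushes time-frequency embeddings of the same source together and those of different sources apart; one common formulation is ||V Vᵀ - Y Yᵀ||²_F. The sketch below computes it in its memory-friendly expanded form, a standard algebraic identity rather than code from the paper.

```python
import numpy as np

def deep_clustering_loss(V, Y):
    """Deep clustering objective ||V V^T - Y Y^T||_F^2.

    V: (N, D) unit-norm embeddings, one per time-frequency bin.
    Y: (N, C) one-hot ideal source assignments per time-frequency bin.
    Expanding the norm lets us work with small D x D, D x C, and C x C
    Gram matrices instead of the huge N x N affinity matrices."""
    return (np.linalg.norm(V.T @ V, 'fro') ** 2
            - 2.0 * np.linalg.norm(V.T @ Y, 'fro') ** 2
            + np.linalg.norm(Y.T @ Y, 'fro') ** 2)
```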
Music Source Separation in the Waveform Domain
TLDR
Demucs is proposed, a new waveform-to-waveform model, which has an architecture closer to models for audio generation with more capacity on the decoder, and human evaluations show that Demucs has significantly higher quality than Conv-Tasnet, but slightly more contamination from other sources, which explains the difference in SDR.
VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking
TLDR
A novel system that separates the voice of a target speaker from multi-speaker signals by making use of a reference signal from the target speaker and training two separate neural networks.
Alternative Objective Functions for Deep Clustering
TLDR
The best proposed method achieves a state-of-the-art 11.5 dB signal-to-distortion ratio result on the publicly available wsj0-2mix dataset, with a much simpler architecture than the previous best approach.
End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction
This paper proposes an end-to-end approach for single-channel speaker-independent multi-speaker speech separation, where time-frequency (T-F) masking, the short-time Fourier transform (STFT), and its inverse are represented as layers within a deep network.
TasNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation
  • Yi Luo, N. Mesgarani
  • Computer Science, Engineering
  • 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
TLDR
Time-domain Audio Separation Network (TasNet) is proposed, which outperforms the current state-of-the-art causal and non-causal speech separation algorithms, reduces the computational cost of speech separation, and significantly reduces the minimum required latency of the output.
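The encoder/mask/decoder structure of time-domain separation can be sketched as follows. Layer sizes and the trivial one-layer masker are illustrative stand-ins, not TasNet's actual separation module: a learned 1-D convolution replaces the STFT, masks are applied in the learned basis, and a transposed convolution resynthesizes waveforms directly, avoiding phase reconstruction.

```python
import torch
import torch.nn as nn

class TasNetSketch(nn.Module):
    """Structural sketch of time-domain encoder/mask/decoder separation."""

    def __init__(self, n_basis=256, kernel=20, n_src=2):
        super().__init__()
        self.n_src = n_src
        # Learned analysis filterbank in place of the STFT.
        self.encoder = nn.Conv1d(1, n_basis, kernel, stride=kernel // 2)
        # Stand-in for the separation module that predicts one mask per source.
        self.masker = nn.Sequential(
            nn.Conv1d(n_basis, n_src * n_basis, 1), nn.Sigmoid())
        # Learned synthesis filterbank back to the waveform.
        self.decoder = nn.ConvTranspose1d(n_basis, 1, kernel, stride=kernel // 2)

    def forward(self, mix):                        # mix: (B, 1, T)
        w = torch.relu(self.encoder(mix))          # (B, n_basis, T')
        masks = self.masker(w).chunk(self.n_src, dim=1)
        return [self.decoder(m * w) for m in masks]  # list of (B, 1, ~T)
```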
The 2018 Signal Separation Evaluation Campaign
TLDR
This year's edition of SiSEC was focused on audio and pursued the effort towards scaling up and making it easier to prototype audio separation software in an era of machine-learning-based systems, and it introduced a new music separation database, MUSDB18.
Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation
  • Yi Luo, Zhuo Chen, T. Yoshioka
  • Computer Science, Engineering
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
TLDR
Experiments show that by replacing the 1-D CNN with DPRNN and applying sample-level modeling in the time-domain audio separation network (TasNet), a new state-of-the-art performance on WSJ0-2mix is achieved with a model 20 times smaller than the previous best system.
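The dual-path idea is to fold a long sequence into short chunks so that no single RNN ever runs over the full length: an intra-chunk RNN models local structure and an inter-chunk RNN models dependencies across chunks. A structural sketch of one block; dimensions and the residual wiring are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualPathBlockSketch(nn.Module):
    """Sketch of one dual-path block over pre-chunked input."""

    def __init__(self, feat=64):
        super().__init__()
        self.intra = nn.GRU(feat, feat, batch_first=True)
        self.inter = nn.GRU(feat, feat, batch_first=True)

    def forward(self, x):          # x: (batch, n_chunks, chunk_len, feat)
        b, k, s, f = x.shape
        # Intra-chunk pass: treat every chunk as an independent sequence.
        h, _ = self.intra(x.reshape(b * k, s, f))
        x = x + h.reshape(b, k, s, f)
        # Inter-chunk pass: sequences formed across chunks, per position.
        y = x.transpose(1, 2).reshape(b * s, k, f)
        h, _ = self.inter(y)
        return x + h.reshape(b, s, k, f).transpose(1, 2)
```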
End-to-end Microphone Permutation and Number Invariant Multi-channel Speech Separation
TLDR
This paper proposes transform-average-concatenate (TAC), a simple design paradigm for channel-permutation- and channel-number-invariant multi-channel speech separation based on the filter-and-sum network, and shows how TAC significantly improves separation performance across various numbers of microphones in noisy, reverberant separation tasks with ad-hoc arrays.
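The TAC block can be sketched as follows: transform each microphone channel independently, average across channels so the result is invariant to channel order and count, then concatenate the average back onto each channel. Layer sizes below are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TACSketch(nn.Module):
    """Sketch of a transform-average-concatenate block."""

    def __init__(self, feat=64, hidden=128):
        super().__init__()
        self.transform = nn.Sequential(nn.Linear(feat, hidden), nn.PReLU())
        self.average = nn.Sequential(nn.Linear(hidden, hidden), nn.PReLU())
        self.concat = nn.Sequential(nn.Linear(2 * hidden, feat), nn.PReLU())

    def forward(self, x):                 # x: (batch, n_mics, time, feat)
        z = self.transform(x)             # per-channel transform
        # Averaging over the mic axis makes the block order/count invariant.
        avg = self.average(z.mean(dim=1, keepdim=True))
        avg = avg.expand_as(z)            # broadcast back to every channel
        return x + self.concat(torch.cat([z, avg], dim=-1))  # residual
```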