Multitalker Speech Separation With Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks

@article{Kolbaek2017MultitalkerSS,
  title={Multitalker Speech Separation With Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks},
  author={Morten Kolb{\ae}k and Dong Yu and Zheng-Hua Tan and Jesper Jensen},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2017},
  volume={25},
  pages={1901-1913}
}
In this paper, we propose the utterance-level permutation invariant training (uPIT) technique. [...] We achieve this using recurrent neural networks (RNNs) that, during training, minimize the utterance-level separation error, hence forcing separated frames belonging to the same speaker to be aligned to the same output stream.
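The utterance-level criterion can be illustrated with a short sketch. The Python snippet below is an assumed, simplified formulation (not the authors' implementation): for every candidate speaker-to-output assignment the MSE is accumulated over the whole utterance, and only the best assignment defines the loss, which ties all frames of a speaker to a single output stream.

```python
# Minimal sketch of the uPIT idea (assumed, simplified; not the authors' code).
import itertools
import numpy as np

def upit_mse_loss(estimates, targets):
    """estimates, targets: lists of (frames, freq_bins) arrays, one per speaker.

    The error of each permutation is accumulated over the WHOLE utterance,
    and the smallest one is returned as the training loss.
    """
    n_speakers = len(targets)
    best = np.inf
    for perm in itertools.permutations(range(n_speakers)):
        err = sum(np.mean((estimates[i] - targets[p]) ** 2)
                  for i, p in enumerate(perm))
        best = min(best, err)
    return best / n_speakers
```

In a real system the estimates would be masked mixture spectrograms produced by a (B)LSTM and the loss would be backpropagated through the network; the sketch only shows the permutation-invariant bookkeeping.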
Single Channel Speech Separation with Constrained Utterance Level Permutation Invariant Training Using Grid LSTM
TLDR: A constrained uPIT (cuPIT) is proposed to solve the label ambiguity problem by computing a weighted MSE loss using dynamic information (i.e., delta and acceleration) to ensure the temporal continuity of output frames belonging to the same speaker.
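As a rough illustration of the weighted-MSE idea summarized above, the sketch below adds error terms on delta and acceleration (double-delta) trajectories to the static spectral error. The simple differencing and the weights are assumptions for illustration, not the cuPIT paper's exact formulation.

```python
import numpy as np

def delta(x):
    # first-order difference along the time (frame) axis
    return np.diff(x, axis=0)

def dynamic_weighted_mse(est, tgt, w_static=1.0, w_delta=0.5, w_accel=0.25):
    """est, tgt: (frames, freq_bins) spectrograms for one output/speaker pair.

    Penalizing delta and acceleration errors encourages temporally smooth,
    speaker-consistent output streams (weights here are illustrative).
    """
    static = np.mean((est - tgt) ** 2)
    d_err = np.mean((delta(est) - delta(tgt)) ** 2)
    a_err = np.mean((delta(delta(est)) - delta(delta(tgt))) ** 2)
    return w_static * static + w_delta * d_err + w_accel * a_err
```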
Single-Channel Multi-talker Speech Recognition with Permutation Invariant Training
TLDR: This paper proposes and evaluates several architectures to address the multi-talker mixed speech recognition problem under the assumption that only a single channel of the mixed signal is available, and elegantly solves the label permutation problem observed in deep learning based multi-talker mixed speech separation and recognition systems.
Joint separation and denoising of noisy multi-talker speech using recurrent neural networks and permutation invariant training
TLDR: Deep bi-directional LSTM RNNs trained using uPIT in noisy environments can achieve large SDR and ESTOI improvements when evaluated on known noise types, and a single model is capable of handling multiple noise types with only a slight decrease in performance.
Separating Long-Form Speech with Group-Wise Permutation Invariant Training
TLDR: A novel training scheme named Group-PIT is proposed, which allows direct training of speech separation models on long-form speech with a low computational cost for label assignment; experiments demonstrate the effectiveness of the proposed approaches, especially in dealing with very long speech inputs.
Furcax: End-to-end Monaural Speech Separation Based on Deep Gated (De)convolutional Neural Networks with Adversarial Example Training
TLDR: This paper presents a simple and effective integrated end-to-end approach to monaural speech separation called FurcaX, which consists of deep gated (de)convolutional neural networks (GCNN) that take the mixed utterance of two speakers and map it to two separated utterances, where each utterance contains only one speaker's voice.
Directed Speech Separation for Automatic Speech Recognition of Long Form Conversational Speech
TLDR: A speaker-conditioned separator trained on speaker embeddings extracted directly from the mixed signal is proposed, which achieves significant improvements in word error rate (WER) on real conversational data without the need for an additional restitching step.
Adaptive Permutation Invariant Training with Auxiliary Information for Monaural Multi-Talker Speech Recognition
  • Xuankai Chang, Y. Qian, Dong Yu
  • Computer Science
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
TLDR: This paper proposes to adapt PIT models with auxiliary features such as pitch and i-vectors, and to exploit gender information with multi-task learning that jointly optimizes speech recognition and speaker-pair prediction.
Discriminative Learning for Monaural Speech Separation Using Deep Embedding Features
TLDR: The proposed models achieve better performance than DC and uPIT for speaker-independent speech separation; a discriminative objective that maximizes the distance between permutations is applied to fine-tune the whole model.
A CASA Approach to Deep Learning Based Speaker-Independent Co-Channel Speech Separation
  • Yuzhou Liu, Deliang Wang
  • Computer Science
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
TLDR: The proposed CASA approach takes advantage of permutation invariant training (PIT) and deep clustering (DC) but overcomes their shortcomings, and experiments show that the proposed system improves over the best reported results of PIT and DC.
Utterance-level Permutation Invariant Training with Latency-controlled BLSTM for Single-channel Multi-talker Speech Separation
TLDR: Latency-controlled BLSTM (LC-BLSTM) is used during inference to achieve low-latency, high-performance speech separation, and it is found that inter-chunk speaker tracing (ST) can further improve the separation performance of uPIT-LC-BLSTM.

References

Showing 1-10 of 53 references.
Permutation invariant training of deep models for speaker-independent multi-talker speech separation
TLDR: This work proposes a novel deep learning training criterion, named permutation invariant training (PIT), for speaker-independent multi-talker speech separation, and finds that it compares favorably to non-negative matrix factorization (NMF), computational auditory scene analysis (CASA), and DPCL, and generalizes well to unseen speakers and languages.
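For contrast with uPIT, below is a simplified sketch of the original PIT criterion (an assumed formulation for illustration; the original PIT operates on short segments or meta-frames rather than single frames). The permutation is re-chosen independently at each step, which is what allows the speaker-to-output assignment to drift across an utterance and motivates the utterance-level variant.

```python
import itertools
import numpy as np

def framewise_pit_mse_loss(estimates, targets):
    """estimates, targets: arrays of shape (speakers, frames, freq_bins).

    The best permutation is selected separately for every frame, unlike uPIT,
    which fixes a single permutation for the whole utterance.
    """
    n_speakers, n_frames, _ = targets.shape
    total = 0.0
    for t in range(n_frames):
        best = min(
            sum(np.mean((estimates[i, t] - targets[p, t]) ** 2)
                for i, p in enumerate(perm))
            for perm in itertools.permutations(range(n_speakers))
        )
        total += best / n_speakers
    return total / n_frames
```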
A Deep Ensemble Learning Method for Monaural Speech Separation
TLDR: A deep ensemble method, named multicontext networks, is proposed to address monaural speech separation and it is found that predicting the ideal time-frequency mask is more efficient in utilizing clean training speech, while predicting clean speech is less sensitive to SNR variations.
Deep Neural Networks for Single-Channel Multi-Talker Speech Recognition
TLDR: This work investigates techniques based on deep neural networks for attacking the single-channel multi-talker speech recognition problem and demonstrates that the proposed DNN-based system has remarkable noise robustness to the interference of a competing speaker.
Recurrent deep stacking networks for supervised speech separation
  • Zhongqiu Wang, Deliang Wang
  • Computer Science
    2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2017
TLDR: This study proposes a novel recurrent deep stacking approach for time-frequency masking based speech separation, where the output context is explicitly employed to improve the accuracy of mask estimation.
Joint Optimization of Masks and Deep Recurrent Neural Networks for Monaural Source Separation
TLDR: This work explores joint optimization of masking functions and deep recurrent neural networks for monaural source separation tasks, including speech separation, singing voice separation, and speech denoising, as well as a discriminative criterion for training the networks to further enhance separation performance.
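The mask-plus-network joint optimization can be pictured with the sketch below, which follows a common ratio-mask formulation rather than the paper's exact equations: the two network outputs are normalized into soft masks that sum to one and are applied to the mixture spectrogram, so masking and the recurrent network are trained end to end.

```python
import numpy as np

def soft_masks(out1, out2, eps=1e-8):
    """out1, out2: network outputs of shape (frames, freq_bins)."""
    denom = np.abs(out1) + np.abs(out2) + eps
    return np.abs(out1) / denom, np.abs(out2) / denom

def masked_estimates(mixture_mag, out1, out2):
    # estimated source magnitudes obtained by masking the mixture
    m1, m2 = soft_masks(out1, out2)
    return m1 * mixture_mag, m2 * mixture_mag
```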
Single-Channel Multi-Speaker Separation Using Deep Clustering
TLDR: This paper significantly improves upon the baseline system performance by incorporating better regularization, larger temporal context, and a deeper architecture, culminating in an overall improvement in signal-to-distortion ratio (SDR) of 10.3 dB compared to the baseline, and produces unprecedented performance on a challenging speech separation task.
Super-human multi-talker speech recognition: the IBM 2006 speech separation challenge system
TLDR: A system for model-based speech separation which achieves super-human recognition performance when two talkers speak at similar levels and incorporates a novel method for performing two-talker speaker identification and gain estimation is described.
Deep neural network based speech separation for robust speech recognition
TLDR: Experimental results on a monaural speech separation and recognition challenge task show that the proposed DNN framework enhances the separation performance in terms of different objective measures under the semi-supervised mode.
Deep Recurrent Networks for Separation and Recognition of Single-Channel Speech in Nonstationary Background Audio
TLDR: This work compares the performance of deep computational architectures with conventional statistical techniques as well as variants of nonnegative matrix factorization, and establishes that one can achieve impressively superior results with deep-learning-based techniques on this problem.
Achieving Human Parity in Conversational Speech Recognition
TLDR: The human error rate on the widely used NIST 2000 test set is measured, and the latest automated speech recognition system is found to have reached human parity, edging past the human benchmark and establishing a new state of the art.