• Corpus ID: 27320286

Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks.

@article{Kolbaek2017MultitalkerSS,
  title={Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks.},
  author={Morten Kolb{\ae}k and Dong Yu and Zheng-Hua Tan and Jesper H{\o}jvang Jensen},
  journal={arXiv: Sound},
  year={2017}
}
In this paper we propose the utterance-level Permutation Invariant Training (uPIT) technique. We achieve this using Recurrent Neural Networks (RNNs) that, during training, minimize the utterance-level separation error, hence forcing separated frames belonging to the same speaker to be aligned to the same output stream.
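
A minimal sketch of the uPIT idea follows, not the paper's implementation: the loss is the minimum utterance-level MSE over all permutations of the output streams, so a single permutation is fixed for the whole utterance. Array shapes, names, and the toy usage are illustrative assumptions.

from itertools import permutations

import numpy as np

def upit_loss(estimates, targets):
    """Minimum utterance-level MSE over all speaker permutations.

    estimates, targets: arrays of shape (num_speakers, frames, freq_bins).
    Unlike frame-level PIT, one permutation is chosen for the whole
    utterance, keeping each output stream aligned to one speaker.
    """
    num_speakers = estimates.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in permutations(range(num_speakers)):
        loss = np.mean((estimates[list(perm)] - targets) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

# Toy usage: two speakers, 100 frames, 129 frequency bins.
rng = np.random.default_rng(0)
tgt = rng.standard_normal((2, 100, 129))
est = tgt[::-1]  # output streams swapped relative to the targets
loss, perm = upit_loss(est, tgt)
print(loss, perm)  # loss == 0.0 with the swapping permutation (1, 0)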

Single-Channel Multi-talker Speech Recognition with Permutation Invariant Training

Joint separation and denoising of noisy multi-talker speech using recurrent neural networks and permutation invariant training

This work shows that deep bi-directional LSTM RNNs trained using uPIT in noisy environments can achieve large SDR and ESTOI improvements when evaluated on known noise types, and that a single model can handle multiple noise types with only a slight decrease in performance.

Alleviate Cross-chunk Permutation through Chunk-level Speaker Embedding for Blind Speech Separation

This study focuses on using the speaker labels as the auxiliary supervision information to train a deep model to map the T-F embeddings of one cluster to one chunk-level speaker embedding (CL-SE).
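
As a rough illustration of the CL-SE idea (the mean-pooling choice and the shapes are assumptions, not the paper's exact recipe), one chunk-level embedding per cluster can be formed by pooling the T-F embeddings assigned to that cluster:

import numpy as np

def chunk_level_embeddings(tf_emb, assign, num_clusters):
    """tf_emb: (num_tf_bins, emb_dim) embeddings for one chunk;
    assign: (num_tf_bins,) cluster index per T-F bin."""
    cl_se = np.zeros((num_clusters, tf_emb.shape[1]))
    for c in range(num_clusters):
        members = tf_emb[assign == c]
        if len(members):
            v = members.mean(axis=0)
            cl_se[c] = v / (np.linalg.norm(v) + 1e-8)  # unit-normalize
    return cl_se  # one speaker embedding per cluster for this chunk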

Multi-microphone Complex Spectral Mapping for Utterance-wise and Continuous Speech Separation

This study first investigates offline utterance-wise speaker separation and then extends to block-online continuous speech separation, and integrates multi-microphone complex spectral mapping with minimum variance distortionless response (MVDR) beamforming and post-filtering to further improve separation.
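
For reference, the MVDR beamformer in such pipelines has a closed form per frequency, w = Phi_n^{-1} d / (d^H Phi_n^{-1} d), with Phi_n the noise spatial covariance and d the target steering vector. A minimal numpy sketch, leaving out the network that estimates these statistics:

import numpy as np

def mvdr_weights(phi_noise, steering):
    """phi_noise: (freq, mics, mics); steering: (freq, mics) -> (freq, mics)."""
    num_freq, num_mics, _ = phi_noise.shape
    w = np.zeros((num_freq, num_mics), dtype=complex)
    for f in range(num_freq):
        # Small diagonal loading keeps the inverse well-conditioned.
        inv = np.linalg.inv(phi_noise[f] + 1e-6 * np.eye(num_mics))
        num = inv @ steering[f]
        w[f] = num / (steering[f].conj() @ num)
    return w

def beamform(w, mixture_stft):
    """mixture_stft: (freq, frames, mics) -> (freq, frames) enhanced STFT."""
    return np.einsum("fm,ftm->ft", w.conj(), mixture_stft)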

Multi-channel Speech Separation Using Deep Embedding Model with Multilayer Bootstrap Networks

A variant of DPCL, named DPCL++, is proposed by applying a recent unsupervised deep learning method, multilayer bootstrap networks (MBN), to further reduce the noise and small variations of the embedding vectors in an unsupervised way at test time, which helps k-means produce a good result.
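
For context, the DPCL inference step that MBN feeds into is k-means on the learned T-F embeddings; a bare-bones sketch (the MBN denoising of the embeddings is omitted here):

import numpy as np
from sklearn.cluster import KMeans

def embeddings_to_masks(emb, num_speakers, spec_shape):
    """emb: (frames * freq_bins, emb_dim); spec_shape: (frames, freq_bins)."""
    labels = KMeans(n_clusters=num_speakers, n_init=10).fit_predict(emb)
    # One binary mask per speaker, cut out of the mixture spectrogram.
    return np.stack([(labels == s).reshape(spec_shape).astype(float)
                     for s in range(num_speakers)])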

Multi-channel Speech Separation Using Deep Embedding With Multilayer Bootstrap Networks

A variant of DPCL, named MDPCL, is proposed by applying a recent unsupervised deep learning method, multilayer bootstrap networks (MBN), to further reduce the noise and small variations of the embedding vectors in an unsupervised way at test time, which helps k-means produce a good result.

Speech Enhancement Using Self-Adaptation and Multi-Head Self-Attention

This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features: it extracts a speaker representation for adaptation directly from the test utterance, applies multi-task learning of speech enhancement and speaker identification, and uses the output of the final hidden layer of the speaker-identification branch as an auxiliary feature.

Count And Separate: Incorporating Speaker Counting For Continuous Speaker Separation

  • Zhong-Qiu Wang, DeLiang Wang
  • Computer Science
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
This study leverages frame-wise speaker counting to switch between speech enhancement and speaker separation for continuous speaker separation, stitching the results from the enhancement and separation models based on their predictions in a small augmented window of frames surrounding each overlapped segment.
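
A heavily simplified sketch of such counting-based switching, with the paper's augmented-window stitching reduced to a per-frame choice (the models, shapes, and routing below are assumptions):

import numpy as np

def stitch(counts, enhanced, separated):
    """counts: (frames,) predicted number of active speakers per frame;
    enhanced: (frames, freq); separated: (num_spk, frames, freq)."""
    out = separated.copy()
    single = counts <= 1
    # In single-speaker regions, route the enhancement output to stream 0
    # and silence the remaining streams.
    out[:, single] = 0.0
    out[0, single] = enhanced[single]
    return out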

Multi-talker ASR for an unknown number of sources: Joint training of source counting, separation and ASR

This work extends an iterative speech extraction system with mechanisms to count the number of sources, and combines it with a single-talker speech recognizer to form the first end-to-end multi-talker automatic speech recognition system for an unknown number of active speakers.

Listening to Each Speaker One by One with Recurrent Selective Hearing Networks

This paper casts the source separation problem as a recursive multi-pass source extraction problem based on a recurrent neural network (RNN) that can learn and determine how many computational steps/iterations have to be performed depending on the input signals.
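
Schematically, the recursive extraction loop looks like the following, with extract_one standing in for one pass of the trained RNN and a residual-energy threshold assumed as the stopping rule:

import numpy as np

def extract_all(mixture, extract_one, max_sources=4, stop_energy=1e-3):
    sources, residual = [], mixture
    for _ in range(max_sources):
        source, residual = extract_one(residual)  # one extraction pass
        sources.append(source)
        if np.mean(residual ** 2) < stop_energy:  # nothing left to extract
            break
    return sources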
...

References

Showing 1-10 of 59 references.

Permutation invariant training of deep models for speaker-independent multi-talker speech separation

This work proposes a novel deep learning training criterion, named permutation invariant training (PIT), for speaker-independent multi-talker speech separation, and finds that it compares favorably to non-negative matrix factorization (NMF), computational auditory scene analysis (CASA), and DPCL, and generalizes well over unseen speakers and languages.
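
For contrast with the uPIT sketch near the top of this page, frame-level PIT picks the best permutation independently at every frame, which is what can leave adjacent frames of one speaker in different output streams. Same assumed shapes as before:

from itertools import permutations

import numpy as np

def pit_loss_framewise(estimates, targets):
    """estimates, targets: (num_speakers, frames, freq_bins)."""
    num_speakers, num_frames, _ = estimates.shape
    total = 0.0
    for t in range(num_frames):
        # Best assignment is re-decided at every frame independently.
        total += min(
            np.mean((estimates[list(perm), t] - targets[:, t]) ** 2)
            for perm in permutations(range(num_speakers)))
    return total / num_frames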

A Deep Ensemble Learning Method for Monaural Speech Separation

A deep ensemble method, named multi-context networks, is proposed to address monaural speech separation, and it is found that predicting the ideal time-frequency mask is more efficient in utilizing clean training speech, while predicting clean speech is less sensitive to SNR variations.
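
The two training targets compared there can be written in a few lines; with magnitude spectrograms of shape (frames, freq_bins), a common definition of the ideal ratio mask (IRM), assumed here, is the per-bin ratio of speech energy to total energy:

import numpy as np

def ideal_ratio_mask(clean_mag, noise_mag):
    # IRM in [0, 1]: speech energy over total energy per T-F bin.
    return clean_mag**2 / (clean_mag**2 + noise_mag**2 + 1e-8)

def apply_mask(mixture_mag, mask):
    return mask * mixture_mag  # the masking-based estimate of clean speech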

Deep Neural Networks for Single-Channel Multi-Talker Speech Recognition

This work investigates techniques based on deep neural networks for attacking the single-channel multi-talker speech recognition problem and demonstrates that the proposed DNN-based system has remarkable noise robustness to the interference of a competing speaker.

Recurrent deep stacking networks for supervised speech separation

  • Zhong-Qiu Wang, DeLiang Wang
  • Computer Science
    2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2017
This study proposes a novel recurrent deep stacking approach for time-frequency masking based speech separation, where the output context is explicitly employed to improve the accuracy of mask estimation.

Joint Optimization of Masks and Deep Recurrent Neural Networks for Monaural Source Separation

This work explores the joint optimization of masking functions and deep recurrent neural networks for monaural source separation tasks, including speech separation, singing voice separation, and speech denoising, along with a discriminative training criterion that further enhances separation performance.
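
A small sketch of such a jointly optimized masking layer (the soft-mask normalization below is one common choice, assumed here): the network emits raw magnitude estimates, and a deterministic mask layer renormalizes them against the mixture before the loss is computed, so the mask and the network train jointly:

import numpy as np

def soft_mask_outputs(raw_estimates, mixture_mag):
    """raw_estimates: (num_sources, frames, freq); mixture_mag: (frames, freq)."""
    mags = np.abs(raw_estimates)
    masks = mags / (mags.sum(axis=0, keepdims=True) + 1e-8)  # sum to one
    return masks * mixture_mag  # masked source estimates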

Super-human multi-talker speech recognition: the IBM 2006 speech separation challenge system

A system for model based speech separation which achieves super-human recognition performance when two talkers speak at similar levels and incorporates a novel method for performing two-talker speaker identification and gain estimation is described.

Deep neural network based speech separation for robust speech recognition

Experimental results on a monaural speech separation and recognition challenge task show that the proposed DNN framework enhances the separation performance in terms of different objective measures under the semi-supervised mode.

Speech separation of a target speaker based on deep neural networks

Experimental results demonstrate that the proposed framework achieves better separation results than a GMM-based approach in the supervised mode, and that in the semi-supervised mode, believed to be the preferred mode in real-world operation, the DNN-based approach even outperforms the supervised GMM-based approach.

Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition

This work presents a pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output, and shows that it can significantly outperform conventional context-dependent Gaussian mixture model (GMM-HMM) systems.
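
For context, hybrid decoding turns the DNN's senone posteriors p(s|x) into scaled likelihoods p(x|s) ∝ p(s|x) / p(s), where the senone priors p(s) are counted from the training alignments; in the log domain this is a single subtraction:

import numpy as np

def scaled_log_likelihoods(log_posteriors, log_priors):
    """log_posteriors: (frames, senones); log_priors: (senones,)."""
    return log_posteriors - log_priors  # log p(x|s) up to a constant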

Source–Filter-Based Single-Channel Speech Separation Using Pitch Information

A linear relationship between pitch-tracking performance and speech separation performance is shown, and the final combination of the source and filter models yields an utterance-dependent model that enables speaker-independent source separation.
...