Multi-talker ASR for an unknown number of sources: Joint training of source counting, separation and ASR

@inproceedings{vonNeumann2020MultitalkerAF,
  title={Multi-talker ASR for an unknown number of sources: Joint training of source counting, separation and ASR},
  author={Thilo von Neumann and Christoph Boeddeker and Lukas Drude and Keisuke Kinoshita and Marc Delcroix and Tomohiro Nakatani and Reinhold Haeb-Umbach},
  booktitle={INTERSPEECH},
  year={2020}
}
Most approaches to multi-talker overlapped speech separation and recognition assume that the number of simultaneously active speakers is given, but in realistic situations, it is typically unknown. To cope with this, we extend an iterative speech extraction system with mechanisms to count the number of sources and combine it with a single-talker speech recognizer to form the first end-to-end multi-talker automatic speech recognition system for an unknown number of active speakers. Our… 
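The extend-and-count loop described in the abstract can be sketched as an iterative extraction that stops once a counting signal fires. This is only an illustrative sketch: `extract_step`, the stop probability, and the threshold are hypothetical stand-ins for the paper's trained extraction network and source-counting mechanism.

```python
def extract_until_stop(mixture, extract_step, stop_threshold=0.5, max_iters=5):
    """Iterative source extraction with source counting (sketch).

    extract_step(residual) -> (source, new_residual, p_stop) is a
    hypothetical model call returning one estimated source, the remaining
    mixture, and a stop probability. Extraction ends when p_stop crosses
    the threshold, so the number of speakers need not be known in advance.
    """
    sources = []
    residual = mixture
    for _ in range(max_iters):
        source, residual, p_stop = extract_step(residual)
        sources.append(source)
        if p_stop >= stop_threshold:
            break
    return sources

# Toy stand-in: a "mixture" is a list of speaker labels; each step peels
# one off and reports stop=1.0 when the last speaker is being extracted.
step = lambda mix: (mix[0], mix[1:], 0.0 if len(mix) > 1 else 1.0)
print(extract_until_stop(["spk_a", "spk_b"], step))  # ['spk_a', 'spk_b']
```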
Exploring End-to-End Multi-Channel ASR with Bias Information for Meeting Transcription
TLDR
This work investigates the joint modeling of a mask-based beamformer and Attention-Encoder-Decoder-based ASR and proposes an effective location bias integration method called deep concatenation for the beamformer network, which achieves a substantial word error rate reduction.
Single Channel Voice Separation for Unknown Number of Speakers Under Reverberant and Noisy Settings
TLDR
A unified network for voice separation of an unknown number of speakers is presented and it is suggested that the proposed approach is superior to the baseline model by a significant margin.
Dual-Path Modeling for Long Recording Speech Separation in Meetings
  • Chenda Li, Zhuo Chen, +6 authors Y. Qian
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
TLDR
A transformer-based dual-path system is proposed that integrates transformer layers for global modeling and reduces the amount of computation by 30% while achieving a better WER; online dual-path models are also investigated and show a 10% relative WER reduction compared to the baseline.
ESPnet-SE: End-To-End Speech Enhancement and Separation Toolkit Designed for ASR Integration
TLDR
The design of the toolkit, several important functionalities, especially the speech recognition integration, which differentiates ESPnet-SE from other open source toolkits, and experimental results with major benchmark datasets are described.
Speaker Attractor Network: Generalizing Speech Separation to Unseen Numbers of Sources
TLDR
Experimental results show that the proposed method significantly improves separation performance when generalizing to an unseen number of speakers, and can separate up to five speakers even when the model is trained only on two-speaker mixtures.
Improving RNN Transducer with Target Speaker Extraction and Neural Uncertainty Estimation
TLDR
This work presents a joint framework that combines time-domain target-speaker speech extraction and Recurrent Neural Network Transducer and proposes a multi-stage training strategy that pre-trains and fine-tunes each module in the system before joint-training.
Investigation of Practical Aspects of Single Channel Speech Separation for ASR
TLDR
This paper investigates a two-stage training scheme that firstly applies a feature level optimization criterion for pretraining, followed by an ASR-oriented optimization criterion using an end-to-end (E2E) speech recognition model, and introduces a modified teacher-student learning technique for model compression.
Multi-microphone Complex Spectral Mapping for Utterance-wise and Continuous Speech Separation
TLDR
This study first investigates offline utterance-wise speaker separation and then extends to block-online continuous speech separation, and integrates multi-microphone complex spectral mapping with minimum variance distortionless response (MVDR) beamforming and post-filtering to further improve separation.
Multi-turn RNN-T for streaming recognition of multi-party speech
TLDR
This work proposes a novel multi-turn RNN-T (MT-RNN-T) model with an overlap-based target arrangement strategy that generalizes to an arbitrary number of speakers without changes to the model architecture, and addresses several challenges in previous work on the multi-speaker recurrent neural network transducer (MS-RNN-T).
References

Showing 1-10 of 33 references
Recursive speech separation for unknown number of speakers
TLDR
This work proposes a method of single-channel speaker-independent multi-speaker speech separation for an unknown number of speakers and proposes one-and-rest permutation invariant training (OR-PIT), which can be applied to cases with different numbers of speakers using a single model by recursively separating a speaker.
Analysis of Deep Clustering as Preprocessing for Automatic Speech Recognition of Sparsely Overlapping Speech
TLDR
This paper investigates obstacles to applying DPCL as a preprocessing method for ASR in such a scenario of sparsely overlapping speech, and presents a data simulation approach, closely related to the wsj0-2mix dataset, that generates sparsely overlapping speech datasets of arbitrary overlap ratio.
End-to-end Monaural Multi-speaker ASR System without Pretraining
TLDR
The experiments demonstrate that the proposed methods improve the end-to-end model's ability to separate the overlapping speech and recognize the separated streams, leading to roughly 10.0% relative performance gains in terms of CER and WER.
Single-Channel Multi-talker Speech Recognition with Permutation Invariant Training
TLDR
This paper proposes and evaluates several architectures to address the multi-talker mixed speech recognition problem under the assumption that only a single channel of the mixed signal is available, and elegantly solves the label permutation problem observed in deep-learning-based multi-talker mixed speech separation and recognition systems.
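The permutation invariant training (PIT) idea behind this and several of the works above can be sketched in a few lines: evaluate the loss under every output-to-target assignment and train on the minimum, which resolves the label-permutation ambiguity. The `pairwise_loss` callback and the toy "signals" below are illustrative stand-ins, not any paper's actual objective.

```python
from itertools import permutations

def pit_loss(outputs, targets, pairwise_loss):
    """Permutation-invariant training loss (sketch).

    outputs, targets: lists of per-speaker signals or label sequences.
    pairwise_loss: loss between one output and one target.
    Returns the minimum total loss over all S! output-target assignments.
    """
    best = float("inf")
    for perm in permutations(range(len(targets))):
        total = sum(pairwise_loss(outputs[i], targets[p])
                    for i, p in enumerate(perm))
        best = min(best, total)
    return best

# Toy example: mean squared error between short "signals".
mse = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
outs = [[1.0, 1.0], [0.0, 0.0]]
refs = [[0.0, 0.0], [1.0, 1.0]]   # same sources, presented in swapped order
print(pit_loss(outs, refs, mse))  # 0.0: the swapped assignment matches exactly
```

The brute-force minimum over S! permutations is why PIT becomes expensive for many speakers, which motivates the O(S) arrangement trick discussed under Serialized Output Training below in the list.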
A Purely End-to-End System for Multi-speaker Speech Recognition
TLDR
Experimental results show that the model is directly able to learn a mapping from a speech mixture to multiple label sequences, achieving 83.1% relative improvement compared to a model trained without the proposed objective.
End-to-End Training of Time Domain Audio Separation and Recognition
TLDR
This work demonstrates how to combine a separation module based on a Convolutional Time domain Audio Separation Network (Conv-TasNet) with an E2E speech recognizer and how to train such a model jointly by distributing it over multiple GPUs or by approximating truncated back-propagation for the convolutional front-end.
Recognizing Multi-talker Speech with Permutation Invariant Training
TLDR
A novel technique is proposed for direct recognition of multiple speech streams given a single channel of mixed speech, without first separating them, based on permutation invariant training (PIT) for automatic speech recognition (ASR).
Improving End-to-End Single-Channel Multi-Talker Speech Recognition
TLDR
An enhanced end-to-end monaural multi-talker ASR architecture and training strategy are proposed to recognize overlapped speech, and it is demonstrated that the proposed architectures can significantly improve multi-talker mixed speech recognition.
Serialized Output Training for End-to-End Overlapped Speech Recognition
TLDR
Experimental results on the LibriSpeech corpus show that SOT models can transcribe overlapped speech with variable numbers of speakers significantly better than PIT-based models, aided by a simple trick that allows SOT to be executed in O(S), where S is the number of speakers in the training sample.
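A minimal sketch of how serialized output training builds a single target sequence: the speakers' transcriptions are ordered by start time ("first-in, first-out") and joined with a speaker-change token, so one decoder handles a variable number of speakers without minimizing over all S! permutations. The `<sc>` token string and the dict layout below are assumptions for illustration.

```python
def serialize_targets(utterances, sep_token="<sc>"):
    """Build an SOT-style training target (sketch).

    utterances: list of {"start": float, "text": str} dicts, one per speaker.
    Sorting by start time gives the O(S) first-in-first-out arrangement;
    the separator marks speaker changes in the single output sequence.
    """
    ordered = sorted(utterances, key=lambda u: u["start"])
    return f" {sep_token} ".join(u["text"] for u in ordered)

utts = [{"start": 1.2, "text": "how are you"},
        {"start": 0.3, "text": "hello there"}]
print(serialize_targets(utts))  # hello there <sc> how are you
```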
End-to-End Multi-Speaker Speech Recognition
TLDR
This work develops the first fully end-to-end, jointly trained deep learning system for separation and recognition of overlapping speech signals that enables training on more realistic data that contains only mixed signals and their transcriptions, and thus is suited to large scale training on existing transcribed data.