An End-to-end Architecture of Online Multi-channel Speech Separation

Jian Wu, Zhuo Chen, Jinyu Li, Takuya Yoshioka, Zhili Tan, Ed Lin, Yi Luo, Lei Xie
Multi-speaker speech recognition has been one of the key challenges in conversation transcription as it breaks the single active speaker assumption employed by most state-of-the-art speech recognition systems. Speech separation is considered as a remedy to this problem. Previously, we introduced a system, called unmixing, fixed-beamformer and extraction (UFE), that was shown to be effective in addressing the speech overlap problem in conversation transcription. With UFE, an input mixed signal is…


Investigation of Practical Aspects of Single Channel Speech Separation for ASR
This paper investigates a two-stage training scheme that first applies a feature-level optimization criterion for pretraining, followed by an ASR-oriented optimization criterion using an end-to-end (E2E) speech recognition model, and introduces a modified teacher-student learning technique for model compression.
DESNet: A Multi-Channel Network for Simultaneous Speech Dereverberation, Enhancement and Separation
Experiments show that in the non-dereverberated case, the proposed DESNet outperforms DCCRN and most state-of-the-art structures in speech enhancement and separation, while in the dereverberation scenario, DESNet also shows improvements over the cascaded WPE-DCCRN networks.
Online Self-Attentive Gated RNNs for Real-Time Speaker Separation
This study converts a non-causal state-of-the-art separation model into a causal and real-time model and evaluates its performance under both online and offline settings, shedding light on the relative difference between causal and non-causal models when performing separation.


End-to-end Monaural Multi-speaker ASR System without Pretraining
The experiments demonstrate that the proposed methods can improve the performance of the end-to-end model in separating the overlapping speech and recognizing the separated streams, and lead to ∼10.0% relative performance gains in terms of CER and WER respectively.
End-to-End SpeakerBeam for Single Channel Target Speech Recognition
SpeakerBeam has been proposed as an alternative to speech separation to mitigate the global permutation ambiguity, and interesting properties of the proposed system in terms of speech enhancement and diarization ability are discussed.
Low-latency Speaker-independent Continuous Speech Separation
A low-latency SI-CSS method whose performance is comparable to that of the previous method in a microphone array-based meeting transcription task is proposed by using a new speech separation network architecture combined with a double buffering scheme and by performing enhancement with a set of fixed beamformers followed by a neural post-filter.
Frequency Domain Multi-channel Acoustic Modeling for Distant Speech Recognition
New acoustic modeling techniques that optimize spatial filtering and long short-term memory (LSTM) layers from multi-channel (MC) input based on an ASR criterion directly are developed and incorporated into the acoustic model.
Extract, Adapt and Recognize: An End-to-End Neural Network for Corrupted Monaural Speech Recognition
An end-to-end neural network that allows fully learnable separation and recognition components towards optimizing the ASR criterion is presented, in between a state-of-the-art speech separation module as an extractor and an acoustic modeling module as a recognizer.
Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation
  • Yi Luo, N. Mesgarani
  • Computer Science, Medicine
  • IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2019
A fully convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation, which significantly outperforms previous time–frequency masking methods in separating two- and three-speaker mixtures.
Recognizing Overlapped Speech in Meetings: A Multichannel Separation Approach Using Neural Networks
This report is the first report that applies overlapped speech recognition to unconstrained real meeting audio and outperforms a system based on a state-of-the-art neural mask-based beamformer by 10.8%.
Advances in Online Audio-Visual Meeting Transcription
A system that generates speaker-annotated transcripts of meetings by using a microphone array and a 360-degree camera and an online audio-visual speaker diarization method that leverages face tracking and identification, sound source localization, speaker identification and, if available, prior speaker information for robustness to various real world challenges is described.
Multi-Channel Overlapped Speech Recognition with Location Guided Speech Extraction Network
This work proposes a simple yet effective method for multi-channel far-field overlapped speech recognition that achieves a relative word error rate (WER) reduction of more than 24% over fixed beamforming with oracle selection.
Speech Separation Using Speaker Inventory
A novel method called speech separation using speaker inventory (SSUSI), which combines the advantages of both approaches and thus solves their problems, and outperforms permutation invariant training based blind speech separation and improves the word error rate (WER).