Continuous Speech Separation with Conformer

  • Sanyuan Chen, Yu Wu, Zhuo Chen, Jinyu Li, Chengyi Wang, Shujie Liu, M. Zhou
  • Published 13 August 2020
  • Computer Science
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Continuous speech separation was recently proposed to deal with the overlapped speech in natural conversations. While it was shown to significantly improve the speech recognition performance for multichannel conversation transcription, its effectiveness has yet to be proven for a single-channel recording scenario. This paper examines the use of Conformer architecture in lieu of recurrent neural networks for the separation model. Conformer allows the separation model to efficiently capture both… 


Multi-microphone Complex Spectral Mapping for Utterance-wise and Continuous Speech Separation

This study first investigates offline utterance-wise speaker separation and then extends to block-online continuous speech separation, and integrates multi-microphone complex spectral mapping with minimum variance distortionless response (MVDR) beamforming and post-filtering to further improve separation.

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

A new pre-trained model, WavLM, to solve full-stack downstream speech tasks, which achieves state-of-the-art performance on the SUPERB benchmark, and brings improvements for various speech processing tasks on their representative benchmarks.

Investigation of Practical Aspects of Single Channel Speech Separation for ASR

This paper investigates a two-stage training scheme that applies a feature-level optimization criterion for pre-training, followed by an ASR-oriented optimization criterion using an end-to-end (E2E) speech recognition model, and introduces a modified teacher-student learning technique for model compression to keep the model lightweight.

Streaming Multi-Talker ASR with Token-Level Serialized Output Training

For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost, opening the door for deploying one model for both single- and multi-talker scenarios.

WA-Transformer: Window Attention-based Transformer with Two-stage Strategy for Multi-task Audio Source Separation

The standard Conformer adopts convolution layers to exploit local features. However, the one-dimensional convolution ignores the correlation of adjacent time-frequency features. In this paper, we

An Initialization Scheme for Meeting Separation with Spatial Mixture Models

Spatial mixture model (SMM) supported acoustic beamforming has been extensively used for the separation of simultaneously active speakers. However, it has hardly been considered for the separation of

Continuous Speech Separation with Recurrent Selective Attention Network

  • Yixuan Zhang, Zhuo Chen, Jinyu Li
  • Computer Science
  • ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2022
Experimental results on the LibriCSS dataset show that the RSAN-based CSS (RSAN-CSS) network consistently improves the speech recognition accuracy over PIT-based models, and a novel block-wise dependency modeling further boosts the performance of RSAN-CSS.

Independence-based Joint Dereverberation and Separation with Neural Source Model

A neural network is introduced in the framework of time-decorrelation iterative source steering, which is an extension of independent vector analysis to joint dereverberation and separation, and greatly reduces the WER on the recorded dataset LibriCSS.

Ultra Fast Speech Separation Model with Teacher Student Learning

An ultra fast speech separation Transformer model is proposed to achieve both better performance and efficiency with teacher-student learning (T-S learning). Objective shifting mechanisms are introduced to guide the small student model to learn intermediate representations from the large teacher model.

CMGAN: Conformer-Based Metric-GAN for Monaural Speech Enhancement

Objective and subjective evaluations illustrate that CMGAN is able to show superior performance compared to state-of-the-art methods in three speech enhancement tasks (denoising, dereverberation and super-resolution), i.e., PESQ of 3.41 and SSNR of 11.10 dB.



Speech Separation Using Speaker Inventory

A novel method called speech separation using speaker inventory (SSUSI) combines the advantages of speech extraction and blind speech separation and thus solves their problems; it outperforms permutation invariant training based blind speech separation and improves the word error rate (WER).

Multitalker Speech Separation With Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks

In this paper, we propose the utterance-level permutation invariant training (uPIT) technique. uPIT is a practically applicable, end-to-end, deep-learning-based solution for speaker independent

Continuous Speech Separation: Dataset and Analysis

  • Zhuo Chen, T. Yoshioka, Jinyu Li
  • Computer Science
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
A new real recording dataset, called LibriCSS, is derived from LibriSpeech by concatenating the corpus utterances to simulate conversations and capturing the audio replays with far-field microphones, which helps researchers develop systems that can be readily applied to real scenarios.

Low-latency Speaker-independent Continuous Speech Separation

A low-latency SI-CSS method whose performance is comparable to that of the previous method in a microphone array-based meeting transcription task is proposed by using a new speech separation network architecture combined with a double buffering scheme and by performing enhancement with a set of fixed beamformers followed by a neural post-filter.

Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation

  • Yi Luo, N. Mesgarani
  • Computer Science
  • IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2019
A fully convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation, which significantly outperforms previous time–frequency masking methods in separating two- and three-speaker mixtures.

CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings

Of note, Track 2 is the first challenge activity in the community to tackle an unsegmented multispeaker speech recognition scenario with a complete set of reproducible open source baselines providing speech enhancement, speaker diarization, and speech recognition modules.

End-To-End Multi-Speaker Speech Recognition With Transformer

This work replaces the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture, and incorporates an external dereverberation preprocessing, the weighted prediction error (WPE), enabling the model to handle reverberated signals.

Neural Speech Synthesis with Transformer Network

This paper introduces and adapts the multi-head attention mechanism to replace the RNN structures and the original attention mechanism in Tacotron2, and achieves state-of-the-art performance and close to human quality.

Permutation invariant training of deep models for speaker-independent multi-talker speech separation

This work proposes a novel deep learning training criterion, named permutation invariant training (PIT), for speaker independent multi-talker speech separation, and finds that it compares favorably to non-negative matrix factorization (NMF), computational auditory scene analysis (CASA), and DPCL and generalizes well over unseen speakers and languages.
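As a rough illustration of the PIT criterion summarized in this entry, the sketch below scores every assignment of model outputs to reference speakers and keeps the cheapest one. It assumes mean squared error as the per-source distance and uses toy sample lists; the names `mse` and `pit_loss` are hypothetical and are not taken from the paper.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pit_loss(estimates, references):
    """Return (loss, best_perm) under a permutation-invariant criterion.

    best_perm[i] is the index of the reference assigned to estimates[i].
    The per-source distance (MSE) is an assumption made for this sketch.
    """
    n = len(estimates)
    best_loss, best_perm = float("inf"), None
    # Try every output-to-speaker assignment and keep the lowest loss.
    for perm in permutations(range(n)):
        loss = sum(mse(estimates[i], references[perm[i]]) for i in range(n)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example: the outputs match the references but in swapped order.
refs = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]]
ests = [[0.0, 2.0, 0.0], [1.0, 0.0, 1.0]]
loss, perm = pit_loss(ests, refs)
```

Because the loss is taken over the best assignment, the swapped toy outputs are not penalized for their order. The exhaustive search grows factorially with the number of speakers, which is workable for the two- and three-speaker cases typical of this literature.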

Deep Recurrent Networks for Separation and Recognition of Single-Channel Speech in Nonstationary Background Audio

This work compares the performance of deep computational architectures with conventional statistical techniques as well as variants of nonnegative matrix factorization, and establishes that one can achieve impressively superior results with deep-learning-based techniques on this problem.