Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain

@inproceedings{Guo2021MultiSpeakerAC,
  title={Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain},
  author={Pengcheng Guo and Xuankai Chang and Shinji Watanabe and Lei Xie},
  booktitle={Interspeech},
  year={2021}
}
Non-autoregressive (NAR) models have achieved large reductions in inference computation while delivering results comparable to autoregressive (AR) models on various sequence-to-sequence tasks. However, there has been limited research exploring NAR approaches for sequence-to-multi-sequence problems such as multi-speaker automatic speech recognition (ASR). In this study, we extend our proposed conditional chain model to NAR multi-speaker ASR. Specifically, the output of each speaker is inferred one-by-one using both the mixed input speech and the previously estimated conditional speaker features.
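
A minimal sketch of the conditional-chain inference loop described in the abstract, with stand-in modules (a GRU in place of the Conformer encoder, a linear fusion layer); this is an illustration of the idea, not the authors' implementation:

```python
import torch
import torch.nn as nn

class ConditionalChainCTC(nn.Module):
    """Sketch of NAR conditional speaker-chain decoding: each speaker's
    transcript is inferred one-by-one, conditioned on the mixture encoding
    and the previous step's hidden features."""

    def __init__(self, feat_dim=80, hidden=256, vocab=5000, num_speakers=2):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)  # stand-in for Conformer
        self.fusion = nn.Linear(2 * hidden, hidden)                # mixes encoding + condition
        self.ctc_head = nn.Linear(hidden, vocab + 1)               # +1 for the CTC blank
        self.num_speakers = num_speakers

    @torch.no_grad()
    def decode(self, mixture):
        enc, _ = self.encoder(mixture)              # (B, T, H)
        cond = torch.zeros_like(enc)                # empty condition for speaker 1
        hyps = []
        for _ in range(self.num_speakers):          # one parallel NAR step per speaker
            hidden = torch.tanh(self.fusion(torch.cat([enc, cond], dim=-1)))
            log_probs = self.ctc_head(hidden).log_softmax(-1)
            hyps.append(log_probs.argmax(-1))       # greedy CTC path; collapse afterwards
            cond = hidden                           # condition the next speaker on it
        return hyps

model = ConditionalChainCTC()
hyps = model.decode(torch.randn(1, 200, 80))        # one greedy CTC path per speaker
```

Note how the total number of inference steps is bounded by the number of speakers, while each step itself stays non-autoregressive.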

Citations

Boundary and Context Aware Training for CIF-Based Non-Autoregressive End-to-End ASR

The CTC spike information is used to guide the learning of acoustic boundaries, and a new contextual decoder is adopted to capture linguistic dependencies within a sentence, improving performance and eliminating drawbacks of the conventional CIF model.
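
A sketch of how CTC spike positions can be extracted to supervise acoustic boundaries as the summary describes; the function name and the boundary-target construction are assumptions, not the paper's exact recipe:

```python
import torch

def ctc_spike_frames(log_probs: torch.Tensor, blank: int = 0) -> torch.Tensor:
    """Return indices of frames whose argmax is a non-blank token.

    log_probs: (T, V) framewise CTC log-posteriors. These 'spike' positions
    are a cheap proxy for acoustic boundaries and could supervise the CIF
    weight accumulator (assuming one spike per emitted token)."""
    best = log_probs.argmax(dim=-1)                # (T,)
    return (best != blank).nonzero(as_tuple=True)[0]

# Example: a boundary target that places weight 1.0 at spike frames,
# to be used alongside the CIF accumulation weights.
log_probs = torch.randn(120, 30).log_softmax(-1)
spikes = ctc_spike_frames(log_probs)
target = torch.zeros(120)
target[spikes] = 1.0
```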

Minimum Word Error Training For Non-Autoregressive Transformer-Based Code-Switching ASR

This paper proposes various approaches to boost the performance of a CTC-mask-based non-autoregressive Transformer in a code-switching ASR scenario, and employs a Minimum Word Error criterion to train the model.
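
A hedged sketch of a Minimum Word Error criterion over an N-best list: the expected edit distance under renormalized hypothesis posteriors. Names and the list-level normalization are assumptions, not the paper's formulation:

```python
import torch

def edit_distance(a, b):
    """Plain Levenshtein distance between two token lists."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def mwe_loss(nbest_logps, nbest_tokens, ref_tokens):
    """Expected word-error loss over an N-best list (a sketch).

    nbest_logps: per-hypothesis log-scores from the model. The loss is
    sum_h P(h) * edits(h, ref) with P renormalized over the list, so the
    gradient pushes probability mass toward low-error hypotheses."""
    probs = torch.softmax(torch.stack(nbest_logps), dim=0)
    errors = torch.tensor([edit_distance(h, ref_tokens) for h in nbest_tokens],
                          dtype=probs.dtype)
    return (probs * errors).sum() / max(len(ref_tokens), 1)

# Toy usage with hypothetical scores and hypotheses:
hyps = [[1, 2, 3], [1, 2], [1, 2, 3, 4]]
scores = [torch.tensor(s, requires_grad=True) for s in (-1.0, -1.5, -2.0)]
loss = mwe_loss(scores, hyps, ref_tokens=[1, 2, 3])
```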

M2MeT: The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Challenge

  • Fan Yu, Shiliang Zhang, Hui Bu
  • Computer Science
    ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2022
The AliMeeting corpus, consisting of 120 hours of recorded Mandarin meeting data, is made available, and the ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge (M2MeT) is launched with two tracks, speaker diarization and multi-speaker ASR, to provide a common testbed for rich meeting transcription and promote reproducible research in this field.

A Survey on Non-Autoregressive Generation for Neural Machine Translation and Beyond

This survey conducts a systematic survey with comparisons and discussions of various non-autoregressive translation (NAT) models from different aspects, and categorizes the efforts of NAT into several groups, including data manipulation, modeling methods, training criterion, decoding algorithms, and the benefit from pre-trained models.

Summary on the ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge

  • Fan Yu, Shiliang Zhang, Hui Bu
  • Computer Science
    ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2022
The ICASSP 2022 Multi-channel Multi-party Meeting Transcription Grand Challenge (M2MeT) focuses on one of the most valuable and most challenging scenarios for speech technology, and releases 120 hours of real-recorded Mandarin meeting speech data with manual annotations.

Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR

An extension of GTC that models the posteriors of both labels and label transitions with a neural network, can be applied to a wider range of tasks, and achieves promising results, with performance close to classical benchmarks, on the multi-speaker speech recognition task.
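
A minimal sketch of the idea named in the summary: two output heads predicting label and transition posteriors from shared encoder states. Shapes and names are assumptions, not the paper's implementation:

```python
import torch.nn as nn

class GTCHeads(nn.Module):
    """One softmax over labels and a second over graph transitions,
    both predicted from the same encoder states (a sketch)."""

    def __init__(self, hidden=256, num_labels=100, num_transitions=4):
        super().__init__()
        self.label_head = nn.Linear(hidden, num_labels)
        self.transition_head = nn.Linear(hidden, num_transitions)

    def forward(self, enc):                         # enc: (B, T, H)
        return (self.label_head(enc).log_softmax(-1),
                self.transition_head(enc).log_softmax(-1))
```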

References

SHOWING 1-10 OF 38 REFERENCES

End-To-End Multi-Speaker Speech Recognition With Transformer

This work replaces the RNN-based encoder-decoder of the speech recognition model with a Transformer architecture and incorporates an external dereverberation preprocessing step, the weighted prediction error (WPE), enabling the model to handle reverberated signals.

End-to-end Monaural Multi-speaker ASR System without Pretraining

The experiments demonstrate that the proposed methods improve the end-to-end model's ability to separate overlapping speech and recognize the separated streams, yielding ~10.0% relative performance gains in terms of CER and WER, respectively.

Speaker-Conditional Chain Model for Speech Separation and Extraction

This work proposes a general strategy, named the Speaker-Conditional Chain Model, for processing complex speech recordings: it infers the identities of a variable number of speakers from the observation with a sequence-to-sequence model and uses the information from the inferred speakers as conditions to extract their speech sources.

Neural Speaker Diarization with Speaker-Wise Chain Rule

Experimental results show that the proposed speaker-wise conditional inference method can correctly produce diarization results for a variable number of speakers and outperforms state-of-the-art end-to-end speaker diarization methods in terms of diarization error rate.
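
The speaker-wise chain rule referenced here is commonly written as the factorization below (notation assumed: X is the observation, Y_s the s-th speaker's label sequence):

```latex
% Chain-rule factorization over S speakers: each speaker's posterior is
% conditioned on the observation X and all previously decoded speakers.
P(Y_1, \ldots, Y_S \mid X) \;=\; \prod_{s=1}^{S} P\!\left(Y_s \,\middle|\, Y_1, \ldots, Y_{s-1}, X\right)
```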

Single-Channel Multi-talker Speech Recognition with Permutation Invariant Training

Improved Mask-CTC for Non-Autoregressive End-to-End ASR

This work proposes to enhance the encoder network by employing the recently proposed Conformer architecture, and introduces new training and decoding methods with an auxiliary objective that predicts the length of a partial target sequence, allowing the model to delete or insert tokens during inference.
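
A sketch of Mask-CTC-style inference as the summary describes: mask low-confidence tokens in the CTC output and refill them iteratively with a masked-LM decoder. All interfaces here are assumptions, not ESPnet's API:

```python
import torch

@torch.no_grad()
def mask_ctc_decode(ctc_logits, mlm_decoder, enc, mask_id, threshold=0.9, iters=3):
    """Sketch of Mask-CTC inference. ctc_logits: (T, V) token-level CTC
    scores (CTC collapsing of repeats/blanks is assumed already done);
    mlm_decoder(tokens, enc) -> (L, V) refills masked positions."""
    probs, tokens = ctc_logits.softmax(-1).max(-1)
    tokens = tokens.clone()
    tokens[probs < threshold] = mask_id              # mask low-confidence tokens
    for _ in range(iters):
        masked = tokens == mask_id
        if masked.sum() == 0:
            break
        out = mlm_decoder(tokens, enc).softmax(-1)   # (L, V)
        conf, pred = out.max(-1)
        # fill the most confident masked positions first
        k = max(1, int(masked.sum()) // iters)
        idx = (conf * masked).topk(k).indices
        tokens[idx] = pred[idx]
    return tokens

# Toy usage with a dummy decoder that scores all tokens uniformly:
dummy = lambda toks, enc: torch.zeros(len(toks), 30)
out = mask_ctc_decode(torch.randn(8, 30), dummy, enc=None, mask_id=29)
```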

A Purely End-to-End System for Multi-speaker Speech Recognition

Experimental results show that the model is able to directly learn a mapping from a speech mixture to multiple label sequences, achieving an 83.1% relative improvement compared to a model trained without the proposed objective.
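
The "proposed objective" is a permutation-free training loss; a minimal sketch of the PIT-style minimum over branch-to-reference assignments (the pairwise loss matrix is assumed precomputed, e.g. from per-pair CTC losses):

```python
import itertools
import torch

def pit_loss(losses: torch.Tensor) -> torch.Tensor:
    """Permutation-invariant training objective (a sketch).

    losses: (S, S) matrix with losses[i, j] = loss of output branch i
    against reference j. Returns the minimum total loss over all
    branch-to-reference assignments."""
    S = losses.shape[0]
    best = None
    for perm in itertools.permutations(range(S)):
        total = sum(losses[i, p] for i, p in enumerate(perm))
        best = total if best is None or total < best else best
    return best

# e.g. for 2 speakers: min(l00 + l11, l01 + l10)
pairwise = torch.tensor([[0.8, 2.1], [1.9, 0.7]])
print(pit_loss(pairwise))   # tensor(1.5000)
```

The exhaustive minimum is exponential in the number of speakers, which is acceptable for the 2-3 speaker mixtures typical of this line of work.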

Listen and Fill in the Missing Letters: Non-Autoregressive Transformer for Speech Recognition

Results on Mandarin (Aishell) and Japanese ASR benchmarks show that such a non-autoregressive network can be trained for ASR, and that it matches the performance of a state-of-the-art autoregressive Transformer with a 7x speedup.

CASS-NAT: CTC Alignment-Based Single Step Non-Autoregressive Transformer for Speech Recognition

CASS-NAT shows a slight WER degradation but is 51.2x faster in terms of RTF; when decoding with an oracle CTC alignment, the lower bound of WER without an LM reaches 2.3%, indicating the potential of the proposed method.

Analysis of Deep Clustering as Preprocessing for Automatic Speech Recognition of Sparsely Overlapping Speech

This paper investigates the obstacles to applying DPCL as a preprocessing method for ASR on sparsely overlapping speech, and presents a data simulation approach, closely related to the wsj0-2mix dataset, that generates sparsely overlapping speech datasets with arbitrary overlap ratios.
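
A sketch of the overlap-ratio-controlled mixing idea described in the summary; the ratio definition and the placement of the second speaker are assumptions, not the paper's exact simulation recipe:

```python
import numpy as np

def mix_with_overlap(s1: np.ndarray, s2: np.ndarray, ratio: float) -> np.ndarray:
    """Mix two single-speaker signals at a target overlap ratio,
    defined here as overlapped samples / mixture length. Solving
    ov = ratio * (len(s1) + len(s2) - ov) for ov gives the formula below."""
    ov = int(ratio * (len(s1) + len(s2)) / (1.0 + ratio))
    total = len(s1) + len(s2) - ov
    mix = np.zeros(total, dtype=np.float64)
    mix[:len(s1)] += s1                      # speaker 1 starts at t = 0
    mix[len(s1) - ov:] += s2                 # speaker 2 overlaps the tail of speaker 1
    return mix

mix = mix_with_overlap(np.random.randn(32000), np.random.randn(24000), 0.2)
```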