Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain
@inproceedings{Guo2021MultiSpeakerAC,
  title     = {Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain},
  author    = {Pengcheng Guo and Xuankai Chang and Shinji Watanabe and Lei Xie},
  booktitle = {Interspeech},
  year      = {2021}
}
Non-autoregressive (NAR) models have achieved large reductions in inference computation and results comparable to autoregressive (AR) models on various sequence-to-sequence tasks. However, there has been limited research exploring NAR approaches for sequence-to-multi-sequence problems, such as multi-speaker automatic speech recognition (ASR). In this study, we extend our proposed conditional chain model to NAR multi-speaker ASR. Specifically, the output of each speaker is inferred one-by…
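The chained inference described above, where each speaker's transcription is decoded conditioned on the state produced for the previous speaker, can be illustrated with a minimal numpy sketch. This is not the paper's actual Conformer-CTC architecture; the parameters `W_mix` and `W_out` and the simple additive conditioning are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: frames, feature dim, vocab size (incl. CTC blank), speakers.
T, D, V, num_speakers = 6, 4, 5, 2

# Hypothetical parameters for illustration only.
W_mix = rng.normal(size=(D, D))  # fuses mixture features with the previous chain state
W_out = rng.normal(size=(D, V))  # projects fused features to CTC label posteriors

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def conditional_chain_decode(mixture_feats):
    """Infer speakers one-by-one; each step is conditioned on the previous one."""
    prev_state = np.zeros_like(mixture_feats)  # no condition for the first speaker
    outputs = []
    for _ in range(num_speakers):
        fused = (mixture_feats + prev_state) @ W_mix  # condition on earlier speaker
        posteriors = softmax(fused @ W_out)           # frame-level CTC posteriors
        tokens = posteriors.argmax(axis=-1)           # greedy NAR decoding per frame
        outputs.append(tokens)
        prev_state = fused                            # chain the state to the next step
    return outputs

feats = rng.normal(size=(T, D))
hyps = conditional_chain_decode(feats)  # one token sequence per speaker
```

Note that within each chain step the decoding is non-autoregressive (all frames are emitted in parallel); only the speaker dimension is processed sequentially, which is what keeps inference cost low while still modeling inter-speaker dependencies.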
6 Citations
Boundary and Context Aware Training for CIF-Based Non-Autoregressive End-to-End ASR
- Computer Science · 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2021
The CTC spike information is used to guide the learning of acoustic boundaries, and a new contextual decoder is adopted to capture the linguistic dependencies within a sentence in the conventional CIF model, improving performance and eliminating its drawbacks.
Minimum Word Error Training For Non-Autoregressive Transformer-Based Code-Switching ASR
- Computer Science · ICASSP
- 2022
This paper proposes various approaches to boosting the performance of a CTC-mask-based non-autoregressive Transformer under a code-switching ASR scenario, and employs the Minimum Word Error criterion to train the model.
M2Met: The Icassp 2022 Multi-Channel Multi-Party Meeting Transcription Challenge
- Computer Science · ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2022
The AliMeeting corpus, which consists of 120 hours of recorded Mandarin meeting data, is made available and the ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge (M2MeT) is launched with two tracks, namely speaker diarization and multi-speaker ASR, aiming to provide a common testbed for meeting rich transcription and promote reproducible research in this field.
A Survey on Non-Autoregressive Generation for Neural Machine Translation and Beyond
- Computer Science · ArXiv
- 2022
This survey conducts a systematic survey with comparisons and discussions of various non-autoregressive translation (NAT) models from different aspects, and categorizes the efforts of NAT into several groups, including data manipulation, modeling methods, training criterion, decoding algorithms, and the benefit from pre-trained models.
Summary on the ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge
- Computer Science · ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2022
The ICASSP 2022 Multi-channel Multi-party Meeting Transcription Grand Challenge (M2MeT) focuses on one of the most valuable and the most challenging scenarios of speech technologies, and releases 120 hours of real-recorded Mandarin meeting speech data with manual annotation.
Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR
- Computer Science · ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2022
An extension of GTC to model the posteriors of both labels and label transitions by a neural network, which can be applied to a wider range of tasks, and achieves promising results with performance close to classical benchmarks for the multi-speaker speech recognition task.
References
SHOWING 1-10 OF 38 REFERENCES
End-To-End Multi-Speaker Speech Recognition With Transformer
- Computer Science, Physics · ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
This work replaces the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture, and incorporates an external dereverberation preprocessing, the weighted prediction error (WPE), enabling the model to handle reverberated signals.
End-to-end Monaural Multi-speaker ASR System without Pretraining
- Computer Science · ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
The experiments demonstrate that the proposed methods can improve the performance of the end-to-end model in separating overlapping speech and recognizing the separated streams, leading to ∼10.0% relative performance gains in terms of CER and WER.
Speaker-Conditional Chain Model for Speech Separation and Extraction
- Computer Science · INTERSPEECH
- 2020
This work proposes a general strategy named the Speaker-Conditional Chain Model for processing complex speech recordings: it infers the identities of a variable number of speakers from the observation with a sequence-to-sequence model, then uses the information from the inferred speakers as conditions to extract their speech sources.
Neural Speaker Diarization with Speaker-Wise Chain Rule
- Linguistics · ArXiv
- 2020
Experimental results show that the proposed speaker-wise conditional inference method can correctly produce diarization results for a variable number of speakers and outperforms state-of-the-art end-to-end speaker diarization methods in terms of diarization error rate.
Single-Channel Multi-talker Speech Recognition with Permutation Invariant Training
- Computer Science · Speech Commun.
- 2018
Improved Mask-CTC for Non-Autoregressive End-to-End ASR
- Computer Science · ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2021
This work proposes to enhance the encoder network by employing the recently proposed Conformer architecture, and introduces new training and decoding methods with an auxiliary objective that predicts the length of a partial target sequence, allowing the model to delete or insert tokens during inference.
A Purely End-to-End System for Multi-speaker Speech Recognition
- Computer Science · ACL
- 2018
Experimental results show that the model is directly able to learn a mapping from a speech mixture to multiple label sequences, achieving 83.1% relative improvement compared to a model trained without the proposed objective.
Listen and Fill in the Missing Letters: Non-Autoregressive Transformer for Speech Recognition
- Computer Science · ArXiv
- 2019
Results on Mandarin (Aishell) and Japanese ASR benchmarks show that it is possible to train such a non-autoregressive network for ASR, and that it matches the performance of the state-of-the-art autoregressive Transformer with a 7x speedup.
CASS-NAT: CTC Alignment-Based Single Step Non-Autoregressive Transformer for Speech Recognition
- Computer Science · ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2021
CASS-NAT shows a performance reduction in WER but is 51.2x faster in terms of RTF; when decoding with an oracle CTC alignment, the lower bound of WER without an LM reaches 2.3%, indicating the potential of the proposed method.
Analysis of Deep Clustering as Preprocessing for Automatic Speech Recognition of Sparsely Overlapping Speech
- Computer Science · INTERSPEECH
- 2019
This paper investigates obstacles to applying DPCL as a preprocessing method for ASR in such a scenario of sparsely overlapping speech, and presents a data simulation approach, closely related to the wsj0-2mix dataset, that generates sparsely overlapping speech datasets of arbitrary overlap ratio.