CoCA-MDD: A Coupled Cross-Attention based Framework for Streaming Mispronunciation Detection and Diagnosis

  • Nianzu Zheng, Liqun Deng, Wen-Chin Huang, Yu Ting Yeung, Baohua Xu, Yuanyuan Guo, Yasheng Wang, Xiao Chen, Xin Jiang, Qun Liu
  • Interspeech 2022
Mispronunciation detection and diagnosis (MDD) is a popular research focus in computer-aided pronunciation training (CAPT) systems. End-to-end (e2e) approaches are becoming dominant in MDD. However, an e2e MDD model usually requires entire speech utterances as input context, which leads to significant latency, especially for long paragraphs. We propose a streaming e2e MDD model called CoCA-MDD. We utilize a conv-transformer structure to encode input speech in a streaming manner. A coupled…

L2-ARCTIC: A Non-native English Speech Corpus

L2-ARCTIC is introduced, a speech corpus of non-native English that is intended for research in voice conversion, accent conversion, and mispronunciation detection, and is publicly accessible at

Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks

This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby solving both problems of sequence learning and post-processing.

speechocean762: An Open-Source Non-native English Speech Corpus For Pronunciation Assessment

A new open-source speech corpus named “speechocean762” designed for pronunciation assessment use, consisting of 5000 English utterances from 250 non-native speakers, where half of the speakers are children.

A Study on Fine-Tuning wav2vec2.0 Model for the Task of Mispronunciation Detection and Diagnosis

This paper explores the use of the Self-Supervised Pretraining (SSP) model wav2vec2.0 for MDD tasks and demonstrates the effectiveness of SSP on MDD.

End-to-End Mispronunciation Detection and Diagnosis From Raw Waveforms

A fully end-to-end (E2E) neural model for MDD, which processes learners' speech directly from raw waveforms and achieves mispronunciation detection performance comparable to state-of-the-art E2E MDD models that take standard handcrafted acoustic features as input.

SED-MDD: Towards Sentence Dependent End-To-End Mispronunciation Detection and Diagnosis

SED-MDD is the first model of its kind and achieves an accuracy of 86.35% and a correctness of 88.61% on L2-ARCTIC, significantly outperforming the existing end-to-end mispronunciation detection and diagnosis (MD&D) model CNN-RNN-CTC.

Towards Fast and Accurate Streaming End-To-End ASR

  • Bo Li, Shuo-yiin Chang, Yonghui Wu
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
This work proposes to reduce the E2E model’s latency by extending the RNN-T endpointer (RNN-T EP) model with additional early and late penalties, and achieves an 8.0% relative word error rate (WER) reduction and a 130 ms 90th-percentile latency reduction over [2] on a Voice Search test set.

Normalization of GOP for Chinese Mispronunciation Detection

  • Wenwei Dong, Yanlu Xie
  • Computer Science
    2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
  • 2019
Two ways to normalize GOP scores are proposed: separating the GOP calculation for Chinese Initials from that for Chinese Finals, and using the corresponding native pronunciation score as a template to scale the non-native one.

CNN-RNN-CTC Based End-to-end Mispronunciation Detection and Diagnosis

  • Wai-Kim Leung, Xunying Liu, H. Meng
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
This work uses a Convolutional Neural Network, a Recurrent Neural Network, and Connectionist Temporal Classification to build an end-to-end speech recognition model for the Mispronunciation Detection and Diagnosis task, which significantly outperforms the Extended Recognition Network (ERN) and State-level Acoustic Model (S-AM).