• Corpus ID: 239016539

Improving End-To-End Modeling for Mispronunciation Detection with Effective Augmentation Mechanisms

@article{Lo2021ImprovingEM,
  title={Improving End-To-End Modeling for Mispronunciation Detection with Effective Augmentation Mechanisms},
  author={Tien-Hong Lo and Yao-Ting Sung and Berlin Chen},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.08731}
}
Recently, end-to-end (E2E) models, which take spectral vector sequences of L2 (second-language) learners' utterances as input and produce the corresponding phone-level sequences as output, have attracted much research attention in the development of mispronunciation detection (MD) systems. However, due to the lack of sufficient labeled speech data from L2 speakers for model estimation, E2E MD models are more prone to overfitting than conventional ones built on DNN-HMM acoustic…

Towards Robust Mispronunciation Detection and Diagnosis for L2 English Learners with Accent-Modulating Methods
TLDR
This paper proposes an effective modeling framework that infuses accent features into an E2E MDD model, thereby making the model more accent-aware, and designs and presents disparate accent-aware modules to perform accent-aware modulation of acoustic features in a fine-grained manner.

References

An Effective End-to-End Modeling Approach for Mispronunciation Detection
TLDR
This work presents a novel use of a hybrid CTC-Attention approach for the MD task, taking advantage of the strengths of both CTC and the attention-based model while getting around the need for phone-level forced alignment.
A Full Text-Dependent End to End Mispronunciation Detection and Diagnosis with Easy Data Augmentation Techniques
TLDR
A novel text-dependent model is presented which achieves a fully end-to-end system by aligning the audio with the phoneme sequences of the prior text inside the model through the attention mechanism, which effectively improves the model's ability to capture mispronounced phonemes.
Mispronunciation Detection and Diagnosis in L2 English Speech Using Multidistribution Deep Neural Networks
TLDR
This work develops an acoustic-graphemic-phonemic model (AGPM) using a multidistribution DNN whose input features include acoustic features as well as the corresponding graphemes and canonical transcriptions (encoded as binary vectors), yielding a unified MDD framework that works much like free-phone recognition.
SED-MDD: Towards Sentence Dependent End-To-End Mispronunciation Detection and Diagnosis
TLDR
SED-MDD is the first model of its kind, and it achieves an accuracy of 86.35% and a correctness of 88.61% on L2-ARCTIC, significantly outperforming the existing end-to-end mispronunciation detection and diagnosis (MD&D) model CNN-RNN-CTC.
CNN-RNN-CTC Based End-to-end Mispronunciation Detection and Diagnosis
  • Wai-Kim Leung, Xunying Liu, H. Meng
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
TLDR
Uses a Convolutional Neural Network, a Recurrent Neural Network, and Connectionist Temporal Classification to build an end-to-end speech recognition model for the Mispronunciation Detection and Diagnosis task, which significantly outperforms the Extended Recognition Network (ERN) and State-level Acoustic Model (S-AM).
Joint CTC-attention based end-to-end speech recognition using multi-task learning
TLDR
A novel method for end-to-end speech recognition to improve robustness and achieve fast convergence by using a joint CTC-attention model within the multi-task learning framework, thereby mitigating the alignment issue.
An Improved Goodness of Pronunciation (GoP) Measure for Pronunciation Evaluation with DNN-HMM System Considering HMM Transition Probabilities
TLDR
This work derives a formulation for the GoP that involves both senone posteriors and STPs; the highest improvement in the correlation coefficient between the scores from the formulation and the expert ratings is found to be 14.89%.
End-to-end attention-based large vocabulary speech recognition
TLDR
This work investigates an alternative method for sequence modelling based on an attention mechanism that allows a Recurrent Neural Network (RNN) to learn alignments between sequences of input frames and output labels.
Attention-Based Models for Speech Recognition
TLDR
The attention mechanism is extended with features needed for speech recognition, and a novel and generic method of adding location-awareness to the attention mechanism is proposed to alleviate the issue of high phoneme error rate.