PM-MMUT: Boosted Phone-mask Data Augmentation using Multi-Modeling Unit Training for Phonetic-Reduction-Robust E2E Speech Recognition

Guodong Ma, Pengfei Hu, Nurmemet Yolwas, Shen Huang, Hao Huang
Consonant and vowel reduction are often encountered in speech and can degrade automatic speech recognition (ASR) performance. Our recently proposed masking-based learning strategy, Phone Masking Training (PMT), alleviates the impact of these phenomena in Uyghur ASR. Although PMT achieves remarkable improvements, there is still room for further gains because of the granularity mismatch between the masking unit of PMT (phoneme) and the modeling unit (word-piece). To boost the…
1 Citation

A Policy-based Approach to the SpecAugment Method for Low Resource E2E ASR

A policy-based SpecAugment (Policy-SpecAugment) method is proposed that aims to encourage the model to learn from more diverse data, in particular the data the model relatively requires.



Leveraging Phone Mask Training for Phonetic-Reduction-Robust E2E Uyghur Speech Recognition

In Uyghur speech, consonant and vowel reduction are often encountered, especially in spontaneous speech with a high speech rate, which degrades speech recognition performance. To

Semantic Mask for Transformer based End-to-End Speech Recognition

This paper proposes a semantic-mask-based regularization for training Transformer-based end-to-end (E2E) models: the input features corresponding to a particular output token are masked to encourage the model to fill in the token from contextual information.
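
The masking idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `semantic_mask`, the `token_spans` alignment format (half-open frame ranges per output token), the masking probability, and the choice to fill masked frames with the utterance mean are all assumptions made for illustration.

```python
import random


def semantic_mask(frames, token_spans, mask_prob=0.15, seed=None):
    """Return (masked_frames, masked_token_indices).

    frames      : list of T feature vectors (each a list of floats).
    token_spans : hypothetical alignment, one (start, end) frame range
                  per output token, end exclusive.
    mask_prob   : chance of masking each token (assumed value).
    """
    rng = random.Random(seed)
    num_frames = len(frames)
    dim = len(frames[0])

    # Fill value: per-dimension mean over the utterance (an assumption).
    mean = [sum(f[d] for f in frames) / num_frames for d in range(dim)]

    # Copy so the caller's features are left untouched.
    masked = [list(f) for f in frames]
    masked_tokens = []

    for idx, (start, end) in enumerate(token_spans):
        if rng.random() < mask_prob:
            masked_tokens.append(idx)
            # Overwrite every frame aligned to this token.
            for t in range(start, min(end, num_frames)):
                masked[t] = list(mean)

    return masked, masked_tokens
```

In this sketch the masked features would be fed to the model while the target transcript stays unchanged, so the masked token has to be predicted from the surrounding context.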

Joint Phoneme-Grapheme Model for End-To-End Speech Recognition

  • Yotaro Kubo, M. Bacchiani
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
A joint model is proposed based on "iterative refinement" where dependency modeling is achieved by a multi-pass strategy and performance of a conventional multi-task approach is contrasted with that of the joint model with iterative refinement.

Decoupling Pronunciation and Language for End-to-End Code-Switching Automatic Speech Recognition

A decoupled transformer model is proposed that uses monolingual paired data and unpaired text data to alleviate the shortage of code-switching data; it is evaluated on a public Mandarin-English code-switching dataset.

MixSpeech: Data Augmentation for Low-Resource Automatic Speech Recognition

Experimental results show that MixSpeech achieves better accuracy than the baseline models without data augmentation, and outperforms a strong data augmentation method, SpecAugment, on these recognition tasks.

Data Augmentation Methods for End-to-end Speech Recognition on Distant-Talk Scenarios

Data augmentation methods for E2E ASR in distant-talk scenarios are investigated. Each augmentation method individually improves accuracy on top of the conventional SpecAugment, and further improvements are obtained by combining these approaches.

An Investigation of Using Hybrid Modeling Units for Improving End-to-End Speech Recognition System

Using a hybrid of the syllable, Chinese character, and subword as the modeling units for an end-to-end speech recognition system based on CTC/attention multi-task learning can achieve better performance than the conventional char-subword units, with a 6.6% relative CER reduction on 1200-hour data.

Hierarchical Multitask Learning for CTC-based Speech Recognition

It is observed that the hierarchical multitask approach improves over standard multitask training in higher-data experiments, while in low-resource settings standard multitask training works well.

Joint CTC-attention based end-to-end speech recognition using multi-task learning

A novel method for end-to-end speech recognition that improves robustness and achieves fast convergence by using a joint CTC-attention model within the multi-task learning framework, thereby mitigating the alignment issue.

Multitask Learning with Low-Level Auxiliary Tasks for Encoder-Decoder Based Speech Recognition

This work hypothesizes that using intermediate representations as auxiliary supervision at lower levels of deep networks may be a good way of combining the advantages of end-to-end training and more traditional pipeline approaches.