E2E-Based Multi-Task Learning Approach to Joint Speech and Accent Recognition

@inproceedings{Zhang2021E2EBasedML,
  title={E2E-Based Multi-Task Learning Approach to Joint Speech and Accent Recognition},
  author={Jicheng Zhang and Yizhou Peng and Van Tung Pham and Haihua Xu and Hao Huang and Chng Eng Siong},
  booktitle={Interspeech},
  year={2021}
}
In this paper, we propose a single multi-task learning framework to perform End-to-End (E2E) speech recognition (ASR) and accent recognition (AR) simultaneously. The proposed framework is not only more compact but can also yield comparable or even better results than standalone systems. Specifically, we found that the overall performance is predominantly determined by the ASR task, and the E2E-based ASR pretraining is essential to achieve improved performance, particularly for the AR task… 
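
The shared-backbone design the abstract describes can be made concrete with a short sketch. The following PyTorch code is a minimal illustration only, not the paper's actual architecture: the layer sizes, the CTC-based ASR branch, the mean-pooled accent head, and every identifier are assumptions.

```python
import torch
import torch.nn as nn

class JointAsrArModel(nn.Module):
    """Hypothetical shared-encoder multi-task model: one encoder,
    an ASR branch (CTC here for brevity) and an accent classifier."""

    def __init__(self, feat_dim=80, hidden=256, vocab=5000, n_accents=8):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.asr_head = nn.Linear(hidden, vocab + 1)   # +1 for CTC blank
        self.ar_head = nn.Linear(hidden, n_accents)    # accent classifier
        self.ctc = nn.CTCLoss(blank=vocab, zero_infinity=True)

    def forward(self, feats, feat_lens, tokens, token_lens, accents,
                ar_weight=0.3):
        h = self.encoder(self.proj(feats))             # (B, T, H)
        # ASR branch: per-frame token posteriors, trained with CTC.
        log_probs = self.asr_head(h).log_softmax(-1).transpose(0, 1)
        asr_loss = self.ctc(log_probs, tokens, feat_lens, token_lens)
        # AR branch: mean-pool encoder states, classify the accent.
        ar_logits = self.ar_head(h.mean(dim=1))
        ar_loss = nn.functional.cross_entropy(ar_logits, accents)
        # Weighted multi-task objective; the ASR term dominates,
        # mirroring the observation that the ASR task drives performance.
        return (1 - ar_weight) * asr_loss + ar_weight * ar_loss
```

Pretraining the encoder with the ASR loss alone before enabling the AR branch would mirror the pretraining recipe the abstract highlights.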

Citations

Intermediate-layer output Regularization for Attention-based Speech Recognition with Shared Decoder

As both the encoder and decoder are simultaneously regularized, the network is trained more thoroughly, consistently yielding improved results over the ILO-based CTC method as well as over the original attention-based model trained without the proposed method.
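
To make the ILO idea concrete, here is a hedged sketch of intermediate-layer output regularization with an auxiliary CTC loss, matching the ILO-CTC baseline the summary mentions; the paper's shared-decoder variant would instead run the attention decoder on the tapped output. The tap layer, weight, and all names are assumptions.

```python
def ilo_ctc_loss(encoder_layers, proj, ctc, feats, feat_lens,
                 tokens, token_lens, tap_layer=3, aux_weight=0.3):
    """Hypothetical ILO regularization: run the encoder layer by layer,
    tap an intermediate output, and add an auxiliary CTC loss on it."""
    h, intermediate = feats, None
    for i, layer in enumerate(encoder_layers):
        h = layer(h)
        if i == tap_layer:
            intermediate = h                 # tapped intermediate output

    def ctc_on(x):
        log_probs = proj(x).log_softmax(-1).transpose(0, 1)  # (T, B, C)
        return ctc(log_probs, tokens, feat_lens, token_lens)

    # Final-layer CTC loss plus the intermediate-layer regularizer.
    return ctc_on(h) + aux_weight * ctc_on(intermediate)
```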

Multilingual Approach to Joint Speech and Accent Recognition with DNN-HMM Framework

  • Yizhou Peng, Jicheng Zhang, Chng Eng Siong
  • Computer Science, Linguistics
    2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
  • 2021
Experimental results on an 8-accent English speech recognition task show that both methods can yield WERs close to those of conventional ASR systems that ignore accent entirely, while also achieving the desired AR accuracy.

Layer-Wise Fast Adaptation for End-to-End Multi-Accent Speech Recognition

This work aims to improve multi-accent speech recognition in the end-to-end (E2E) framework with a novel layer-wise adaptation architecture; with accurate accent embeddings, it outperforms traditional methods and obtains a consistent ~15% relative word error rate (WER) reduction across all testing scenarios.
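
One plausible realization of layer-wise adaptation is to inject a transform of the utterance's accent embedding into every encoder layer. The sketch below is an assumption-laden illustration, not the paper's exact architecture.

```python
import torch.nn as nn

class AccentAdaptedEncoder(nn.Module):
    """Hypothetical layer-wise adaptation: each encoder layer receives
    an additive bias predicted from the utterance's accent embedding."""

    def __init__(self, layers, hidden=256, accent_dim=64):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        # One small adapter per layer, mapping accent embedding -> bias.
        self.adapters = nn.ModuleList(
            nn.Linear(accent_dim, hidden) for _ in self.layers)

    def forward(self, h, accent_emb):
        for layer, adapter in zip(self.layers, self.adapters):
            # Broadcast the accent-conditioned bias over all time steps.
            h = layer(h + adapter(accent_emb).unsqueeze(1))
        return h
```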

Improving the transferability of speech separation by meta-learning

Using the meta-learning based methods, it is found that even when training only on speech data with a single accent (native English), the models can still be adapted to new unseen accents from the Speech Accent Archive, and the MAML methods outperform typical transfer learning methods on new accents, new speakers, new languages, and noisy environments.
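
As a rough illustration of the meta-learning recipe, the sketch below implements a first-order MAML step (a common simplification of full MAML, not necessarily the paper's variant); the support/query split, the loss_fn signature, and all names are assumptions.

```python
import copy
import torch

def fomaml_step(model, loss_fn, support, query, outer_opt, inner_lr=1e-3):
    """First-order MAML sketch: adapt a clone of the model on the
    support set, then apply its query-set gradients to the original."""
    adapted = copy.deepcopy(model)
    inner_opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    loss_fn(adapted, support).backward()   # inner adaptation gradients
    inner_opt.step()
    inner_opt.zero_grad()                  # clear before the outer pass
    loss_fn(adapted, query).backward()     # gradients on the adapted copy
    outer_opt.zero_grad()
    # First-order approximation: reuse the adapted copy's gradients.
    for p, q in zip(model.parameters(), adapted.parameters()):
        p.grad = q.grad.clone() if q.grad is not None else None
    outer_opt.step()
```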

Multi-Task End-to-End Model for Telugu Dialect and Speech Recognition

A unified multi-dialect End-to-End ASR is built that removes the need for a dialect recognition block and the need to maintain multiple dialect-specific ASRs for three Telugu regional dialects: Telangana, Coastal Andhra, and Rayalaseema.

Linguistic-Acoustic Similarity Based Accent Shift for Accent Recognition

LASAS is proposed: it concatenates the accent shift with a dimension-reduced text vector to obtain a linguistic-acoustic bimodal representation that is richer and clearer because it takes full advantage of both linguistic and acoustic information, effectively improving AR performance.
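
The fusion pattern described here reduces to a simple recipe: shrink the text vector, concatenate it with the accent-shift vector, and classify. A hypothetical sketch, with all dimensions and names assumed:

```python
import torch
import torch.nn as nn

class BimodalAccentClassifier(nn.Module):
    """Hypothetical LASAS-style fusion: concatenate an accent-shift
    vector with a dimension-reduced text vector, then classify."""

    def __init__(self, shift_dim=256, text_dim=768, reduced=128,
                 n_accents=8):
        super().__init__()
        self.reduce = nn.Linear(text_dim, reduced)   # shrink text vector
        self.classifier = nn.Linear(shift_dim + reduced, n_accents)

    def forward(self, accent_shift, text_vec):
        bimodal = torch.cat([accent_shift, self.reduce(text_vec)], dim=-1)
        return self.classifier(bimodal)
```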

Minimum Word Error Training For Non-Autoregressive Transformer-Based Code-Switching ASR

This paper proposes various approaches to boosting the performance of a CTC-mask-based non-autoregressive Transformer in a code-switching ASR scenario, and employs the Minimum Word Error criterion to train the model.
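
Minimum Word Error training is commonly implemented as the expected word error over an n-best list, weighting each hypothesis's edit distance by its renormalized posterior. The sketch below shows that generic form under assumed inputs; it is not the paper's exact training code.

```python
import torch

def edit_distance(hyp, ref):
    """Plain Levenshtein distance between two token sequences."""
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (h != r))
    return d[len(ref)]

def mwe_loss(hyp_scores, hyps, ref):
    """Expected word error over an n-best list: posterior-weighted
    edit distances, differentiable w.r.t. the hypothesis scores."""
    posts = torch.softmax(hyp_scores, dim=0)   # renormalized n-best posts
    errors = torch.tensor([float(edit_distance(h, ref)) for h in hyps])
    return (posts * errors).sum()
```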

Improving Vietnamese Accent Recognition Using ASR Transfer Learning

This paper proposes a transfer learning method using pretrained ASR models for Vietnamese accent recognition; it lets the system utilize available speech recognition systems while capturing the implicit linguistic and phonetic information learned in ASR to improve AR performance.

Transducer-based language embedding for spoken language identification

Experimental results showed that the proposed method improves performance on LID tasks, with 12% to 59% relative improvement on in-domain datasets and 16% to 24% on cross-domain datasets.

PM-MMUT: Boosted Phone-mask Data Augmentation using Multi-Modeling Unit Training for Phonetic-Reduction-Robust E2E Speech Recognition

Experimental results on Uyghur ASR show that the proposed MMUT approaches clearly outperform pure PMT, and experiments on the 960-hour Librispeech benchmark using ESPnet1 achieve about 10% relative WER reduction on all test sets without LM fusion, compared with the latest official ESPnet1 pre-trained model.

References

SHOWING 1-10 OF 32 REFERENCES

Joint Phoneme-Grapheme Model for End-To-End Speech Recognition

  • Yotaro Kubo, M. Bacchiani
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
A joint model is proposed based on "iterative refinement", where dependency modeling is achieved by a multi-pass strategy, and the performance of a conventional multi-task approach is contrasted with that of the joint model with iterative refinement.

A Multi-Accent Acoustic Model Using Mixture of Experts for Speech Recognition

This work proposes a novel acoustic model architecture based on Mixture of Experts (MoE) which works well on multiple accents without having the overhead of training separate models for separate accents.
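
A Mixture-of-Experts layer of the kind described here soft-selects among parallel expert transforms via a gating network. A minimal sketch with assumed sizes and names:

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Hypothetical accent Mixture-of-Experts layer: a gating network
    soft-selects among expert transforms of each frame."""

    def __init__(self, hidden=256, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Linear(hidden, hidden) for _ in range(n_experts))
        self.gate = nn.Linear(hidden, n_experts)

    def forward(self, h):                                  # h: (B, T, H)
        weights = torch.softmax(self.gate(h), dim=-1)      # (B, T, E)
        outs = torch.stack([e(h) for e in self.experts], dim=-1)
        # Gate-weighted sum over the expert dimension.
        return (outs * weights.unsqueeze(2)).sum(-1)       # (B, T, H)
```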

Improved Accented Speech Recognition Using Accent Embeddings and Multi-task Learning

A multi-task architecture that jointly learns an accent classifier and a multi-accent acoustic model is proposed and augmenting the speech input with accent information in the form of embeddings extracted by a separate network is considered.

Multi-Accent Adaptation Based on Gate Mechanism

This work proposes using an accent-specific top layer with a gate mechanism (AST-G) to realize multi-accent adaptation, and applies an accent classifier to predict the accent label, jointly training the acoustic model and the accent classifier.

Large-Scale End-to-End Multilingual Speech Recognition and Language Identification with Multi-Task Learning

A large-scale end-to-end language-independent multilingual model for joint automatic speech recognition (ASR) and language identification (LID) is reported; it achieves a word error rate (WER) of 52.8 and LID accuracy of 93.5 on 42 languages with around 5000 hours of training data.

Cross-Language Transfer Learning, Continuous Learning, and Domain Adaptation for End-to-End Automatic Speech Recognition

This paper demonstrates the efficacy of transfer learning and continuous learning for various automatic speech recognition (ASR) tasks and shows that in all three cases, transfer learning from a good base model has higher accuracy than a model trained from scratch.

Joint Modeling of Accents and Acoustics for Multi-Accent Speech Recognition

Experiments on the American English Wall Street Journal and British English Cambridge corpora demonstrate that the joint model outperforms the strong multi-task acoustic model baseline, and illustrates that jointly modeling with accent information improves acoustic model performance.

AISpeech-SJTU Accent Identification System for the Accented English Speech Recognition Challenge

The AISpeech-SJTU system is ranked first in the challenge, outperforming all other participants by a large margin; test-time augmentation and embedding fusion schemes are proposed to further improve system performance.

Language independent end-to-end architecture for joint language identification and speech recognition

This paper presents a model that can recognize speech in 10 different languages, by directly performing grapheme (character/chunked-character) based speech recognition, based on the hybrid attention/connectionist temporal classification (CTC) architecture.
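
The hybrid attention/CTC architecture interpolates a CTC loss on the shared encoder with the attention decoder's cross-entropy. A minimal sketch of the joint objective, with the projection, padding convention, and names assumed:

```python
import torch.nn.functional as F

def hybrid_ctc_attention_loss(enc_out, feat_lens, dec_logits, tokens,
                              token_lens, ctc_proj, ctc, alpha=0.3):
    """Hybrid CTC/attention objective: interpolate a CTC loss computed
    on the shared encoder output with the attention decoder's
    cross-entropy loss on its next-token predictions."""
    log_probs = ctc_proj(enc_out).log_softmax(-1).transpose(0, 1)  # (T,B,C)
    ctc_loss = ctc(log_probs, tokens, feat_lens, token_lens)
    # dec_logits: (B, S, C); tokens: (B, S) padded with pad_id 0, which
    # CTC ignores via token_lens and cross-entropy via ignore_index.
    att_loss = F.cross_entropy(dec_logits.transpose(1, 2), tokens,
                               ignore_index=0)
    return alpha * ctc_loss + (1 - alpha) * att_loss
```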

Accent Identification by Combining Deep Neural Networks and Recurrent Neural Networks Trained on Long and Short Term Features

A combination of long-term and short-term training is proposed in this paper for automatic identification of foreign accents, and its performance greatly surpasses the provided baseline system.