E2E-Based Multi-Task Learning Approach to Joint Speech and Accent Recognition
@inproceedings{Zhang2021E2EBasedML,
  title     = {E2E-Based Multi-Task Learning Approach to Joint Speech and Accent Recognition},
  author    = {Jicheng Zhang and Yizhou Peng and Van Tung Pham and Haihua Xu and Hao Huang and Chng Eng Siong},
  booktitle = {Interspeech},
  year      = {2021}
}
In this paper, we propose a single multi-task learning framework to perform end-to-end (E2E) automatic speech recognition (ASR) and accent recognition (AR) simultaneously. The proposed framework is not only more compact but also yields results comparable to, or better than, standalone systems. Specifically, we found that overall performance is predominantly determined by the ASR task, and that E2E-based ASR pretraining is essential for improved performance, particularly on the AR task…
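The excerpt does not give the exact loss formulation, but joint E2E multi-task training of this kind is commonly implemented as a weighted interpolation of the two task losses. The sketch below illustrates that idea; the function name, the weight `alpha`, and the example values are illustrative assumptions, not taken from the paper:

```python
def joint_mtl_loss(asr_loss: float, ar_loss: float, alpha: float = 0.7) -> float:
    """Weighted interpolation of an ASR loss and an accent-recognition loss.

    An alpha close to 1.0 lets the ASR objective dominate, consistent with
    the abstract's observation that overall performance is predominantly
    determined by the ASR task.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * asr_loss + (1.0 - alpha) * ar_loss

# Example: ASR cross-entropy 2.4, accent-classification cross-entropy 0.8
loss = joint_mtl_loss(2.4, 0.8, alpha=0.7)  # 0.7*2.4 + 0.3*0.8 ≈ 1.92
```

In practice the two losses would come from a shared encoder with task-specific heads; `alpha` is then a tuning knob trading ASR accuracy against AR accuracy.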
12 Citations
Intermediate-layer output Regularization for Attention-based Speech Recognition with Shared Decoder
- Computer Science, ArXiv
- 2022
As both the encoder and decoder are simultaneously "regularized", the network is trained more thoroughly, consistently yielding improved results over the ILO-based CTC method, as well as over the original attention-based model without the proposed method.
Multilingual Approach to Joint Speech and Accent Recognition with DNN-HMM Framework
- Computer Science, Linguistics, 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
- 2021
Experimental results on an 8-accent English speech recognition task show that both methods can yield WERs close to those of conventional ASR systems that completely ignore accent, along with the desired AR accuracy.
Layer-Wise Fast Adaptation for End-to-End Multi-Accent Speech Recognition
- Computer Science, IEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2022
This work aims to improve multi-accent speech recognition in the end-to-end (E2E) framework with a novel layer-wise adaptation architecture; with an accurate accent embedding, it outperforms traditional methods and obtains a consistent ~15% relative word error rate (WER) reduction across all testing scenarios.
Improving the transferability of speech separation by meta-learning
- Computer Science, ArXiv
- 2022
With the meta-learning based methods, it is found that even when training only on speech with a single accent (native English), the models can still adapt to unseen accents from the Speech Accent Archive, and the MAML methods outperform typical transfer learning methods on new accents, new speakers, new languages, and noisy environments.
Multi-Task End-to-End Model for Telugu Dialect and Speech Recognition
- Computer Science, Linguistics, INTERSPEECH
- 2022
A unified multi-dialect end-to-end ASR system is built that removes the need for a dialect recognition block and for maintaining multiple dialect-specific ASR systems for three Telugu regional dialects: Telangana, Coastal Andhra, and Rayalaseema.
Linguistic-Acoustic Similarity Based Accent Shift for Accent Recognition
- Computer Science, INTERSPEECH
- 2022
LASAS is proposed: it concatenates the accent shift with a dimension-reduced text vector to obtain a linguistic-acoustic bimodal representation that is richer and clearer by taking full advantage of both linguistic and acoustic information, which effectively improves AR performance.
Minimum Word Error Training For Non-Autoregressive Transformer-Based Code-Switching ASR
- Computer Science, ICASSP
- 2022
This paper proposes various approaches to boost the performance of a CTC-mask-based non-autoregressive Transformer in the code-switching ASR scenario, and employs the Minimum Word Error criterion to train the model.
Improving Vietnamese Accent Recognition Using ASR Transfer Learning
- Computer Science, 2022 25th Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)
- 2022
This paper proposes a transfer learning method using pretrained ASR models for Vietnamese accent recognition that helps the system utilize available speech recognition systems while capturing implicit linguistic and phonetic information learned in ASR to improve its performance.
Transducer-based language embedding for spoken language identification
- Computer Science, Linguistics, INTERSPEECH
- 2022
Experimental results showed that the proposed method improves performance on LID tasks, with 12% to 59% and 16% to 24% relative improvements on in-domain and cross-domain datasets, respectively.
PM-MMUT: Boosted Phone-mask Data Augmentation using Multi-Modeling Unit Training for Phonetic-Reduction-Robust E2E Speech Recognition
- Computer Science, INTERSPEECH
- 2022
Experimental results on Uyghur ASR show that the proposed MMUT approaches clearly outperform pure PMT; experiments on the 960-hour LibriSpeech benchmark using ESPnet1 achieve about 10% relative WER reduction on all test sets without LM fusion, compared with the latest official ESPnet1 pre-trained model.
References (showing 1-10 of 32)
Joint Phoneme-Grapheme Model for End-To-End Speech Recognition
- Computer Science, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
A joint model is proposed based on "iterative refinement", where dependency modeling is achieved by a multi-pass strategy, and the performance of a conventional multi-task approach is contrasted with that of the joint model with iterative refinement.
A Multi-Accent Acoustic Model Using Mixture of Experts for Speech Recognition
- Computer Science, INTERSPEECH
- 2019
This work proposes a novel acoustic model architecture based on Mixture of Experts (MoE) which works well on multiple accents without having the overhead of training separate models for separate accents.
Improved Accented Speech Recognition Using Accent Embeddings and Multi-task Learning
- Computer Science, INTERSPEECH
- 2018
A multi-task architecture that jointly learns an accent classifier and a multi-accent acoustic model is proposed and augmenting the speech input with accent information in the form of embeddings extracted by a separate network is considered.
Multi-Accent Adaptation Based on Gate Mechanism
- Computer Science, INTERSPEECH
- 2019
This work proposes using an accent-specific top layer with a gate mechanism (AST-G) to realize multi-accent adaptation, and uses an accent classifier to predict the accent label so that the acoustic model and the accent classifier can be trained jointly.
Large-Scale End-to-End Multilingual Speech Recognition and Language Identification with Multi-Task Learning
- Computer Science, Linguistics, INTERSPEECH
- 2020
A large-scale end-to-end language-independent multilingual model for joint automatic speech recognition (ASR) and language identification (LID) is reported, achieving a word error rate (WER) of 52.8 and LID accuracy of 93.5 on 42 languages with around 5000 hours of training data.
Cross-Language Transfer Learning, Continuous Learning, and Domain Adaptation for End-to-End Automatic Speech Recognition
- Computer Science
- 2020
This paper demonstrates the efficacy of transfer learning and continuous learning for various automatic speech recognition (ASR) tasks and shows that in all three cases, transfer learning from a good base model has higher accuracy than a model trained from scratch.
Joint Modeling of Accents and Acoustics for Multi-Accent Speech Recognition
- Computer Science, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2018
Experiments on the American English Wall Street Journal and British English Cambridge corpora demonstrate that the joint model outperforms the strong multi-task acoustic model baseline, and illustrates that jointly modeling with accent information improves acoustic model performance.
AISpeech-SJTU Accent Identification System for the Accented English Speech Recognition Challenge
- Computer Science, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2021
The AISpeech-SJTU system ranked first in the challenge, outperforming all other participants by a large margin; test-time augmentation and embedding fusion schemes are proposed to further improve system performance.
Language independent end-to-end architecture for joint language identification and speech recognition
- Computer Science, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2017
This paper presents a model that can recognize speech in 10 different languages, by directly performing grapheme (character/chunked-character) based speech recognition, based on the hybrid attention/connectionist temporal classification (CTC) architecture.
Accent Identification by Combining Deep Neural Networks and Recurrent Neural Networks Trained on Long and Short Term Features
- Computer Science, INTERSPEECH
- 2016
A combination of long-term and short-term training is proposed in this paper for automatic identification of foreign accents; its performance greatly surpasses the provided baseline system.