Improved Accented Speech Recognition Using Accent Embeddings and Multi-task Learning

Abhinav Jain, Minali Upreti, Preethi Jyothi
One of the major remaining challenges in modern automatic speech recognition (ASR) systems for English is to be able to handle speech from users with a diverse set of accents. ASR systems that are trained on speech from multiple English accents still underperform when confronted with a new speech accent. In this work, we explore how to use accent embeddings and multi-task learning to improve speech recognition for accented speech. We propose a multi-task architecture that jointly learns an… 
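As a rough illustration of the multi-task setup described in the abstract, the joint objective can be sketched as a weighted sum of an ASR loss and an auxiliary accent-classification loss. The function name and weight below are illustrative assumptions, not the paper's exact formulation.

```python
def multitask_loss(asr_loss, accent_loss, weight=0.5):
    """Joint objective: ASR loss plus a weighted accent-classification loss.

    In a multi-task setup, a shared encoder feeds both an ASR head and an
    accent-classifier head; training minimizes this combined loss. The
    weight balancing the two tasks is a tunable hyperparameter (the value
    here is an arbitrary placeholder).
    """
    return asr_loss + weight * accent_loss

# Example: combine per-batch losses from the two heads.
print(multitask_loss(asr_loss=2.0, accent_loss=1.0))  # 2.5
```

Setting the weight to zero recovers plain single-task ASR training, which makes the auxiliary-task contribution easy to ablate.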

Improved BLSTM RNN Based Accent Speech Recognition Using Multi-task Learning and Accent Embeddings

This paper considers augmenting the speech input with accent information in the form of embeddings extracted by a standalone network and proposes a multi-task learning architecture that jointly learns an accent classifier and a multi-accent acoustic model.

Achieving Multi-Accent ASR via Unsupervised Acoustic Model Adaptation

This work proposes to compute x-vector-like accent embeddings and use them as auxiliary inputs to an acoustic model trained on native data only in order to improve the recognition of multi-accent data comprising native, non-native, and accented speech.
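Feeding accent embeddings as auxiliary inputs, as in the work above, amounts to concatenating an utterance-level embedding onto each acoustic frame. A minimal pure-Python sketch, with assumed names and frames represented as plain lists:

```python
def augment_features(frames, accent_embedding):
    """Append an utterance-level accent embedding to every acoustic frame.

    `frames` is a list of per-frame feature vectors (plain lists here);
    the same accent embedding is concatenated onto each frame before the
    acoustic model sees it.
    """
    return [frame + accent_embedding for frame in frames]

# Two 3-dim frames plus one 2-dim accent embedding -> 5-dim model inputs.
augmented = augment_features([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], [1.0, 0.0])
print(augmented[0])  # [0.1, 0.2, 0.3, 1.0, 0.0]
```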

Coupled Training of Sequence-to-Sequence Models for Accented Speech Recognition

  • Vinit Unni, Nitish Joshi, P. Jyothi
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
This work proposes coupled training for encoder-decoder ASR models that acts on pairs of utterances corresponding to the same text spoken by speakers with different accents, thus acting as a regularizer and encouraging representations from the encoder to be more accent-invariant.

A Multi-Accent Acoustic Model Using Mixture of Experts for Speech Recognition

This work proposes a novel acoustic model architecture based on Mixture of Experts (MoE) which works well on multiple accents without having the overhead of training separate models for separate accents.
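A Mixture-of-Experts acoustic model gates between expert subnetworks rather than training one model per accent. The sketch below shows the standard gating computation; the names and scalar expert outputs are simplifying assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_output(expert_outputs, gate_logits):
    """Weighted mixture of expert outputs using softmax gate weights."""
    weights = softmax(gate_logits)
    return sum(w * o for w, o in zip(weights, expert_outputs))

# With equal gate logits, the mixture is just the average of the experts.
print(moe_output([1.0, 3.0], [0.0, 0.0]))  # 2.0
```

In practice the gate logits come from a learned network conditioned on the input, so the gate can softly route accented speech to the experts that model it best.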

Learning Fast Adaptation on Cross-Accented Speech Recognition

This paper introduces a cross-accented English speech recognition task as a benchmark for measuring the ability of the model to adapt to unseen accents using the existing CommonVoice corpus and proposes an accent-agnostic approach that extends the model-agnostic meta-learning (MAML) algorithm for fast adaptation to unseen accents.
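The MAML-style adaptation described above takes a gradient step per accent ("task") and then updates the initialization so that those adapted models perform well. A first-order, scalar-parameter sketch; the quadratic per-task losses and all names are illustrative assumptions:

```python
def maml_step(theta, task_targets, inner_lr=0.1, outer_lr=0.1):
    """One first-order MAML outer update over a batch of accent 'tasks'.

    Each task uses a toy quadratic loss (theta - target)^2, whose
    gradient is 2 * (theta - target). For each task we take one inner
    adaptation step, then accumulate the gradient at the adapted
    parameters (the first-order approximation) for the outer update.
    """
    outer_grad = 0.0
    for target in task_targets:
        def grad(th, t=target):
            return 2.0 * (th - t)
        adapted = theta - inner_lr * grad(theta)  # inner adaptation step
        outer_grad += grad(adapted)               # gradient after adaptation
    return theta - outer_lr * outer_grad / len(task_targets)

# Symmetric tasks pull equally in opposite directions: theta stays put.
print(maml_step(0.0, [1.0, -1.0]))  # 0.0
```

At test time, the learned initialization is fine-tuned with a few inner steps on a small amount of data from the unseen accent.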

Accent-Robust Automatic Speech Recognition Using Supervised and Unsupervised Wav2vec Embeddings

This study performs systematic comparisons of DAT and MTL approaches using a large English accent corpus and finds that the DAT model trained with supervised embeddings achieves the best performance overall and consistently provides benefits for all testing datasets, while the MTL model trained with wav2vec embeddings is helpful for learning accent-invariant features and improving performance on novel/unseen accents.

Layer-Wise Fast Adaptation for End-to-End Multi-Accent Speech Recognition

This work aims to improve multi-accent speech recognition in the end-to-end (E2E) framework with a novel layer-wise adaptation architecture; with accurate accent embeddings it outperforms traditional methods and obtains a consistent ~15% relative word error rate (WER) reduction across all testing scenarios.

Accented Speech Recognition Inspired by Human Perception

Methods based on human perception are promising in reducing WER and understanding how accented speech is modeled in neural networks for novel accents.

End-to-End Accented Speech Recognition

This work explores the use of multi-task training and accent embedding in the context of end-to-end ASR trained with the connectionist temporal classification loss and shows relative improvement in word error rate.

Improving Accent Identification and Accented Speech Recognition Under a Framework of Self-supervised Learning

This work proposes an accent-dependent ASR system that can utilize additional accent input features, together with a frame-level accent feature that is extracted by the proposed accent identification model and can be dynamically adjusted.

Joint Modeling of Accents and Acoustics for Multi-Accent Speech Recognition

Experiments on the American English Wall Street Journal and British English Cambridge corpora demonstrate that the joint model outperforms the strong multi-task acoustic model baseline, and illustrates that jointly modeling with accent information improves acoustic model performance.

Automatic speech recognition of multiple accented English data

There is significant performance degradation of a baseline system trained on only US data when confronted with shows from other regions, but results improve significantly when data from all the regions are included for accent-independent acoustic model training.

Multi-accent Chinese speech recognition

A method is proposed to handle multiple accents as well as standard speech in a speaker-independent system by merging auxiliary accent decision trees with standard trees and reconstructing the acoustic model.

Multi-Accent Speech Recognition of Afrikaans, Black and White Varieties of South African English

Investigation of speech recognition performance of systems employing several accent-specific recognisers in parallel for the simultaneous recognition of multiple accents finds that parallel systems outperform oracle systems for the AE+EE accent pair while the opposite is observed for BE+EE.

Accent detection and speech recognition for Shanghai-accented Mandarin

A new approach that combines accent detection, accent-discriminative acoustic features, acoustic adaptation and model selection for accented Chinese speech recognition is proposed, and experimental results show that this approach improves the recognition of accented speech.

Improving deep neural networks based multi-accent Mandarin speech recognition using i-vectors and accent-specific top layer

A method is proposed that uses i-vectors and model adaptation techniques to improve the performance of deep neural network based multi-accent Mandarin speech recognition via an accent-specific top layer and shared hidden layers.

Multi-accent deep neural network acoustic model with accent-specific top layer using the KLD-regularized model adaptation

A multi-accent deep neural network acoustic model with an accent-specific top layer and shared bottom hidden layers is used to model distinct accent-specific patterns, yielding a smaller yet significant WER reduction over a baseline model trained with the MMI sequence-level criterion.
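The shared-bottom / accent-specific-top design used by the two papers above can be sketched as routing a shared representation through a per-accent output head. The toy one-dimensional "layers" and all names below are assumptions for illustration:

```python
def forward(features, shared_layers, accent_heads, accent_id):
    """Run shared hidden layers, then the output layer for one accent."""
    h = features
    for layer in shared_layers:
        h = layer(h)                   # hidden layers shared across accents
    return accent_heads[accent_id](h)  # accent-specific top layer

# Toy scalar 'layers': one shared transform, then per-accent heads.
shared = [lambda x: 2 * x + 1]
heads = {"accent_a": lambda h: h - 1, "accent_b": lambda h: h + 1}
print(forward(3, shared, heads, "accent_b"))  # 8
```

Only the selected head's parameters are accent-specific; everything below it is trained on (and regularized by) data from all accents.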

Speech recognition of multiple accented English data using acoustic model interpolation

This work uses model interpolation as an unsupervised adaptation framework, where the interpolation coefficients are estimated on-the-fly for each test segment, and a theoretically motivated EM-like mixture reduction algorithm is proposed.
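Acoustic model interpolation as described above forms a per-segment mixture of the component models' likelihoods. A minimal sketch, with uniform coefficients standing in for the on-the-fly EM estimate and all names assumed:

```python
def interpolate_likelihoods(model_likelihoods, coeffs):
    """Mixture likelihood: sum over models of c_m * p_m(x), sum(c_m) == 1."""
    assert abs(sum(coeffs) - 1.0) < 1e-9, "coefficients must sum to 1"
    return sum(c * p for c, p in zip(coeffs, model_likelihoods))

# Two accent-specific models scored a test segment; mix them 50/50.
print(interpolate_likelihoods([0.2, 0.4], [0.5, 0.5]))  # ~0.3
```

In the unsupervised setting, the coefficients would be re-estimated for each test segment (e.g. by EM over the mixture) rather than fixed in advance.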

Multi-accent speech recognition with hierarchical grapheme based models

  • Kanishka RaoH. Sak
  • Computer Science
    2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2017
This work trains grapheme-based acoustic models for speech recognition using a hierarchical recurrent neural network architecture with connectionist temporal classification (CTC) loss and observes large recognition accuracy improvements for Indian-accented utterances in Google VoiceSearch US traffic with a 40% relative WER reduction.

Stacked Long-Term TDNN for Spoken Language Recognition

A stacked architecture that uses a time delay neural network (TDNN) to model long-term patterns for spoken language identification and provides complementary information to fuse with the new generation of bottleneck-based i-vector systems that model short-term dependencies.