Attention based end to end Speech Recognition for Voice Search in Hindi and English

@article{Joshi2021AttentionBE,
  title={Attention based end to end Speech Recognition for Voice Search in Hindi and English},
  author={Raviraj Joshi and Venkateshan Kannan},
  journal={Forum for Information Retrieval Evaluation},
  year={2021}
}
We describe our work on automatic speech recognition (ASR) for voice search on the Flipkart e-commerce platform. Starting from the Listen-Attend-Spell (LAS) deep learning architecture, we extend the model design and attention mechanisms with multi-objective training, multi-pass training, and external rescoring using language models and phoneme-based losses. We report a relative WER improvement of 15.7…
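
To make the external rescoring step concrete, here is a minimal Python sketch of second-pass n-best rescoring. The n-best list, the toy lm_score_fn, and the lm_weight value are illustrative placeholders, not details from the paper; both scores are assumed to be log-probabilities.

# Hypothetical sketch: rescore an n-best list from a first-pass ASR decoder
# by interpolating its scores with an external language model (LM).
def rescore_nbest(nbest, lm_score_fn, lm_weight=0.3):
    """nbest: list of (hypothesis_text, asr_log_prob) pairs."""
    rescored = [(text, asr_score + lm_weight * lm_score_fn(text))
                for text, asr_score in nbest]
    # Return hypotheses sorted best-first by the combined score.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# Toy LM that penalizes length, standing in for a real neural LM.
nbest = [("by shoes online", -4.0), ("buy shoes online", -4.2)]
reranked = rescore_nbest(nbest, lm_score_fn=lambda t: -0.5 * len(t.split()))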

On Comparison of Encoders for Attention based End to End Speech Recognition in Standalone and Rescoring Mode
TLDR
This work evaluates non-streaming attention-based end-to-end ASR models on the Flipkart voice search task in both standalone and rescoring modes, and shows that the Transformer model offers acceptable WER with the lowest latency.
A Simple Baseline for Domain Adaptation in End to End ASR Systems Using Synthetic Data
TLDR
This work proposes a simple baseline technique for domain adaptation in end-to-end speech recognition models, and shows that single-speaker synthetic TTS data, coupled with fine-tuning only the final dense layer, provides reasonable improvements in word error rates.
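
A hedged PyTorch sketch of that recipe, with a toy two-module network standing in for the real end-to-end ASR model; the module names, sizes, and learning rate are illustrative assumptions.

import torch
import torch.nn as nn

# Toy stand-in for an end-to-end ASR model: an encoder plus a final
# dense projection onto the output vocabulary.
model = nn.Sequential()
model.add_module("encoder", nn.Linear(80, 256))
model.add_module("output_layer", nn.Linear(256, 1000))

# Freeze everything except the final dense layer before fine-tuning
# on synthetic TTS data from the target domain.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("output_layer")

# Hand only the still-trainable parameters to the optimizer.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)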

References

Showing 1-10 of 43 references
State-of-the-Art Speech Recognition with Sequence-to-Sequence Models
  • C. Chiu, T. Sainath, M. Bacchiani
  • Computer Science
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
TLDR
A variety of structural and optimization improvements to the Listen, Attend, and Spell model are explored, which significantly improve performance, and a multi-head attention architecture is introduced, which offers improvements over the commonly used single-head attention.
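
An illustrative PyTorch sketch of the single-head vs. multi-head contrast this summary refers to; the dimensions and head count are placeholders, not the paper's configuration.

import torch
import torch.nn as nn

# Decoder queries attend over encoder outputs; multiple heads let the
# model attend to different parts of the acoustic sequence in parallel.
encoder_out = torch.randn(8, 100, 256)   # (batch, frames, model_dim)
decoder_q = torch.randn(8, 1, 256)       # current decoder query

single_head = nn.MultiheadAttention(256, num_heads=1, batch_first=True)
multi_head = nn.MultiheadAttention(256, num_heads=4, batch_first=True)

# Same interface; the 4-head version splits the 256-dim space into four
# 64-dim subspaces, each with its own attention distribution.
ctx_1, _ = single_head(decoder_q, encoder_out, encoder_out)
ctx_4, _ = multi_head(decoder_q, encoder_out, encoder_out)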
ISI ASR System for the Low Resource Speech Recognition Challenge for Indian Languages
TLDR
The ISI ASR system used to generate ISI's submissions for the Gujarati, Tamil, and Telugu tasks of the Low Resource Speech Recognition Challenge for Indian Languages demonstrates, to the best of the authors' knowledge, one of the first times such systems have been applied to low-resource languages, with performance comparable to, and in some cases better than, hybrid DNN systems.
Attention-Based Models for Speech Recognition
TLDR
The attention mechanism is extended with features needed for speech recognition, and a novel, generic method of adding location-awareness to the attention mechanism is proposed to alleviate the issue of high phoneme error rates.
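
A minimal PyTorch sketch of the location-awareness idea, assuming additive (Bahdanau-style) attention; the layer sizes and convolution kernel width are illustrative, not the paper's exact settings.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationAwareAttention(nn.Module):
    # Additive attention whose energies also see features extracted, via a
    # 1-D convolution, from the previous step's attention weights.
    def __init__(self, enc_dim=256, dec_dim=256, att_dim=128, k=31):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, att_dim, bias=False)
        self.W_dec = nn.Linear(dec_dim, att_dim, bias=False)
        self.conv = nn.Conv1d(1, att_dim, kernel_size=k, padding=k // 2)
        self.v = nn.Linear(att_dim, 1, bias=False)

    def forward(self, enc_out, dec_state, prev_align):
        # enc_out: (B, T, enc_dim); dec_state: (B, dec_dim); prev_align: (B, T)
        loc = self.conv(prev_align.unsqueeze(1)).transpose(1, 2)
        energy = self.v(torch.tanh(
            self.W_enc(enc_out) + self.W_dec(dec_state).unsqueeze(1) + loc))
        align = F.softmax(energy.squeeze(-1), dim=-1)
        context = torch.bmm(align.unsqueeze(1), enc_out).squeeze(1)
        return context, align

att = LocationAwareAttention()
ctx, align = att(torch.randn(2, 50, 256), torch.randn(2, 256),
                 torch.full((2, 50), 1 / 50))  # uniform initial alignment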
Listen, Attend and Spell
TLDR
A neural network that learns to transcribe speech utterances to characters without making any independence assumptions between the characters, which is the key improvement of LAS over previous end-to-end CTC models.
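
A toy PyTorch sketch of the point about independence assumptions: the speller feeds each emitted character back into the decoder, so every prediction is conditioned on the full character history, P(y_t | y_<t, x). The attention context is stubbed out and all sizes are placeholders.

import torch
import torch.nn as nn

vocab, hid = 30, 256
embed = nn.Embedding(vocab, hid)
speller = nn.LSTMCell(hid * 2, hid)      # consumes prev char + context
to_char = nn.Linear(hid, vocab)

h = c = torch.zeros(1, hid)
prev_char = torch.tensor([1])            # <sos> token
context = torch.zeros(1, hid)            # would come from attention
for _ in range(5):                       # greedy decoding loop
    h, c = speller(torch.cat([embed(prev_char), context], dim=-1), (h, c))
    prev_char = to_char(h).argmax(dim=-1)  # fed back at the next step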
Joint CTC-attention based end-to-end speech recognition using multi-task learning
TLDR
A novel method for end-to-end speech recognition to improve robustness and achieve fast convergence by using a joint CTC-attention model within the multi-task learning framework, thereby mitigating the alignment issue.
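
A hedged sketch of that joint objective: interpolate a CTC loss on the encoder outputs with the attention decoder's cross-entropy. The weight lam and the tensor shapes are illustrative assumptions.

import torch.nn.functional as F

def joint_ctc_attention_loss(enc_log_probs, targets, input_lens, target_lens,
                             dec_logits, dec_targets, lam=0.3):
    # enc_log_probs: (frames, batch, vocab) log-softmax outputs for CTC;
    # dec_logits: (batch, chars, vocab) from the attention decoder.
    ctc = F.ctc_loss(enc_log_probs, targets, input_lens, target_lens)
    att = F.cross_entropy(dec_logits.transpose(1, 2), dec_targets)
    # Multi-task interpolation; the CTC branch regularizes alignments.
    return lam * ctc + (1.0 - lam) * att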
A Comparison of Sequence-to-Sequence Models for Speech Recognition
TLDR
It is found that the sequence-to-sequence models are competitive with traditional state-of-the-art approaches on dictation test sets, although the baseline, which uses a separate pronunciation and language model, outperforms these models on voice-search test sets.
FastSpeech: Fast, Robust and Controllable Text to Speech
TLDR
A novel feed-forward network based on Transformer to generate mel-spectrogram in parallel for TTS is proposed, which speeds up mel-Spectrogram generation by 270x and the end-to-end speech synthesis by 38x and is called FastSpeech.
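
A minimal PyTorch sketch of the non-autoregressive idea: a feed-forward Transformer emits all mel frames in one parallel pass. FastSpeech's duration predictor and length regulator are omitted, and all sizes are placeholders.

import torch
import torch.nn as nn

phoneme_embed = nn.Embedding(70, 256)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=4)
to_mel = nn.Linear(256, 80)   # project to 80-bin mel-spectrogram frames

phonemes = torch.randint(0, 70, (1, 120))       # assumed length-regulated
mel = to_mel(encoder(phoneme_embed(phonemes)))  # (1, 120, 80), no AR loop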
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
TLDR
This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
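
A minimal sketch of the frequency- and time-masking part of SpecAugment on a log filter-bank matrix (time warping is omitted); the mask widths are illustrative and the input is assumed larger than the masks.

import torch

def spec_augment(features, freq_mask=15, time_mask=35):
    # features: (frames, mel_bins) log filter-bank coefficients.
    t, f = features.shape
    out = features.clone()
    # Zero out one random band of frequency channels...
    f0 = torch.randint(0, f - freq_mask, (1,)).item()
    out[:, f0:f0 + torch.randint(0, freq_mask + 1, (1,)).item()] = 0.0
    # ...and one random span of time frames.
    t0 = torch.randint(0, t - time_mask, (1,)).item()
    out[t0:t0 + torch.randint(0, time_mask + 1, (1,)).item(), :] = 0.0
    return out

augmented = spec_augment(torch.randn(300, 80))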
Interspeech 2018 Low Resource Automatic Speech Recognition Challenge for Indian Languages
TLDR
A low-resource Automatic Speech Recognition challenge for Indian languages as part of Interspeech 2018, which received 109 submissions from 18 research groups and evaluated the systems in terms of Word Error Rate on a blind test set.
A Comparative Study on Transformer vs RNN in Speech Applications
TLDR
The emergent sequence-to-sequence model Transformer achieves state-of-the-art performance in neural machine translation and other natural language processing applications; this study finds the surprising superiority of Transformer over RNN in 13 of 15 ASR benchmarks.
...