Attention based end to end Speech Recognition for Voice Search in Hindi and English

Raviraj Joshi and Venkateshan Kannan. Attention based end to end Speech Recognition for Voice Search in Hindi and English. Forum for Information Retrieval Evaluation.
We describe here our work on automatic speech recognition (ASR) in the context of voice search functionality on the Flipkart e-commerce platform. Starting with the deep learning architecture of Listen-Attend-Spell (LAS), we build upon and expand the model design and attention mechanisms to incorporate innovative approaches including multi-objective training, multi-pass training, and external rescoring using language models and phoneme-based losses. We report a relative WER improvement of 15.7…
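The external language-model rescoring mentioned in the abstract is commonly realized by re-ranking an n-best list with an interpolated score. A minimal sketch is below; the function name, the `(text, score)` pair layout, and the linear interpolation weight `lam` are illustrative assumptions, not the paper's exact recipe.

```python
def rescore_nbest(hypotheses, lm_score, lam=0.5):
    """Re-rank an n-best list of ASR hypotheses with an external LM.

    hypotheses: list of (text, acoustic_log_prob) pairs from the decoder.
    lm_score:   callable mapping text -> language-model log-probability.
    lam:        interpolation weight (hypothetical hyperparameter name);
                total score = acoustic + lam * LM.
    Returns the text of the best-scoring hypothesis.
    """
    return max(hypotheses, key=lambda h: h[1] + lam * lm_score(h[0]))[0]
```

With `lam=0` this reduces to picking the decoder's own top hypothesis; larger values let the language model override acoustically preferred but implausible transcripts.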


A Simple Baseline for Domain Adaptation in End to End ASR Systems Using Synthetic Data
Automatic Speech Recognition (ASR) has been dominated by deep learning-based end-to-end speech recognition models. These approaches require large amounts of labeled data in the form of audio-text pairs.


State-of-the-Art Speech Recognition with Sequence-to-Sequence Models
  • C. Chiu, T. Sainath, M. Bacchiani
  • Computer Science
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
A variety of structural and optimization improvements to the Listen, Attend, and Spell model are explored, which significantly improve performance, and a multi-head attention architecture is introduced, which offers improvements over the commonly used single-head attention.
ISI ASR System for the Low Resource Speech Recognition Challenge for Indian Languages
  • J. Billa
  • Computer Science
  • 2018
The ISI ASR system used to generate ISI's submissions across the Gujarati, Tamil, and Telugu speech recognition tasks of the Low Resource Speech Recognition Challenge for Indian Languages demonstrates, to the best of the authors' knowledge, one of the first times such systems have been applied to low-resource languages, with performance comparable to, and in some cases better than, hybrid DNN systems.
Attention-Based Models for Speech Recognition
The attention mechanism is extended with features needed for speech recognition, and a novel, generic method of adding location-awareness to the attention mechanism is proposed to alleviate the issue of high phoneme error rate.
Listen, Attend and Spell
A neural network that learns to transcribe speech utterances to characters without making any independence assumptions between the characters, which is the key improvement of LAS over previous end-to-end CTC models.
Joint CTC-attention based end-to-end speech recognition using multi-task learning
A novel method for end-to-end speech recognition improves robustness and achieves fast convergence by using a joint CTC-attention model within the multi-task learning framework, thereby mitigating the alignment issue.
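The joint CTC-attention objective referenced above is a weighted sum of the two losses. The sketch below shows only that combination; the weight name `lam` and its default are illustrative assumptions, not the cited paper's tuned values.

```python
def joint_ctc_attention_loss(ctc_loss, attention_loss, lam=0.3):
    """Multi-task objective: lam * L_CTC + (1 - lam) * L_attention.

    ctc_loss / attention_loss: scalar losses from the two decoder branches.
    lam: interpolation weight in [0, 1] (hypothetical name; the cited
         work tunes its own value). lam=0 is pure attention, lam=1 pure CTC.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * ctc_loss + (1.0 - lam) * attention_loss
```

The CTC branch enforces monotonic audio-text alignment during training, which is what stabilizes the attention decoder's otherwise unconstrained alignments.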
A Comparison of Sequence-to-Sequence Models for Speech Recognition
It is found that the sequence-to-sequence models are competitive with traditional state-of-the-art approaches on dictation test sets, although the baseline, which uses a separate pronunciation and language model, outperforms these models on voice-search test sets.
FastSpeech: Fast, Robust and Controllable Text to Speech
A novel feed-forward network based on Transformer is proposed to generate mel-spectrograms in parallel for TTS; it speeds up mel-spectrogram generation by 270x and end-to-end speech synthesis by 38x, and is called FastSpeech.
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
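The frequency- and time-masking operations at the core of SpecAugment can be sketched in a few lines. This is a simplified illustration assuming a `(freq, time)` NumPy array; the parameter names and defaults are hypothetical, and the full published policy also includes time warping, which is omitted here.

```python
import numpy as np

def spec_augment(spec, num_freq_masks=1, freq_width=8,
                 num_time_masks=1, time_width=20, rng=None):
    """Zero out random frequency bands and time bands of a spectrogram.

    spec: 2-D array of filter bank features, shape (freq_bins, time_steps).
    Mask widths are drawn uniformly from [0, *_width]; a simplified sketch
    of SpecAugment's masking, not the paper's exact policy.
    """
    rng = np.random.default_rng(rng)
    out = spec.copy()
    n_freq, n_time = out.shape
    for _ in range(num_freq_masks):
        w = int(rng.integers(0, freq_width + 1))      # band height
        f0 = int(rng.integers(0, max(1, n_freq - w + 1)))
        out[f0:f0 + w, :] = 0.0                        # mask frequency band
    for _ in range(num_time_masks):
        w = int(rng.integers(0, time_width + 1))       # band length
        t0 = int(rng.integers(0, max(1, n_time - w + 1)))
        out[:, t0:t0 + w] = 0.0                        # mask time band
    return out
```

Because the masking is applied to the features rather than the waveform, it composes with any acoustic frontend and adds negligible cost to training.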
Interspeech 2018 Low Resource Automatic Speech Recognition Challenge for Indian Languages
A low-resource Automatic Speech Recognition challenge for Indian languages was organized as part of Interspeech 2018; it received 109 submissions from 18 research groups, and the systems were evaluated in terms of Word Error Rate on a blind test set.
TDNN-based Multilingual Speech Recognition System for Low Resource Indian Languages
A multilingual Time Delay Neural Network (TDNN) system that uses combined acoustic modeling and language-specific information to decode the input test sequences obtains Word Error Rates of 16.07%, 17.14%, and 17.69%, respectively, and was the second best system at the challenge.