Streaming End-to-End Speech Recognition with Jointly Trained Neural Feature Enhancement

  • Authors: Chanwoo Kim, Abhinav Garg, Dhananjaya N. Gowda, Seongkyu Mun, Chang Woo Han
  • Published 4 May 2021
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
In this paper, we present a streaming end-to-end speech recognition model based on Monotonic Chunkwise Attention (MoChA) jointly trained with enhancement layers. Although MoChA attention enables streaming speech recognition with accuracy comparable to a full-attention approach, training this model is sensitive to various factors such as the difficulty of training examples, hyper-parameters, and so on. Because of these issues, the speech recognition accuracy of a MoChA-based…


Multitask-based joint learning approach to robust ASR for radio communication speech

This paper proposes a multitask-based method to jointly train a speech enhancement module as the front-end and an E2E ASR model as the back-end, and a dual-channel data augmentation training method to obtain further improvement.

Macro-Block Dropout for Improved Regularization in Training End-to-End Speech Recognition Models

This work defines a macro-block that contains a large number of units from the input to a Recurrent Neural Network (RNN) and applies random dropout to each macro-block, which has the effect of applying a different dropout rate to each layer even while keeping a constant average dropout rate.
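The block-level dropout described above can be illustrated with a minimal pure-Python sketch; the function name `macro_block_dropout` and the parameters `num_blocks` and `p_drop` are illustrative choices, not the paper's actual parameterization:

```python
import random

def macro_block_dropout(x, num_blocks=4, p_drop=0.2, training=True):
    """Zero out contiguous macro-blocks of units in a feature vector.

    x: list of floats (units feeding an RNN layer). Dropping whole
    blocks is a coarser regularizer than unit-level dropout; kept
    blocks are scaled by 1/(1 - p_drop) (inverted dropout) so the
    expected activation is unchanged.
    """
    if not training:
        return list(x)
    block = max(1, len(x) // num_blocks)
    out = list(x)
    for b in range(num_blocks):
        lo, hi = b * block, min((b + 1) * block, len(x))
        if random.random() < p_drop:
            for i in range(lo, hi):      # drop the whole macro-block
                out[i] = 0.0
        else:
            for i in range(lo, hi):      # rescale the kept block
                out[i] = out[i] / (1.0 - p_drop)
    return out
```

Because whole blocks are dropped together, the effective dropout rate seen by any single layer fluctuates around the average `p_drop`, which is the effect the summary describes.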

Efficient Parallel Computing for Machine Learning at Scale

A state-of-the-art approach to parallel computing is presented that automates the labor-intensive, and therefore time-consuming and expensive, process of training machine learning and reinforcement learning models at scale.

Attention Based On-Device Streaming Speech Recognition with Large Speech Corpus

In this paper, we present a new on-device automatic speech recognition (ASR) system based on monotonic chunk-wise attention (MoChA) models trained with a large (> 10K hours) corpus. We attained around…

Joint CTC-attention based end-to-end speech recognition using multi-task learning

A novel method for end-to-end speech recognition that improves robustness and achieves fast convergence by using a joint CTC-attention model within a multi-task learning framework, thereby mitigating the alignment issue.
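The multi-task objective behind this approach is commonly written as an interpolation of the two losses, L = λ·L_CTC + (1 − λ)·L_attention; the sketch below assumes the losses are already computed as scalars, and the weight name `lam` is illustrative:

```python
def joint_ctc_attention_loss(ctc_loss, attention_loss, lam=0.3):
    """Interpolated multi-task loss for joint CTC-attention training.

    lam weights the CTC branch; the monotonic alignment that CTC
    enforces regularizes the attention decoder during training,
    which is what mitigates the alignment issue.
    """
    assert 0.0 <= lam <= 1.0
    return lam * ctc_loss + (1.0 - lam) * attention_loss
```

With `lam = 0` the model reduces to pure attention training; with `lam = 1` it is pure CTC.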

State-of-the-Art Speech Recognition with Sequence-to-Sequence Models

A variety of structural and optimization improvements to the Listen, Attend, and Spell model are explored, which significantly improve performance, and a multi-head attention architecture is introduced, which offers improvements over the commonly-used single-head attention.

Attention-Based Models for Speech Recognition

The attention mechanism is extended with features needed for speech recognition, and a novel, generic method of adding location-awareness to the attention mechanism is proposed to alleviate the issue of high phoneme error rates.
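Location-awareness means feeding the previous step's alignment, filtered by a small convolution, into the current attention scores, so the model knows where it attended last. A minimal pure-Python sketch with scalar features (all names and weights here are illustrative simplifications, not the paper's parameterization):

```python
import math

def location_aware_scores(enc, query_bias, prev_align, conv_kernel):
    """Toy location-aware attention: score_j depends on the encoder
    feature enc[j], a decoder-state bias, and f_j, a 1-D convolution
    of the previous alignment centered at position j."""
    half = len(conv_kernel) // 2
    T = len(prev_align)
    scores = []
    for j in range(T):
        f_j = sum(conv_kernel[k] * prev_align[j + k - half]
                  for k in range(len(conv_kernel))
                  if 0 <= j + k - half < T)
        scores.append(math.tanh(enc[j] + f_j + query_bias))
    # softmax over positions gives the new alignment
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

The convolution term biases the new alignment toward (or just past) the previously attended region, which is what discourages the attention from jumping around and inflating the phoneme error rate.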

End-to-End Training of a Large Vocabulary End-to-End Speech Recognition System

The authors' end-to-end speech recognition system built using this training infrastructure showed a 2.44% WER on the test-clean subset of LibriSpeech after applying shallow fusion with a Transformer language model (LM).

Online Hybrid CTC/Attention Architecture for End-to-End Speech Recognition

A stable monotonic chunkwise attention (sMoChA) mechanism to stream the attention branch and a truncated CTC prefix probability (T-CTC) to stream the CTC branch, allowing the hybrid CTC/attention ASR system to be streamed without much word error rate degradation.

Monotonic Chunkwise Attention

Monotonic Chunkwise Attention (MoChA), which adaptively splits the input sequence into small chunks over which soft attention is computed, is proposed and shown that models utilizing MoChA can be trained efficiently with standard backpropagation while allowing online and linear-time decoding at test time.
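At decoding time the mechanism above amounts to a hard monotonic scan followed by soft attention over a small chunk. A minimal pure-Python sketch of one decoder step (function and parameter names such as `mocha_decode_step` and `threshold` are illustrative, and the training-time expected-alignment computation that makes MoChA differentiable is omitted):

```python
import math

def mocha_decode_step(energies, chunk_energies, prev_t,
                      chunk_size=3, threshold=0.5):
    """One test-time MoChA step.

    Scan encoder positions monotonically from the previous stop point
    prev_t; stop at the first position whose selection probability
    sigmoid(energy) exceeds the threshold, then compute soft attention
    over the chunk_size frames ending there. Returns (weights, t),
    or (None, prev_t) if no frame is selected.
    """
    T = len(energies)
    t = prev_t
    while t < T:
        p_select = 1.0 / (1.0 + math.exp(-energies[t]))
        if p_select >= threshold:
            break
        t += 1
    if t == T:
        return None, prev_t
    start = max(0, t - chunk_size + 1)
    chunk = chunk_energies[start:t + 1]
    m = max(chunk)
    exps = [math.exp(c - m) for c in chunk]
    z = sum(exps)
    weights = [0.0] * T
    for i, e in enumerate(exps):
        weights[start + i] = e / z
    return weights, t
```

Because each step only looks at frames up to the stop point, decoding is online and linear-time, which is the property the summary highlights.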

Improved Vocal Tract Length Perturbation for a State-of-the-Art End-to-End Speech Recognition System

An improved vocal tract length perturbation (VTLP) algorithm is presented as a data augmentation technique, evaluated using the shallow-fusion technique with a Transformer LM as well as with an attention-based end-to-end speech recognition system without using any Language Models (LMs).

Generation of Large-Scale Simulated Utterances in Virtual Rooms to Train Deep-Neural Networks for Far-Field Speech Recognition in Google Home

The structure and application of an acoustic room simulator to generate large-scale simulated data for training deep neural networks for far-field speech recognition are described, and performance is evaluated using a factored complex Fast Fourier Transform (CFFT) acoustic model introduced in earlier work.

Efficient Implementation of the Room Simulator for Training Deep Neural Network Acoustic Models

This work implements efficient OverLap-Add (OLA) based filtering using the open-source FFTW3 library and investigates the effects of Room Impulse Response (RIR) lengths.
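The OLA structure can be sketched in a few lines of pure Python: cut the signal into blocks, filter each block, and add the overlapping tails back into the output. In an optimized implementation (such as the FFTW3-based one in the paper) the per-block filtering is done via FFT; here each block is convolved directly to keep the sketch dependency-free, and `block_size` is an illustrative parameter:

```python
def ola_filter(signal, rir, block_size=64):
    """Overlap-add (OLA) filtering of a signal with a room impulse
    response (RIR).

    Each block of the input is convolved with the RIR; because
    convolution is linear, summing the overlapping block outputs
    reproduces the full convolution exactly.
    """
    out = [0.0] * (len(signal) + len(rir) - 1)
    for start in range(0, len(signal), block_size):
        block = signal[start:start + block_size]
        for i, x in enumerate(block):
            for j, h in enumerate(rir):
                out[start + i + j] += x * h
    return out
```

The block size trades latency against per-block FFT efficiency, which is why the paper's study of RIR lengths matters: longer RIRs mean longer per-block tails to overlap and add.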