Streaming End-to-End Speech Recognition with Jointly Trained Neural Feature Enhancement
@article{Kim2021StreamingES,
  title={Streaming End-to-End Speech Recognition with Jointly Trained Neural Feature Enhancement},
  author={Chanwoo Kim and Abhinav Garg and Dhananjaya N. Gowda and Seongkyu Mun and Chang Woo Han},
  journal={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021},
  pages={6773-6777}
}
In this paper, we present a streaming end-to-end speech recognition model based on Monotonic Chunkwise Attention (MoChA) jointly trained with enhancement layers. Even though MoChA attention enables streaming speech recognition with accuracy comparable to a full attention-based approach, training this model is sensitive to various factors such as the difficulty of training examples, hyper-parameters, and so on. Because of these issues, the speech recognition accuracy of a MoChA-based…
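As a rough illustration of the joint training described in the abstract, the following is a minimal sketch (not the authors' implementation): an enhancement front-end maps noisy features to enhanced features, a streaming encoder consumes them, and an enhancement (MSE) loss is combined with the ASR loss. All module names, dimensions, and the weight `alpha` are illustrative assumptions.

```python
# Hedged sketch: joint training of a feature-enhancement front-end with an ASR
# encoder, combining an enhancement loss and an ASR loss (assumed architecture).
import torch
import torch.nn as nn

class EnhancementFrontEnd(nn.Module):
    """Maps noisy log-mel features to enhanced features (illustrative layers)."""
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, noisy_feats):            # (batch, time, feat_dim)
        return self.net(noisy_feats)

class JointModel(nn.Module):
    def __init__(self, feat_dim=80, enc_hidden=512, vocab=1024):
        super().__init__()
        self.enhancer = EnhancementFrontEnd(feat_dim)
        self.encoder = nn.LSTM(feat_dim, enc_hidden, batch_first=True)
        self.classifier = nn.Linear(enc_hidden, vocab)  # stand-in for the MoChA decoder

    def forward(self, noisy_feats):
        enhanced = self.enhancer(noisy_feats)
        enc_out, _ = self.encoder(enhanced)
        return enhanced, self.classifier(enc_out)

def joint_loss(model, noisy_feats, clean_feats, asr_loss_fn, targets, alpha=0.1):
    """Total loss = ASR loss + alpha * feature-enhancement (MSE) loss."""
    enhanced, logits = model(noisy_feats)
    enh_loss = nn.functional.mse_loss(enhanced, clean_feats)
    return asr_loss_fn(logits, targets) + alpha * enh_loss
```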
3 Citations
Multitask-based joint learning approach to robust ASR for radio communication speech
- Computer Science · 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
- 2021
This paper proposes a multitask-based method to jointly train a speech enhancement module as the front-end and an E2E ASR model as the back-end, and proposes a dual-channel data augmentation training method to obtain further improvement.
Macro-Block Dropout for Improved Regularization in Training End-to-End Speech Recognition Models
- Computer Science · 2022 IEEE Spoken Language Technology Workshop (SLT)
- 2023
This work defines a macro-block that contains a large number of units from the input to a Recurrent Neural Network (RNN) and applies random dropout to each macro-block, which has the effect of applying different dropout rates to each layer even when a constant average dropout rate is kept.
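A minimal sketch of the macro-block dropout idea described above (not the paper's code); the block count and drop rate below are illustrative assumptions.

```python
# Hedged sketch: group input units into a few large "macro-blocks" and randomly
# zero out whole blocks, rescaling to keep the expected activation unchanged.
import torch

def macro_block_dropout(x, num_blocks=4, drop_rate=0.25, training=True):
    """x: (batch, time, features); drops entire feature blocks per example."""
    if not training or drop_rate == 0.0:
        return x
    batch, _, feat = x.shape
    block_size = feat // num_blocks
    # One keep/drop decision per (example, block).
    keep = (torch.rand(batch, num_blocks, device=x.device) >= drop_rate).float()
    mask = keep.repeat_interleave(block_size, dim=1)      # (batch, num_blocks*block_size)
    if mask.shape[1] < feat:                              # keep any remainder units
        pad = torch.ones(batch, feat - mask.shape[1], device=x.device)
        mask = torch.cat([mask, pad], dim=1)
    mask = mask.unsqueeze(1)                              # broadcast over time
    return x * mask / (1.0 - drop_rate)
```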
Efficient Parallel Computing for Machine Learning at Scale
- Computer Science
- 2020
This work presents efficient parallel computing approaches for machine learning at scale, targeting the labor-intensive and therefore time-consuming and expensive training process for learning and reinforcement learning.
References
SHOWING 1-10 OF 21 REFERENCES
Attention Based On-Device Streaming Speech Recognition with Large Speech Corpus
- Computer Science · 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2019
In this paper, we present a new on-device automatic speech recognition (ASR) system based on monotonic chunk-wise attention (MoChA) models trained with a large (> 10K hours) corpus. We attained around…
Joint CTC-attention based end-to-end speech recognition using multi-task learning
- Computer Science · 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2017
A novel method for end-to-end speech recognition to improve robustness and achieve fast convergence by using a joint CTC-attention model within the multi-task learning framework, thereby mitigating the alignment issue.
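A minimal sketch of the multi-task objective described above; the interpolation weight is an illustrative assumption.

```python
# Hedged sketch: the joint CTC-attention objective interpolates a CTC loss, which
# enforces monotonic alignments, with the attention decoder's cross-entropy loss.
def joint_ctc_attention_loss(ctc_loss, attention_loss, ctc_weight=0.3):
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss
```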
State-of-the-Art Speech Recognition with Sequence-to-Sequence Models
- Computer Science · 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2018
A variety of structural and optimization improvements to the Listen, Attend, and Spell model are explored, which significantly improve performance, and a multi-head attention architecture is introduced, which offers improvements over the commonly-used single-head attention.
Attention-Based Models for Speech Recognition
- Computer Science · NIPS
- 2015
The attention mechanism is extended with features needed for speech recognition, and a novel and generic method of adding location-awareness to the attention mechanism is proposed to alleviate the issue of high phoneme error rate.
End-to-End Training of a Large Vocabulary End-to-End Speech Recognition System
- Computer Science · 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2019
The authors' end-to-end speech recognition system built using this training infrastructure showed a 2.44% WER on the LibriSpeech test-clean set after applying shallow fusion with a Transformer language model (LM).
Online Hybrid CTC/Attention Architecture for End-to-End Speech Recognition
- Computer Science · INTERSPEECH
- 2019
A stable monotonic chunkwise attention (sMoChA) is proposed to stream the attention branch, and a truncated CTC prefix probability (T-CTC) to stream the CTC branch, making the hybrid CTC/attention ASR system streamable without much word error rate degradation.
Monotonic Chunkwise Attention
- Computer Science · ICLR
- 2018
Monotonic Chunkwise Attention (MoChA), which adaptively splits the input sequence into small chunks over which soft attention is computed, is proposed and shown that models utilizing MoChA can be trained efficiently with standard backpropagation while allowing online and linear-time decoding at test time.
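A minimal sketch of MoChA-style decoding as described above (not the reference implementation): a hard monotonic pass selects a stopping frame, then soft attention is computed only over a small chunk ending at that frame. The threshold and chunk size are illustrative assumptions; training instead uses a differentiable expected-attention formulation.

```python
# Hedged sketch of inference with monotonic chunkwise attention.
import numpy as np

def mocha_decode_step(enc_states, monotonic_energy, chunk_energy,
                      start_idx=0, chunk_size=2, threshold=0.5):
    """enc_states: (T, D); *_energy: per-frame scalar scores of shape (T,)."""
    T = enc_states.shape[0]
    stop = None
    for t in range(start_idx, T):                     # scan left-to-right (monotonic)
        p_select = 1.0 / (1.0 + np.exp(-monotonic_energy[t]))
        if p_select >= threshold:                     # hard selection at inference
            stop = t
            break
    if stop is None:                                  # no frame selected this step
        return None, start_idx
    lo = max(0, stop - chunk_size + 1)
    e = chunk_energy[lo:stop + 1]
    w = np.exp(e - e.max()); w /= w.sum()             # soft attention within the chunk
    context = w @ enc_states[lo:stop + 1]             # (D,) context vector
    return context, stop
```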
Improved Vocal Tract Length Perturbation for a State-of-the-Art End-to-End Speech Recognition System
- Computer Science · INTERSPEECH
- 2019
An improved vocal tract length perturbation (VTLP) algorithm is presented as a data augmentation technique for an attention-based end-to-end speech recognition system, evaluated both with shallow fusion of a Transformer LM and without using any Language Models (LMs).
Generation of Large-Scale Simulated Utterances in Virtual Rooms to Train Deep-Neural Networks for Far-Field Speech Recognition in Google Home
- Physics · INTERSPEECH
- 2017
The structure and application of an acoustic room simulator to generate large-scale simulated data for training deep neural networks for far-field speech recognition and performance is evaluated using a factored complex Fast Fourier Transform (CFFT) acoustic model introduced in earlier work.
Efficient Implementation of the Room Simulator for Training Deep Neural Network Acoustic Models
- Computer Science · INTERSPEECH
- 2018
This work implements an efficient OverLap Addition (OLA) based filtering using the open-source FFTW3 library and investigates the effects of the Room Impulse Response (RIR) lengths.
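A minimal sketch of overlap-add (OLA) filtering as described above; the paper implements it with the FFTW3 library, whereas NumPy's FFT is used here as a stand-in, and the block size is an illustrative assumption.

```python
# Hedged sketch: convolve a signal with a room impulse response (RIR) block by
# block via FFT, overlap-adding the filtered blocks into the output.
import numpy as np

def ola_filter(signal, rir, block_size=4096):
    n_fft = 1
    while n_fft < block_size + len(rir) - 1:          # next power of two for the FFT
        n_fft *= 2
    H = np.fft.rfft(rir, n_fft)                       # RIR spectrum, computed once
    out = np.zeros(len(signal) + len(rir) - 1)
    for start in range(0, len(signal), block_size):
        block = signal[start:start + block_size]
        y = np.fft.irfft(np.fft.rfft(block, n_fft) * H, n_fft)
        seg = y[:len(block) + len(rir) - 1]
        out[start:start + len(seg)] += seg            # overlap-add the filtered block
    return out
```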