Multi-Channel Speech Recognition: LSTMs All the Way Through
@inproceedings{Erdogan2016MultiChannelSR,
  title={Multi-Channel Speech Recognition: LSTMs All the Way Through},
  author={Hakan Erdogan and Tomoki Hayashi and John R. Hershey and T. Hori and Chiori Hori and Wei-Ning Hsu and Suyoun Kim and Jonathan Le Roux and Zhong Meng and Shinji Watanabe},
  year={2016}
}
Long Short-Term Memory recurrent neural networks (LSTMs) have demonstrable advantages on a variety of sequential learning tasks. In this paper we demonstrate an LSTM “triple threat” system for speech recognition, where LSTMs drive the three main subsystems: microphone array processing, acoustic modeling, and language modeling. This LSTM trifecta is applied to the CHiME-4 distant recognition challenge. Our state-of-the-art ASR systems for the previous CHiME challenge employed LSTM mask…
66 Citations
Building state-of-the-art distant speech recognition using the CHiME-4 challenge with a setup of speech enhancement baseline
- Computer Science · INTERSPEECH
- 2018
This paper describes a new baseline system for automatic speech recognition (ASR) in the CHiME-4 challenge to promote the development of noisy ASR in speech processing communities by providing 1)…
Speaker Adaptation for Multichannel End-to-End Speech Recognition
- Computer Science · 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2018
Experimental results using CHiME-4 show that the proposed multi-path adaptation scheme improves ASR performance and adapting the encoder network is more effective than adapting the neural beamformer, attention mechanism, or decoder network.
Densenet Blstm for Acoustic Modeling in Robust ASR
- Computer Science · 2018 IEEE Spoken Language Technology Workshop (SLT)
- 2018
The DenseNet topology is modified to become a kind of feature extractor for the subsequent BLSTM network operating on whole speech utterances and is able to consistently outperform a top-performing baseline based on wide residual networks and BLSTMs providing a 2.4% relative WER reduction on the real test set.
Modular Hybrid Autoregressive Transducer
- Computer Science · arXiv
- 2022
This work proposes a modular hybrid autoregressive transducer (MHAT) that has structurally separated label and blank decoders to predict label and blank distributions, respectively, along with a shared acoustic encoder.
Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition
- Computer Science · 2021 IEEE Spoken Language Technology Workshop (SLT)
- 2021
An internal LM estimation (ILME) method is proposed to facilitate more effective integration of an external LM with all pre-existing E2E models, with no additional model training, including the most popular recurrent neural network transducer (RNN-T) and attention-based encoder-decoder (AED) models.
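The fusion idea described above can be illustrated as a log-domain score combination, where the external LM score is added and the estimated internal LM score is subtracted so the E2E model's training-data prior is not counted twice. This is a hedged sketch; the function name and weight values are illustrative, not from the paper.

```python
def ilme_score(log_p_e2e, log_p_ext_lm, log_p_internal_lm,
               ext_weight=0.6, ilm_weight=0.4):
    """Combine hypothesis scores in the log domain: boost by the external
    LM and subtract the estimated internal LM contribution."""
    return log_p_e2e + ext_weight * log_p_ext_lm - ilm_weight * log_p_internal_lm


# Toy hypothesis scores (log probabilities); a decoder would rank
# n-best hypotheses by this combined score.
score = ilme_score(-1.0, -2.0, -3.0, ext_weight=0.5, ilm_weight=0.5)
```

The interpolation weights would normally be tuned on a development set from the target domain.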
Joint Training of Complex Ratio Mask Based Beamformer and Acoustic Model for Noise Robust Asr
- Computer Science · 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
The complex ratio mask (CRM) is proposed to estimate the covariance matrix for the beamformer, and a long short-term memory (LSTM) based language model is used to rescore hypotheses, which further improves overall performance.
L2RS: A Learning-to-Rescore Mechanism for Automatic Speech Recognition
- Computer Science · arXiv
- 2019
A novel Learning-to-Rescore (L2RS) mechanism is proposed, which is specialized for utilizing a wide range of textual information from the state-of-the-art NLP models and automatically deciding their weights to rescore the N-best lists for ASR systems.
Character-Aware Attention-Based End-to-End Speech Recognition
- Computer Science · 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2019
A novel character-aware (CA) AED model in which each word/subword unit (WSU) embedding is computed by summarizing the embeddings of its constituent characters with a CA-RNN, significantly reducing the model parameters of a traditional AED.
Non-Uniform MCE Training of Deep Long Short-Term Memory Recurrent Neural Networks for Keyword Spotting
- Computer Science · INTERSPEECH
- 2017
A deep bidirectional long short-term memory (BLSTM) hidden Markov model (HMM) acoustic model is trained with a non-uniform boosted minimum classification error (BMCE) criterion, which imposes a higher error cost on keywords than on non-keywords.
Unified Architecture for Multichannel End-to-End Speech Recognition With Neural Beamforming
- Computer Science · IEEE Journal of Selected Topics in Signal Processing
- 2017
This paper proposes a unified architecture for end-to-end automatic speech recognition (ASR) that encompasses microphone-array signal processing, such as a state-of-the-art neural beamformer, within the end-to-end framework, and demonstrates the effectiveness of the proposed method on multichannel ASR benchmarks in noisy environments.
References
Showing 1–10 of 26 references
Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks
- Computer Science · INTERSPEECH
- 2015
Several integration architectures are proposed and tested, including a pipeline architecture of LSTM-based SE and ASR with sequence training, an alternating estimation architecture, and a multi-task hybrid LSTM network architecture.
The NTT CHiME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices
- Computer Science · 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)
- 2015
NTT's CHiME-3 system, which integrates advanced speech enhancement and recognition techniques, is described; it achieves a 3.45% development error rate and a 5.83% evaluation error rate.
Recurrent deep neural networks for robust speech recognition
- Computer Science · 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2014
Full recurrent connections are added to a hidden layer of a conventional feedforward DNN, allowing the model to capture temporal dependencies in deep representations and achieve state-of-the-art performance without front-end preprocessing, speaker-adaptive training, or multiple decoding passes.
Neural Network Adaptive Beamforming for Robust Multichannel Speech Recognition
- Computer Science · INTERSPEECH
- 2016
A neural network adaptive beamforming (NAB) technique that uses LSTM layers to predict time-domain beamforming filter coefficients at each input frame, achieving a 12.7% relative improvement in WER over a single-channel model.
End-to-end attention-based large vocabulary speech recognition
- Computer Science · 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2016
This work investigates an alternative method for sequence modelling based on an attention mechanism that allows a Recurrent Neural Network (RNN) to learn alignments between sequences of input frames and output labels.
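The alignment mechanism summarized above can be sketched as content-based attention: each encoder frame is scored against the decoder state, the scores are softmax-normalized into alignment weights, and the weighted sum of frames forms the context vector. This is a minimal illustrative sketch (dot-product scoring instead of a learned scorer); names are not from the paper.

```python
import math

def attend(decoder_state, encoder_frames):
    """Score each encoder frame against the decoder state, normalize
    with softmax, and return the context vector and alignment weights."""
    scores = [sum(s * h for s, h in zip(decoder_state, frame))
              for frame in encoder_frames]
    m = max(scores)                      # subtract max for numerical stability
    exp = [math.exp(sc - m) for sc in scores]
    z = sum(exp)
    weights = [e / z for e in exp]
    dim = len(encoder_frames[0])
    context = [sum(w * frame[d] for w, frame in zip(weights, encoder_frames))
               for d in range(dim)]
    return context, weights


# The decoder state attends most strongly to the best-matching frame.
ctx, w = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

In a full AED model the weights would feed the next decoder step, letting the RNN learn alignments between input frames and output labels jointly with recognition.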
KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition
- Computer Science · 2013 IEEE International Conference on Acoustics, Speech and Signal Processing
- 2013
Experiments demonstrate that the proposed adaptation technique can provide 2%-30% relative error reduction against the already very strong speaker independent CD-DNN-HMM systems using different adaptation sets under both supervised and unsupervised adaptation setups.
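The KL-divergence regularization used in this adaptation technique can be realized by interpolating the one-hot adaptation targets with the speaker-independent (SI) model's posteriors; training on the interpolated targets is equivalent to adding a KL penalty that keeps the adapted model close to the SI model. A minimal sketch, with illustrative names and an assumed interpolation weight:

```python
def kl_regularized_targets(hard_targets, si_posteriors, rho=0.5):
    """Interpolate one-hot targets with SI-model posteriors, row by row.

    rho = 0 recovers plain (unregularized) adaptation; larger rho pulls
    the adapted model more strongly toward the speaker-independent one.
    """
    return [[(1.0 - rho) * t + rho * p for t, p in zip(t_row, p_row)]
            for t_row, p_row in zip(hard_targets, si_posteriors)]


# One frame, two senones: hard label is class 0, SI model says (0.6, 0.4).
targets = kl_regularized_targets([[1.0, 0.0]], [[0.6, 0.4]], rho=0.5)
```

Because each input row is a valid distribution, each interpolated row still sums to one and can be used directly with a cross-entropy criterion.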
Joint CTC-attention based end-to-end speech recognition using multi-task learning
- Computer Science · 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2017
A novel method for end-to-end speech recognition to improve robustness and achieve fast convergence by using a joint CTC-attention model within the multi-task learning framework, thereby mitigating the alignment issue.
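The multi-task objective described above is commonly written as a convex combination of the two branch losses, with the CTC branch encouraging monotonic alignment. A hedged sketch (the weight value is an assumption, not from the paper):

```python
def joint_ctc_attention_loss(ctc_loss, attention_loss, lam=0.3):
    """Multi-task objective: lam * L_CTC + (1 - lam) * L_attention.

    lam = 1 trains pure CTC; lam = 0 trains the attention decoder alone.
    """
    assert 0.0 <= lam <= 1.0
    return lam * ctc_loss + (1.0 - lam) * attention_loss


# Example: combine per-utterance losses from the two branches.
loss = joint_ctc_attention_loss(2.0, 4.0, lam=0.3)
```

The same interpolation is typically reused at decoding time to combine the two branches' hypothesis scores.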
Towards End-To-End Speech Recognition with Recurrent Neural Networks
- Computer Science · ICML
- 2014
This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the…
Sequence-discriminative training of deep neural networks
- Computer Science · INTERSPEECH
- 2013
Different sequence-discriminative criteria are shown to lower word error rates by 7-9% relative, on a standard 300 hour American conversational telephone speech task.
Neural network based spectral mask estimation for acoustic beamforming
- Computer Science · 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2016
A neural network based approach to acoustic beamforming is presented: the network estimates spectral masks, from which the cross-power spectral density matrices of speech and noise are computed, and these in turn yield the beamformer coefficients.
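The mask-to-statistics step above can be sketched for a single frequency bin: the mask-weighted average of outer products of the multichannel STFT vectors gives the cross-power spectral density matrix, Phi = sum_t m_t x_t x_t^H / sum_t m_t. This is an illustrative sketch with hypothetical names; a real system would apply it per frequency bin and then derive beamformer coefficients (e.g. MVDR or GEV) from the speech and noise matrices.

```python
def masked_psd(frames, mask):
    """Mask-weighted cross-power spectral density matrix for one frequency bin.

    frames: list of multichannel complex STFT vectors x_t (one per time frame)
    mask:   per-frame speech (or noise) mask values m_t in [0, 1]
    """
    channels = len(frames[0])
    psd = [[0j] * channels for _ in range(channels)]
    for x, m in zip(frames, mask):
        for i in range(channels):
            for j in range(channels):
                psd[i][j] += m * x[i] * x[j].conjugate()
    norm = sum(mask)
    return [[v / norm for v in row] for row in psd]


# One two-channel frame with full speech mask: Phi is the outer product x x^H.
phi = masked_psd([[1 + 0j, 0 + 1j]], [1.0])
```

Using the speech-mask PSD and the noise-mask PSD together is what lets the beamformer steer toward the speech and suppress the noise.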