• Corpus ID: 3264579

Multi-Channel Speech Recognition : LSTMs All the Way Through

Hakan Erdogan, Tomoki Hayashi, John R. Hershey, T. Hori, Chiori Hori, Wei-Ning Hsu, Suyoun Kim, Jonathan Le Roux, Zhong Meng, Shinji Watanabe

Long Short-Term Memory recurrent neural networks (LSTMs) have demonstrable advantages on a variety of sequential learning tasks. In this paper we demonstrate an LSTM “triple threat” system for speech recognition, where LSTMs drive the three main subsystems: microphone array processing, acoustic modeling, and language modeling. This LSTM trifecta is applied to the CHiME-4 distant recognition challenge. Our previous state-of-the-art ASR systems for the previous CHiME challenge employed LSTM mask… 

Building state-of-the-art distant speech recognition using the CHiME-4 challenge with a setup of speech enhancement baseline

This paper describes a new baseline system for automatic speech recognition (ASR) in the CHiME-4 challenge to promote the development of noisy ASR in speech processing communities by providing 1)

Speaker Adaptation for Multichannel End-to-End Speech Recognition

Experimental results using CHiME-4 show that the proposed multi-path adaptation scheme improves ASR performance and adapting the encoder network is more effective than adapting the neural beamformer, attention mechanism, or decoder network.

DenseNet BLSTM for Acoustic Modeling in Robust ASR

The DenseNet topology is modified to become a kind of feature extractor for the subsequent BLSTM network operating on whole speech utterances and is able to consistently outperform a top-performing baseline based on wide residual networks and BLSTMs providing a 2.4% relative WER reduction on the real test set.

Modular Hybrid Autoregressive Transducer

This work proposes a modular hybrid autoregressive transducer (MHAT) that has structurally separated label and blank decoders to predict label and blank distributions, respectively, along with a shared acoustic encoder.

Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition

An internal LM estimation (ILME) method to facilitate a more effective integration of the external LM with all pre-existing E2E models with no additional model training, including the most popular recurrent neural network transducer (RNN-T) and attention-based encoder-decoder (AED) models.

Joint Training of Complex Ratio Mask Based Beamformer and Acoustic Model for Noise Robust ASR

  • Yong Xu, Chao Weng, Dong Yu
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
The complex ratio mask (CRM) is proposed to estimate the covariance matrix for the beamformer and a long short-term memory (LSTM) based language model is utilized to re-score hypotheses which further improves the overall performance.

L2RS: A Learning-to-Rescore Mechanism for Automatic Speech Recognition

A novel Learning-to-Rescore (L2RS) mechanism is proposed, which is specialized for utilizing a wide range of textual information from the state-of-the-art NLP models and automatically deciding their weights to rescore the N-best lists for ASR systems.

Character-Aware Attention-Based End-to-End Speech Recognition

A novel character-aware (CA) AED model in which each WSU embedding is computed by summarizing the embeddings of its constituent characters using a CA-RNN, which significantly reduces the model parameters in a traditional AED.

Non-Uniform MCE Training of Deep Long Short-Term Memory Recurrent Neural Networks for Keyword Spotting

A deep bidirectional long short-term memory (BLSTM) hidden Markov model (HMM) based acoustic model with non-uniform boosted minimum classification error (BMCE) criterion which imposes more significant error cost on the keywords than those on the non-keywords is trained.

Unified Architecture for Multichannel End-to-End Speech Recognition With Neural Beamforming

This paper proposes a unified architecture for end-to-end automatic speech recognition (ASR) that brings microphone-array signal processing, such as a state-of-the-art neural beamformer, into the end-to-end framework, and it demonstrates the effectiveness of the proposed method on multichannel ASR benchmarks in noisy environments.



Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks

Several integration architectures are proposed and tested, including a pipeline architecture of LSTM-based SE and ASR with sequence training, an alternating estimation architecture, and a multi-task hybrid LSTM network architecture.

The NTT CHiME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices

NTT's CHiME-3 system is described; it integrates advanced speech enhancement and recognition techniques and achieves a 3.45% development error rate and a 5.83% evaluation error rate.

Recurrent deep neural networks for robust speech recognition

Full recurrent connections are added to a certain hidden layer of a conventional feedforward DNN, allowing the model to capture the temporal dependency in deep representations and to achieve state-of-the-art performance without front-end preprocessing, speaker adaptive training, or multiple decoding passes.

Neural Network Adaptive Beamforming for Robust Multichannel Speech Recognition

A neural network adaptive beamforming (NAB) technique that uses LSTM layers to predict time domain beamforming filter coefficients at each input frame and achieves a 12.7% relative improvement in WER over a single channel model.

End-to-end attention-based large vocabulary speech recognition

This work investigates an alternative method for sequence modelling based on an attention mechanism that allows a Recurrent Neural Network (RNN) to learn alignments between sequences of input frames and output labels.

KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition

Experiments demonstrate that the proposed adaptation technique can provide 2%-30% relative error reduction against the already very strong speaker independent CD-DNN-HMM systems using different adaptation sets under both supervised and unsupervised adaptation setups.

Joint CTC-attention based end-to-end speech recognition using multi-task learning

A novel method for end-to-end speech recognition to improve robustness and achieve fast convergence by using a joint CTC-attention model within the multi-task learning framework, thereby mitigating the alignment issue.
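
The multi-task objective mentioned in this summary is commonly written as an interpolation of the two losses. The symbols below (the weight \lambda and the loss names) do not appear in this listing and are given only as the usual notation for this kind of joint training:

```latex
\mathcal{L}_{\mathrm{MTL}} = \lambda\,\mathcal{L}_{\mathrm{CTC}} + (1-\lambda)\,\mathcal{L}_{\mathrm{att}}, \qquad 0 \le \lambda \le 1
```

With \lambda = 0 this reduces to a purely attention-based model, and with \lambda = 1 to a purely CTC-based one.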

Towards End-To-End Speech Recognition with Recurrent Neural Networks

This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the

Sequence-discriminative training of deep neural networks

Different sequence-discriminative criteria are shown to lower word error rates by 7-9% relative, on a standard 300 hour American conversational telephone speech task.

Neural network based spectral mask estimation for acoustic beamforming

A neural network based approach to acoustic beamforming is presented, used to estimate spectral masks from which the Cross-Power Spectral Density matrices of speech and noise are estimated, which are used to compute the beamformer coefficients.
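
The pipeline in this last summary (masks, then spatial covariance matrices, then beamformer coefficients) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the method of any paper listed here: the masks are taken as given rather than predicted by a network, the function names are hypothetical, and the beamformer is an MVDR filter in its reference-channel form, chosen only as one common option since the summary does not name a specific beamformer.

```python
import numpy as np

def masked_psd(stft, mask, eps=1e-10):
    """Mask-weighted spatial covariance (PSD) matrix per frequency bin.

    stft: complex array of shape (F, T, C) = (frequency, frame, channel).
    mask: real array of shape (F, T) with values in [0, 1].
    Returns an array of shape (F, C, C).
    """
    # Sum over frames of mask * y y^H, one C x C matrix per frequency.
    psd = np.einsum("ft,ftc,ftd->fcd", mask, stft, stft.conj())
    norm = np.maximum(mask.sum(axis=1), eps)
    return psd / norm[:, None, None]

def mvdr_weights(psd_speech, psd_noise, ref_channel=0, diag_load=1e-6):
    """MVDR filter in reference-channel form (an assumed choice).

    w = (Phi_n^{-1} Phi_s) u / trace(Phi_n^{-1} Phi_s),
    where u selects the reference channel. Returns shape (F, C).
    """
    n_ch = psd_noise.shape[-1]
    # Light diagonal loading keeps the noise matrix well conditioned.
    load = diag_load * np.trace(psd_noise, axis1=-2, axis2=-1).real / n_ch
    psd_noise = psd_noise + load[:, None, None] * np.eye(n_ch)
    numerator = np.linalg.solve(psd_noise, psd_speech)  # (F, C, C)
    trace = np.trace(numerator, axis1=-2, axis2=-1)
    return numerator[..., ref_channel] / (trace[:, None] + 1e-10)

def apply_beamformer(weights, stft):
    """Apply w^H y at every time-frequency point; returns shape (F, T)."""
    return np.einsum("fc,ftc->ft", weights.conj(), stft)
```

In use, a speech mask and a noise mask would each be passed through `masked_psd`, the two resulting matrices handed to `mvdr_weights`, and the weights applied to the multichannel STFT with `apply_beamformer` to obtain a single enhanced channel.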