Light Gated Recurrent Units for Speech Recognition

Mirco Ravanelli, Philemon Brakel, Maurizio Omologo, Yoshua Bengio. IEEE Transactions on Emerging Topics in Computational Intelligence.
A field that has directly benefited from the recent advances in deep learning is automatic speech recognition (ASR). Despite the great achievements of the past decades, a natural and robust human–machine speech interaction still appears to be out of reach, especially in challenging environments characterized by significant noise and reverberation. To improve robustness, modern speech recognizers often employ acoustic models based on recurrent neural networks (RNNs) that are naturally…

Deep Learning for Distant Speech Recognition

Inspired by the idea that cooperation across different DNNs could be key to counteracting the harmful effects of noise and reverberation, a novel deep learning paradigm called the "network of deep neural networks" is proposed.

Twin Regularization for online speech recognition

This paper adds a regularization term that forces forward hidden states of a unidirectional recurrent network to be as close as possible to cotemporal backward ones, computed by a "twin" neural network running backwards in time.
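The regularization term described above can be sketched as a simple mean-squared penalty between forward and co-temporal backward hidden states. This is a minimal illustration of the idea, not the paper's implementation; the function name and the weight `lam` are illustrative assumptions.

```python
import numpy as np

def twin_regularization_loss(h_fwd, h_bwd, lam=0.1):
    """Penalty pushing each forward hidden state toward the co-temporal
    hidden state of a "twin" network running backwards in time.
    h_fwd, h_bwd: (T, H) arrays of hidden states over T time steps.
    lam is an illustrative weight, not a value from the paper."""
    return lam * np.mean((h_fwd - h_bwd) ** 2)
```

In training, this term would be added to the main ASR loss so the unidirectional forward network learns states that anticipate future context.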

Attention Is All You Need In Speech Separation

The SepFormer is proposed, a novel RNN-free Transformer-based neural network for speech separation that inherits the parallelization advantages of Transformers and achieves a competitive performance even when downsampling the encoded representation by a factor of 8.

A Deep 2D Convolutional Network for Waveform-Based Speech Recognition

This study provides empirical evidence that learning directly from the waveform domain could be more effective than learning using hand-crafted features, and proposes a deep 2D convolutional network in the waveform domain.

Projected Minimal Gated Recurrent Unit for Speech Recognition

The paper proposes inserting a smaller output projection layer after the mGRUIP-Ctx cell's output to form the PmGRU, inspired by low-rank matrix decomposition, and adjusts the ratio of context information from the previous layer to the current layer by moving the position of the batch normalization layer.

Deep Neural Network Based Speech Recognition Systems Under Noise Perturbations

This work investigates the noise immunity of various neural network models through the speech recognition task and demonstrates that the phoneme error rate (PER) degrades as the signal-to-noise ratio (SNR) decreases across all evaluated models.

Advanced Convolutional Neural Network-Based Hybrid Acoustic Models for Low-Resource Speech Recognition

Contributions combining CNNs and conventional RNNs with gate, highway, and residual networks to reduce the above problems are presented, and the optimal neural network structures and training strategies for the proposed models are explored.

Output-Gate Projected Gated Recurrent Unit for Speech Recognition

An architecture called the Projected Gated Recurrent Unit (PGRU) is proposed for automatic speech recognition (ASR) tasks, and the PGRU is shown to consistently outperform the standard GRU.

Simplified LSTMs for Speech Recognition

New variants of Long Short-Term Memory networks for sequential modeling of acoustic features are explored, showing that removing the output gate, replacing the hyperbolic tangent nonlinearity at the cell output with hard tanh, and collapsing the cell and hidden state vectors yields a model that is conceptually simpler than, and comparable in effectiveness to, a regular LSTM for speech recognition.
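The three simplifications summarized above can be sketched as a single recurrence step: no output gate, a hard tanh at the cell output, and one state vector serving as both cell and hidden state. This is a minimal sketch under those assumptions; the function and weight names are illustrative, not taken from the paper's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hard_tanh(x):
    # Piecewise-linear replacement for tanh at the cell output
    return np.clip(x, -1.0, 1.0)

def simplified_lstm_step(x, c_prev, Wi, Ui, Wf, Uf, Wc, Uc):
    """One step of the simplified LSTM variant: no output gate,
    hard tanh at the cell output, and the (clipped) cell state
    reused directly as the hidden state."""
    i = sigmoid(Wi @ x + Ui @ c_prev)        # input gate
    f = sigmoid(Wf @ x + Uf @ c_prev)        # forget gate
    c_cand = np.tanh(Wc @ x + Uc @ c_prev)   # candidate cell state
    c = f * c_prev + i * c_cand
    return hard_tanh(c)                      # collapsed cell/hidden state
```

Compared with a standard LSTM cell, this drops one gate's worth of parameters per layer while keeping the additive cell update intact.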

Improving Speech Recognition by Revising Gated Recurrent Units

This work proposes removing the reset gate from the GRU design, yielding a more efficient single-gate architecture, and replacing tanh with ReLU activations in the state update equations; the revised architecture is shown to consistently improve recognition performance across different tasks, input features, and noisy conditions compared to a standard GRU.
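The revised single-gate recurrence described above can be sketched in a few lines: only an update gate remains, and the candidate state uses ReLU instead of tanh. This is a minimal illustration under those assumptions (the full model also uses techniques such as batch normalization not shown here); weight names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def single_gate_gru_step(x, h_prev, Wz, Uz, Wh, Uh):
    """One step of a GRU variant with the reset gate removed and
    ReLU in place of tanh in the candidate state."""
    z = sigmoid(Wz @ x + Uz @ h_prev)               # update gate
    h_cand = np.maximum(0.0, Wh @ x + Uh @ h_prev)  # ReLU candidate state
    return z * h_prev + (1.0 - z) * h_cand          # interpolated new state
```

Dropping the reset gate removes one matrix pair per layer, which is where the efficiency gain over a standard two-gate GRU comes from.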

Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks

A phase-sensitive objective function based on the signal-to-noise ratio (SNR) of the reconstructed signal is developed, and experiments show that it yields uniformly better results in terms of signal-to-distortion ratio (SDR).

Contaminated speech training methods for robust DNN-HMM distant speech recognition

This paper revisits this classical approach in the context of modern DNN-HMM systems and proposes the adoption of three methods, namely asymmetric context windowing, close-talk based supervision, and close-talk based pre-training; experiments show a significant advantage in using these three methods.

Hybrid speech recognition with Deep Bidirectional LSTM

The hybrid approach with DBLSTM appears to be well suited for tasks where acoustic modelling predominates, and the improvement in word error rate over the deep network is modest, despite a great increase in frame-level accuracy.

Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks

Several integration architectures are proposed and tested, including a pipeline architecture of LSTM-based SE and ASR with sequence training, an alternating estimation architecture, and a multi-task hybrid LSTM network architecture.

Speech recognition with deep recurrent neural networks

This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.

The MERL/MELCO/TUM system for the REVERB Challenge using Deep Recurrent Neural Network Feature Enhancement

The proposed ASR system with eight-channel input and feature enhancement achieves average word error rates (WERs) of 7.75% and 20.09% on the simulated and real evaluation sets, which is a drastic improvement over the Challenge baseline.

RNNDROP: A novel dropout for RNNS in ASR

Recently, recurrent neural networks (RNN) have achieved the state-of-the-art performance in several applications that deal with temporal data, e.g., speech recognition, handwriting recognition and

Speech Enhancement with LSTM Recurrent Neural Networks and its Application to Noise-Robust ASR

It is demonstrated that LSTM speech enhancement, even when used 'naively' as front-end processing, delivers competitive results on the CHiME-2 speech recognition task.

Long short-term memory recurrent neural network architectures for large scale acoustic modeling

The first distributed training of LSTM RNNs using asynchronous stochastic gradient descent on a large cluster of machines is introduced, and it is shown that a two-layer deep LSTM RNN, where each LSTM layer has a linear recurrent projection layer, can exceed state-of-the-art speech recognition performance.