Densely Connected Networks for Conversational Speech Recognition

Kyu J. Han, Akshay Chandrashekaran, Jungsuk Kim, Ian R. Lane
In this paper we show how we have achieved state-of-the-art performance on the industry-standard NIST 2000 Hub5 English evaluation set. We propose densely connected LSTMs (namely, dense LSTMs), inspired by the densely connected convolutional neural networks recently introduced for image classification tasks. We show that the proposed dense LSTMs provide more reliable performance than conventional residual LSTMs as more LSTM layers are stacked in neural networks. With… 
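The dense connectivity pattern at the heart of the proposed architecture can be sketched in a few lines of numpy. This is a toy illustration, not the paper's model: each LSTM layer is stood in for by a random linear map plus tanh, and the names `toy_layer` and `dense_stack` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_layer(x, out_dim):
    """Stand-in for one LSTM layer: fixed random linear map + tanh."""
    w = rng.standard_normal((x.shape[-1], out_dim)) * 0.1
    return np.tanh(x @ w)

def dense_stack(x, num_layers=4, out_dim=8):
    """Dense connectivity: each layer consumes the concatenation of the
    original input and ALL previous layer outputs (DenseNet-style),
    instead of only the preceding layer's output."""
    feats = [x]
    for _ in range(num_layers):
        h = toy_layer(np.concatenate(feats, axis=-1), out_dim)
        feats.append(h)
    return np.concatenate(feats, axis=-1)

x = rng.standard_normal((10, 16))   # 10 frames, 16-dim features
y = dense_stack(x)                  # output width: 16 + 4 * 8 = 48
print(y.shape)                      # (10, 48)
```

The feature width grows with depth because every layer's output is carried forward, which is what distinguishes dense connections from additive residual ones.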


ASAPP-ASR: Multistream CNN and Self-Attentive SRU for SOTA Speech Recognition

In this paper we present state-of-the-art (SOTA) performance on the LibriSpeech corpus with two novel neural network architectures, a multistream CNN for acoustic modeling and a self-attentive simple recurrent unit (SRU).

Signal Processing for Communication Understanding and Behavior Analysis (SCUBA)

This work introduces a speaker diarization system that can directly integrate lexical as well as acoustic information into a speaker clustering process and proposes an adjacency matrix integration technique to integrate word level speaker turn probabilities with speaker embeddings in a comprehensive way.

Speaker Diarization With Lexical Information

This work proposes a speaker diarization system that can incorporate word-level speaker turn probabilities with speaker embeddings into a speaker clustering process to improve overall diarization accuracy, and introduces an adjacency matrix integration technique for spectral clustering.
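The integration step described above can be sketched with numpy: combine an acoustic affinity matrix (cosine similarity of speaker embeddings) with a lexically derived same-speaker probability matrix before spectral clustering. The weighted-sum combination and the name `integrate_affinities` are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def cosine_affinity(emb):
    """Pairwise cosine similarity between segment embeddings."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return normed @ normed.T

def integrate_affinities(acoustic, lexical, alpha=0.5):
    """Combine the acoustic affinity matrix with a word-level
    same-speaker probability matrix; the result would be fed to
    spectral clustering. alpha is an illustrative mixing weight."""
    assert acoustic.shape == lexical.shape
    return alpha * acoustic + (1.0 - alpha) * lexical

rng = np.random.default_rng(1)
emb = rng.standard_normal((5, 32))     # 5 segments, 32-dim embeddings
turn_prob = np.full((5, 5), 0.5)       # toy lexical same-speaker probs
A = integrate_affinities(cosine_affinity(emb), turn_prob)
print(A.shape)                         # (5, 5), symmetric
```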

Earnings-21: A Practical Benchmark for ASR in the Wild

It is found that ASR accuracy for certain NER categories is poor, presenting a significant impediment to transcript comprehension and usage.

Deep Learning for Human Affect Recognition: Insights and New Developments

This paper reviews the literature on human affect recognition between 2010 and 2017, with a special focus on approaches using deep neural networks, and finds that deep learning is used for learning spatial feature representations, temporal feature representations, and joint feature representations for multimodal sensor data.

Utterance-level Permutation Invariant Training with Latency-controlled BLSTM for Single-channel Multi-talker Speech Separation

Latency-controlled BLSTM (LC-BLSTM) is used during inference to achieve low-latency, high-performance speech separation, and it is found that inter-chunk speaker tracing (ST) can further improve the separation performance of uPIT-LC-BLSTM.
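The utterance-level permutation invariant training (uPIT) criterion named in the title can be sketched directly: the loss for every assignment of estimated streams to reference speakers is computed over the whole utterance, and the best assignment is kept. MSE is used here as an illustrative per-source loss; the function name is an assumption.

```python
import numpy as np
from itertools import permutations

def upit_loss(est, ref):
    """Utterance-level PIT: evaluate the loss under every permutation
    of estimated sources against the references, return the minimum."""
    n_src = est.shape[0]
    return min(
        np.mean((est[list(p)] - ref) ** 2)
        for p in permutations(range(n_src))
    )

rng = np.random.default_rng(2)
ref = rng.standard_normal((2, 100))   # 2 speakers, 100 frames each
est = ref[::-1]                       # perfect estimates, swapped order
print(upit_loss(est, ref))            # 0.0: the permutation is resolved
```

Resolving the permutation once per utterance (rather than per frame) is what keeps speaker assignments consistent over time.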



The Microsoft 2017 Conversational Speech Recognition System

We describe the latest version of Microsoft's conversational speech recognition system for the Switchboard and CallHome domains. The system adds a CNN-BLSTM acoustic model to the set of model…

Achieving Human Parity in Conversational Speech Recognition

The human error rate on the widely used NIST 2000 test set is measured, and the latest automated speech recognition system is found to edge past this human benchmark, reaching human parity and establishing a new state of the art.

An improved residual LSTM architecture for acoustic modeling

This paper proposes several types of residual LSTM architectures for acoustic modeling that achieve good results on the THCHS-30, LibriSpeech and Switchboard corpora, and shows that this architecture yields more than 8% relative reduction in Phone Error Rate (PER) on TIMIT tasks.
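The residual idea this entry builds on can be shown in a toy numpy sketch: each layer adds its transform to its own input, so the stack has an identity path. As above, an LSTM layer is stood in for by a random linear map plus tanh, and the names are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

def toy_layer(x):
    """Stand-in for one LSTM layer (same width in and out)."""
    w = rng.standard_normal((x.shape[-1], x.shape[-1])) * 0.1
    return np.tanh(x @ w)

def residual_stack(x, num_layers=8):
    """Residual connectivity: each layer ADDS its transform to its
    input, giving gradients an identity path through the stack."""
    for _ in range(num_layers):
        x = x + toy_layer(x)
    return x

x = rng.standard_normal((10, 16))
y = residual_stack(x)
print(y.shape)   # (10, 16): width is preserved, unlike dense concatenation
```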

Language modeling with highway LSTM

Experimental results on English broadcast news and conversational telephone speech recognition show that the proposed HW-LSTM LM improves speech recognition accuracy on top of a strong LSTM LM baseline.

Deep Learning-Based Telephony Speech Recognition in the Wild

This paper explores the effectiveness of a variety of Deep Learning-based acoustic models for conversational telephony speech, specifically TDNN, bLSTM and CNN-bLSTM models, and performs an error analysis on the real-world data and highlights the areas where speech recognition still has challenges.

Highway-LSTM and Recurrent Highway Networks for Speech Recognition

Novel Highway-LSTM models with bottleneck skip connections are experimented with, and it is shown that a 10-layer model can outperform a state-of-the-art 5-layer LSTM model with the same number of parameters by 2% relative WER.
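The highway connection that gives these models their name can be sketched in numpy: a learned transform gate T interpolates between the layer transform H(x) and the identity path. This is a minimal sketch with illustrative names; the bottleneck variant would additionally insert a low-rank projection into H, omitted here for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_step(x, w_h, w_t):
    """Highway connection: y = T * H(x) + (1 - T) * x, where the
    transform gate T is learned and the carry gate is 1 - T."""
    h = np.tanh(x @ w_h)           # candidate transform H(x)
    t = sigmoid(x @ w_t)           # transform gate T(x), in (0, 1)
    return t * h + (1.0 - t) * x

rng = np.random.default_rng(4)
x = rng.standard_normal((10, 16))
w_h = rng.standard_normal((16, 16)) * 0.1
w_t = rng.standard_normal((16, 16)) * 0.1
y = highway_step(x, w_h, w_t)
print(y.shape)   # (10, 16)
```

Unlike a plain residual connection, the gate lets the network learn per-unit how much of the input to carry through unchanged, which is what makes very deep stacks trainable.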

An Exploration of Dropout with LSTMs

This paper describes extensive experiments investigating the best way to combine dropout with LSTMs, specifically projected LSTMs (LSTMP), giving consistent improvements in WER across a range of datasets, including Switchboard, TED-LIUM and AMI.
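One design axis such experiments explore is whether a dropout mask is shared across all time frames of a sequence or redrawn per frame. A toy numpy sketch of the two variants (illustrative, not the paper's exact recipe):

```python
import numpy as np

def dropout(x, p, rng, per_sequence=False):
    """Inverted dropout on a (frames, dims) activation matrix.
    per_sequence=True shares ONE mask across all time frames,
    so each dimension is either kept or zeroed for the whole utterance."""
    shape = (1, x.shape[1]) if per_sequence else x.shape
    mask = (rng.random(shape) >= p) / (1.0 - p)   # scale kept units by 1/(1-p)
    return x * mask

rng = np.random.default_rng(5)
h = np.ones((100, 16))                      # 100 frames, 16-dim projection
d = dropout(h, p=0.5, rng=rng, per_sequence=True)
print(d.shape)                              # (100, 16)
```

With the shared mask each column of `d` is constant over time (0 or 2 for unit inputs at p = 0.5); the per-frame variant would redraw the mask at every frame.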

Highway long short-term memory RNNS for distant speech recognition

This paper extends the deep long short-term memory (DLSTM) recurrent neural networks by introducing gated direct connections between memory cells in adjacent layers, and introduces the latency-controlled bidirectional LSTMs (BLSTMs) which can exploit the whole history while keeping the latency under control.

Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI

A method to perform sequence-discriminative training of neural network acoustic models without the need for frame-level cross-entropy pre-training is described, using the lattice-free version of the maximum mutual information (MMI) criterion: LF-MMI.

Speaker adaptation of context dependent deep neural networks

  • H. Liao, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013
This work explores how deep neural networks may be adapted to speakers by re-training the input layer, the output layer or the entire network, and looks at how L2 regularization, using weight decay toward the speaker-independent model, improves generalization.
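The regularized adaptation idea, weight decay toward the speaker-independent (SI) weights rather than toward zero, can be sketched as a plain gradient update. This is a toy numpy sketch with illustrative names and a stand-in gradient sequence, not the paper's training setup.

```python
import numpy as np

def adapt_layer(w_si, grads, lam=0.01, lr=0.1):
    """Adapt one layer's weights to a speaker. The penalty
    (lam / 2) * ||w - w_si||^2 adds lam * (w - w_si) to each gradient,
    pulling the adapted weights back toward the SI model."""
    w = w_si.copy()
    for g in grads:                      # one toy gradient per minibatch
        w -= lr * (g + lam * (w - w_si))
    return w

rng = np.random.default_rng(6)
w_si = rng.standard_normal((16, 8))      # speaker-independent input layer
grads = [rng.standard_normal((16, 8)) * 0.01 for _ in range(5)]
w_spk = adapt_layer(w_si, grads)
# with small adaptation data, the adapted weights stay near the SI model
print(np.linalg.norm(w_spk - w_si) < np.linalg.norm(w_si))
```

Regularizing toward the SI weights (instead of zero) is what keeps adaptation from overfitting the limited per-speaker data.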