Acoustic Modeling with Densely Connected Residual Network for Multichannel Speech Recognition

@inproceedings{Tang2018AcousticMW,
  title={Acoustic Modeling with Densely Connected Residual Network for Multichannel Speech Recognition},
  author={Jian Tang and Yan Song and Lirong Dai and Ian Mcloughlin},
  booktitle={INTERSPEECH},
  year={2018}
}
Motivated by recent advances in computer vision research, this paper proposes a novel acoustic model called Densely Connected Residual Network (DenseRNet) for multichannel speech recognition. By concatenating the feature maps of all preceding layers as inputs, DenseRNet not only strengthens gradient back-propagation, mitigating the vanishing-gradient problem, but also exploits multi-resolution feature maps. Preliminary experimental results on CHiME-3 have shown that DenseRNet achieves a word error…
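
The dense connectivity described in the abstract, where each layer receives the concatenated outputs of all preceding layers, can be illustrated with a toy sketch. This is not the DenseRNet architecture itself: `dense_block`, `make_toy_layer`, and `growth_rate` are hypothetical names, a feature map is reduced to a flat list of channel values, and each "layer" is a stand-in function rather than a convolution.

```python
def dense_block(x, layers):
    """Run a toy dense block.

    x      -- list of channel values (a stand-in for a feature map)
    layers -- list of callables; each maps the concatenated input
              channels to a list of new channels
    """
    features = list(x)
    for layer in layers:
        new_channels = layer(features)      # the layer sees ALL earlier outputs
        features = features + new_channels  # concatenate along the channel axis
    return features


def make_toy_layer(growth_rate):
    """Hypothetical layer: emits `growth_rate` channels, each the mean of its input."""
    def layer(channels):
        mean = sum(channels) / len(channels)
        return [mean] * growth_rate
    return layer


# Assumed toy configuration: 2 input channels, 3 layers, 2 new channels per layer.
block = [make_toy_layer(growth_rate=2) for _ in range(3)]
out = dense_block([1.0, 3.0], block)
print(len(out))  # 2 initial channels + 3 layers * 2 new channels = 8
```

Because every layer's output stays in the concatenated list, later layers have a direct path to early features, which is the property the abstract credits with easing gradient back-propagation.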

Deep Residual-Dense Lattice Network for Speech Enhancement

TLDR
The residual-dense lattice network (RDL-Net), which is a new CNN for speech enhancement that employs both residual and dense aggregations without over-allocating parameters for feature re-usage, is proposed.

Multi-Quartznet: Multi-Resolution Convolution for Speech Recognition with Multi-Layer Feature Fusion

TLDR
The proposed Multi-QuartzNet model achieves a CER of 6.77% on the AISHELL-1 dataset, outperforming the original QuartzNet and approaching the state-of-the-art result.

Jasper: An End-to-End Convolutional Neural Acoustic Model

TLDR
This paper reports state-of-the-art results on LibriSpeech among end-to-end speech recognition models without any external training data and introduces a new layer-wise optimizer called NovoGrad to improve training.

MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition

TLDR
MatchboxNet is a deep residual network composed of blocks of 1D time-channel separable convolution, batch-normalization, ReLU, and dropout layers. It reaches state-of-the-art accuracy on the Google Speech Commands dataset while having significantly fewer parameters than similar models.
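
The parameter savings attributed to 1D time-channel separable convolution in the summary above can be checked with simple arithmetic. The sketch below assumes the usual factorization into a depthwise convolution over time followed by a pointwise (1x1) convolution across channels, and ignores biases. The function names and the layer sizes (kernel 13, 64 channels) are illustrative assumptions, not values taken from the MatchboxNet paper.

```python
def conv1d_params(kernel_size, in_channels, out_channels):
    """Weights in a standard 1D convolution (biases ignored)."""
    return kernel_size * in_channels * out_channels


def separable_conv1d_params(kernel_size, in_channels, out_channels):
    """Weights in a time-channel separable 1D convolution (biases ignored)."""
    depthwise = kernel_size * in_channels   # one temporal filter per input channel
    pointwise = in_channels * out_channels  # 1x1 convolution mixing channels
    return depthwise + pointwise


# Hypothetical layer sizes chosen only for illustration.
standard = conv1d_params(13, 64, 64)
separable = separable_conv1d_params(13, 64, 64)
print(standard, separable)  # 53248 4928
```

For these assumed sizes the separable form needs roughly a tenth of the weights, and the pointwise term dominates what remains.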

Compressing 1D Time-Channel Separable Convolutions using Sparse Random Ternary Matrices

We demonstrate that 1x1-convolutions in 1D time-channel separable convolutions may be replaced by constant, sparse random ternary matrices with weights in {−1, 0, +1}. Such layers do not perform any

References

The NTT CHiME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices

TLDR
NTT's CHiME-3 system, which integrates advanced speech enhancement and recognition techniques, is described; it achieves a 3.45% development error rate and a 5.83% evaluation error rate.

Neural network based spectral mask estimation for acoustic beamforming

TLDR
A neural-network-based approach to acoustic beamforming is presented: the network estimates spectral masks, from which the cross-power spectral density matrices of speech and noise are estimated; these matrices are then used to compute the beamformer coefficients.
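
The middle step of the pipeline summarized above, turning a spectral mask into a cross-power spectral density (PSD) matrix, can be sketched for a single frequency bin. This is a minimal illustration under stated assumptions, not the cited paper's implementation: `masked_psd` is a hypothetical helper, the masks are given by hand rather than predicted by a network, and the final computation of beamformer coefficients is left out.

```python
def masked_psd(frames, mask):
    """Mask-weighted PSD matrix for one frequency bin.

    frames -- list of T observation vectors, each a list of C complex STFT values
    mask   -- list of T weights in [0, 1] (e.g. a speech or noise mask)
    Returns a C x C nested list: sum_t m_t * y_t * y_t^H / sum_t m_t.
    """
    total = sum(mask)
    if total == 0:
        raise ValueError("mask sums to zero; PSD is undefined")
    num_channels = len(frames[0])
    psd = [[0j] * num_channels for _ in range(num_channels)]
    for y, m in zip(frames, mask):
        for i in range(num_channels):
            for j in range(num_channels):
                psd[i][j] += m * y[i] * y[j].conjugate()
    return [[value / total for value in row] for row in psd]


# Toy data (assumed): two microphones, two time frames.
frames = [[1 + 0j, 1j], [2 + 0j, 0j]]
speech_psd = masked_psd(frames, [1.0, 0.0])  # mask selects the first frame
noise_psd = masked_psd(frames, [0.0, 1.0])   # mask selects the second frame
```

The two resulting matrices are Hermitian and play the roles of the speech and noise PSD matrices from which a beamformer would then be derived.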

Wide Residual BLSTM Network with Discriminative Speaker Adaptation for Robust Speech Recognition

TLDR
A single system for the 4th CHiME challenge, i.e., a configuration which does not combine multiple systems, is presented; it significantly improves performance on all three tracks relative to the provided baseline system and is independent of the microphone configuration.

Beamnet: End-to-end training of a beamformer-supported multi-channel ASR system

This paper presents an end-to-end training approach for a beamformer-supported multi-channel ASR system. A neural network which estimates masks for a statistically optimum beamformer is jointly

Densely Connected Convolutional Networks

TLDR
The Dense Convolutional Network (DenseNet) connects each layer to every other layer in a feed-forward fashion and has several compelling advantages: it alleviates the vanishing-gradient problem, strengthens feature propagation, encourages feature reuse, and substantially reduces the number of parameters.

On time-frequency mask estimation for MVDR beamforming with application in robust speech recognition

TLDR
This paper focuses on the TF mask estimation using recurrent neural networks (RNN) and shows that the proposed methods improve the ASR performance individually and also work complementarily.

A speech enhancement algorithm by iterating single- and multi-microphone processing and its application to robust ASR

TLDR
The core of the algorithm estimates a time-frequency mask representing the target speech and uses mask-based beamforming to enhance corrupted speech; a mask-based post-filter is proposed to further suppress noise in the beamformer output.

The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines

TLDR
The design and outcomes of the 3rd CHiME Challenge, which targets the performance of automatic speech recognition in a real-world, commercially-motivated scenario: a person talking to a tablet device that has been fitted with a six-channel microphone array, are presented.

Deep Residual Learning for Image Recognition

TLDR
This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.

Optimizing neural-network supported acoustic beamforming by algorithmic differentiation

TLDR
The tools developed in this paper are a key component for an end-to-end optimization of speech enhancement and speech recognition.