Building state-of-the-art distant speech recognition using the CHiME-4 challenge with a setup of speech enhancement baseline

@article{Chen2018BuildingSD,
  title={Building state-of-the-art distant speech recognition using the CHiME-4 challenge with a setup of speech enhancement baseline},
  author={Szu-Jui Chen and Aswin Shanmugam Subramanian and Hainan Xu and Shinji Watanabe},
  journal={ArXiv},
  year={2018},
  volume={abs/1803.10109}
}
This paper describes a new baseline system for automatic speech recognition (ASR) in the CHiME-4 challenge, intended to promote the development of noisy ASR in the speech processing community by providing 1) a state-of-the-art yet simplified single system comparable to the complicated top systems in the challenge, and 2) a publicly available and reproducible recipe in the main repository of the Kaldi speech recognition toolkit. The proposed system adopts generalized eigenvalue beamforming with…
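The generalized eigenvalue (GEV) beamformer named in the abstract chooses the weight vector that maximizes the output SNR, i.e. the principal eigenvector of the generalized eigenproblem between the speech and noise spatial covariance matrices. A minimal sketch on toy covariances, assuming NumPy/SciPy (the matrices, channel count, and variable names here are illustrative, not taken from the paper):

```python
# Minimal GEV beamformer sketch: solve phi_xx w = lambda * phi_nn w and
# keep the eigenvector of the largest eigenvalue, which maximizes the
# output SNR  w^H phi_xx w / w^H phi_nn w.
import numpy as np
from scipy.linalg import eigh

def gev_weights(phi_xx, phi_nn):
    """phi_xx: speech spatial covariance, phi_nn: noise spatial covariance
    (both Hermitian, phi_nn positive definite). Returns GEV weights."""
    eigvals, eigvecs = eigh(phi_xx, phi_nn)  # generalized Hermitian eigenproblem
    return eigvecs[:, -1]                    # eigenvalues are ascending

# Toy example: 4 microphones, rank-1 "speech" covariance plus correlated noise.
rng = np.random.default_rng(0)
steering = rng.standard_normal(4) + 1j * rng.standard_normal(4)
phi_xx = np.outer(steering, steering.conj())   # speech covariance (rank 1)
phi_nn = np.eye(4) + 0.1 * np.ones((4, 4))     # noise covariance (PD)
w = gev_weights(phi_xx, phi_nn)
snr = np.real(w.conj() @ phi_xx @ w) / np.real(w.conj() @ phi_nn @ w)
```

In practice the covariances are estimated per frequency bin from masked STFT frames, and a post-filter is usually applied to compensate the arbitrary scaling of the eigenvector.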


Investigation of Practical Aspects of Single Channel Speech Separation for ASR
TLDR
This paper investigates a two-stage training scheme that applies a feature-level optimization criterion for pre-training, followed by an ASR-oriented optimization criterion using an end-to-end (E2E) speech recognition model, and introduces a modified teacher-student learning technique for model compression to keep the model lightweight.
Multi-Variant Consistency based Self-supervised Learning for Robust Automatic Speech Recognition
TLDR
Robust ASR is addressed by introducing a multi-variant consistency (MVC) based SSL method that adapts to different environments and can achieve up to 30% relative word error rate reductions over the baseline wav2vec2.0, one of the most successful SSL methods for ASR.
DenseNet BLSTM for Acoustic Modeling in Robust ASR
TLDR
The DenseNet topology is modified into a feature extractor for the subsequent BLSTM network operating on whole speech utterances, and is able to consistently outperform a top-performing baseline based on wide residual networks and BLSTMs, providing a 2.4% relative WER reduction on the real test set.
End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation
TLDR
The proposed end-to-end (E2E) ASR model targeting robust speech recognition with enhanced speech Input for Self-supervised learning representation (IRIS) achieves the best performance reported in the literature for the single-channel CHiME-4 benchmark.
A Speech Enhancement Neural Network Architecture with SNR-Progressive Multi-Target Learning for Robust Speech Recognition
TLDR
The proposed LSTM-based PMT network, with the best configuration, outperforms the PRM-only model with a relative WER reduction of 13.31% (further down to 22.48%) averaged over the same test set.
Joint Training of Complex Ratio Mask Based Beamformer and Acoustic Model for Noise Robust Asr
  • Yong Xu, Chao Weng, Dong Yu
  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
TLDR
The complex ratio mask (CRM) is proposed to estimate the covariance matrix for the beamformer and a long short-term memory (LSTM) based language model is utilized to re-score hypotheses which further improves the overall performance.
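Mask-driven beamforming, as in this work, estimates the spatial covariance matrices by weighting the multichannel STFT with a time-frequency mask before accumulating outer products; the complex ratio mask refines that weighting step. A toy sketch of the generic mask-weighted covariance estimate (real-valued mask and random data here are a simplification for illustration, not the paper's CRM):

```python
# Sketch of mask-weighted spatial covariance estimation for one frequency
# bin: phi = sum_t m[t] * x_t x_t^H / sum_t m[t].
import numpy as np

def masked_covariance(obs, mask):
    """obs: (channels, frames) complex STFT observations for one bin;
    mask: (frames,) weights in [0, 1]. Returns the weighted covariance."""
    weighted = obs * mask                    # scale each frame by its mask value
    return (weighted @ obs.conj().T) / np.maximum(mask.sum(), 1e-8)

rng = np.random.default_rng(1)
obs = rng.standard_normal((4, 100)) + 1j * rng.standard_normal((4, 100))
mask = rng.uniform(size=100)                 # stand-in for a network-predicted mask
phi = masked_covariance(obs, mask)           # Hermitian (channels x channels)
```

Using the speech mask yields the target covariance and (1 - mask) the noise covariance, which are exactly the inputs a GEV or MVDR beamformer consumes.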
Improving Character Error Rate Is Not Equal to Having Clean Speech: Speech Enhancement for ASR Systems with Black-box Acoustic Models
TLDR
A deep neural network (DNN)-based speech enhancement (SE) front-end aiming to maximize the performance of an automatic speech recognition (ASR) system is proposed, using two DNNs: one for speech processing and one for mimicking the character error rates (CERs) derived through an acoustic model (AM).
A Progressive Learning Approach to Adaptive Noise and Speech Estimation for Speech Enhancement and Noisy Speech Recognition
In this paper, we propose a progressive learning-based adaptive noise and speech estimation (PL-ANSE) method for speech preprocessing in noisy speech recognition, leveraging upon a frame-level noise
Improving Noise Robust Automatic Speech Recognition with Single-Channel Time-Domain Enhancement Network
TLDR
It is shown that a single-channel time-domain denoising approach can significantly improve ASR performance, providing more than 30% relative word error reduction over a strong ASR back-end on the real evaluation data of the single-channel track of the CHiME-4 dataset.
A Cross-Entropy-Guided (CEG) Measure for Speech Enhancement Front-End Assessing Performances of Back-End Automatic Speech Recognition
TLDR
A novel cross-entropy-guided (CEG) measure is proposed for assessing if enhanced speech predicted by a speech enhancement algorithm would produce a good performance for robust ASR and could be adopted to guide the parameter optimization of deep learning based speech enhancement algorithms to further improve the ASR performance.

References

Showing 1-10 of 32 references
Multi-Channel Speech Recognition: LSTMs All the Way Through
TLDR
An LSTM "triple threat" system for speech recognition, where LSTMs drive the three main subsystems (microphone array processing, acoustic modeling, and language modeling), is applied to the CHiME-4 distant recognition challenge.
The NTT CHiME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices
TLDR
NTT's CHiME-3 system is described, integrating advanced speech enhancement and recognition techniques and achieving a 3.45% development error rate and a 5.83% evaluation error rate.
The RWTH/UPB/FORTH System Combination for the 4th CHiME Challenge Evaluation
TLDR
This paper describes automatic speech recognition systems developed jointly by RWTH, UPB and FORTH for the 1ch, 2ch and 6ch tracks of the 4th CHiME Challenge, and compares the ASR performance of different beamforming approaches.
The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines
TLDR
The 5th CHiME Challenge is introduced, which considers the task of distant multi-microphone conversational ASR in real home environments and describes the data collection procedure, the task, and the baseline systems for array synchronization, speech enhancement, and conventional and end-to-end ASR.
The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines
TLDR
The design and outcomes of the 3rd CHiME Challenge, which targets the performance of automatic speech recognition in a real-world, commercially-motivated scenario: a person talking to a tablet device that has been fitted with a six-channel microphone array, are presented.
A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research
TLDR
The REVERB challenge is described, which is an evaluation campaign that was designed to evaluate such speech enhancement and ASR techniques to reveal the state-of-the-art techniques and obtain new insights regarding potential future research directions.
Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI
TLDR
A method to perform sequence-discriminative training of neural network acoustic models without the need for frame-level cross-entropy pre-training is described, using the lattice-free version of the maximum mutual information (MMI) criterion: LF-MMI.
An analysis of environment, microphone and data simulation mismatches in robust speech recognition
The second ‘CHiME’ speech separation and recognition challenge: Datasets, tasks and baselines
TLDR
This paper is intended to be a reference on the 2nd `CHiME' Challenge, an initiative designed to analyze and evaluate the performance of ASR systems in a real-world domestic environment.