ESPnet-SE: End-To-End Speech Enhancement and Separation Toolkit Designed for ASR Integration

@article{Li2021ESPnetSEES,
  title={ESPnet-SE: End-To-End Speech Enhancement and Separation Toolkit Designed for ASR Integration},
  author={Chenda Li and Jing Shi and Wangyou Zhang and Aswin Shanmugam Subramanian and Xuankai Chang and Naoyuki Kamo and Moto Hira and Tomoki Hayashi and Christoph Boeddeker and Zhuo Chen and Shinji Watanabe},
  journal={2021 IEEE Spoken Language Technology Workshop (SLT)},
  year={2021},
  pages={785-792}
}
We present ESPnet-SE, which is designed for the quick development of speech enhancement and speech separation systems in a single framework, along with an optional downstream speech recognition module. ESPnet-SE is a new project that integrates rich automatic speech recognition (ASR) related models, resources, and systems to support and validate the proposed front-end implementation (i.e., speech enhancement and separation). It is capable of processing both single-channel and multi-channel data, with…
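As a quick illustration of how the toolkit is used at inference time, the sketch below runs a trained enhancement/separation model through espnet2's SeparateSpeech wrapper. The checkpoint paths are placeholders, and keyword-argument names have varied across ESPnet releases (older versions used enh_train_config/enh_model_file), so treat this as an assumed-interface sketch rather than canonical usage.

# Minimal sketch, assuming a trained ESPnet-SE checkpoint; paths are placeholders.
import soundfile
from espnet2.bin.enh_inference import SeparateSpeech

separate_speech = SeparateSpeech(
    train_config="exp/enh_train/config.yaml",        # placeholder path
    model_file="exp/enh_train/valid.loss.best.pth",  # placeholder path
    normalize_output_wav=True,
    device="cpu",
)

mixwav, sr = soundfile.read("mixture.wav")  # single-channel mixture
# Input is batched as (batch, samples); the wrapper returns a list with one
# waveform per separated source.
waves = separate_speech(mixwav[None, :], fs=sr)
for i, w in enumerate(waves):
    soundfile.write(f"separated_spk{i}.wav", w.squeeze(), sr)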

Citations

ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding
TLDR
Results show that the integration of SE front-ends with back-end tasks is a promising research direction even for tasks besides ASR, especially in the multi-channel scenario.
The 2020 ESPnet Update: New Features, Broadened Applications, Performance Improvements, and Future Plans
TLDR
The recent development of ESPnet, an end-to-end speech processing toolkit, is described; it includes text-to-speech (TTS), voice conversion (VC), speech translation (ST), and speech enhancement (SE) with support for beamforming, speech separation, denoising, and dereverberation.
ESPnet-SLU: Advancing Spoken Language Understanding through ESPnet
TLDR
This work enhances the toolkit to provide implementations for various SLU benchmarks that enable researchers to seamlessly mix and match different ASR and NLU models, and provides pretrained models with intensively tuned hyperparameters that can match or even outperform current state-of-the-art performance.
ESPnet-ST IWSLT 2021 Offline Speech Translation System
TLDR
The ESPnet-ST group’s IWSLT 2021 submission in the offline speech translation track is described; it adopts the Conformer encoder and the Multi-Decoder architecture, which equips dedicated decoders for the speech recognition and translation tasks in a unified encoder-decoder model and enables search in both source and target language spaces during inference.
Dual-Path Modeling for Long Recording Speech Separation in Meetings
  • Chenda Li, Zhuo Chen, Y. Qian
  • Computer Science
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
TLDR
A transformer-based dual-path system is proposed that integrates transformer layers for global modeling, reducing computation by 30% while improving WER; online-processing dual-path models are also investigated and show a 10% relative WER reduction compared to the baseline.
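To make the dual-path idea concrete, here is a hedged sketch (illustrative names and shapes, not the paper's code) of the chunking step that splits long-sequence modeling into a local and a global stage:

# Fold a long (T, F) feature sequence into overlapping chunks so that an
# intra-chunk model handles local context and an inter-chunk model handles
# global context; attention cost drops from O(T^2) to roughly O(T*K)
# intra-chunk plus O(T^2/K) inter-chunk for chunk size K.
import numpy as np

def segment(features, chunk=250, hop=125):
    """Return (num_chunks, chunk, F) overlapping chunks of a (T, F) input."""
    T, F = features.shape
    assert T >= chunk, "sketch assumes the sequence is longer than one chunk"
    pad = -(T - chunk) % hop  # zero-pad so chunks tile the sequence exactly
    x = np.pad(features, ((0, pad), (0, 0)))
    n = (x.shape[0] - chunk) // hop + 1
    return np.stack([x[i * hop : i * hop + chunk] for i in range(n)])

feats = np.random.randn(16000, 64)  # features of a long recording
chunks = segment(feats)             # shape: (127, 250, 64)
# Dual-path blocks then alternate: a transformer layer over axis 1 (within
# each chunk) and another over axis 0 (across chunks at each position).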
SNRi Target Training for Joint Speech Enhancement and Recognition
TLDR
This study proposes “signal-to-noise ratio improvement (SNRi) target training”: the SE frontend is trained to output a signal whose SNRi is controlled by an auxiliary scalar input, and the jointly trained network is observed to automatically control the target SNRi according to noise characteristics.
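For reference, SNRi is conventionally defined as the enhanced signal's SNR minus the noisy input's SNR, both measured against the clean reference; a minimal sketch with illustrative names (not the paper's code):

import numpy as np

def snr_db(reference, estimate):
    # SNR of `estimate` w.r.t. `reference`, treating the residual as noise.
    residual = estimate - reference
    return 10.0 * np.log10(np.sum(reference**2) / np.sum(residual**2))

def snri_db(clean, noisy, enhanced):
    # SNR improvement: output SNR minus input SNR.
    return snr_db(clean, enhanced) - snr_db(clean, noisy)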
An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition
TLDR
This paper focuses on general applications of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models, and explores further scenarios in which the pretrained representations are effective, such as cross-lingual and overlapped speech.
Closing the Gap Between Time-Domain Multi-Channel Speech Enhancement on Real and Simulation Conditions
TLDR
The experiments on the CHiME-4 corpus show that the proposed approaches can greatly reduce the speech recognition performance discrepancy between simulation and real data, while preserving the strong speech enhancement capability in the frontend.
Towards Low-distortion Multi-channel Speech Enhancement: The ESPNet-SE Submission to The L3DAS22 Challenge
TLDR
The proposed method, which combines deep neural network (DNN) driven complex spectral mapping with linear beamformers such as the multi-frame multi-channel Wiener filter, was ranked first in the challenge, achieving a ranking metric of 0.984 versus 0.833 for the challenge baseline.
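The paper's multi-frame multi-channel Wiener filter generalizes the simpler narrowband multichannel Wiener filter; the hedged sketch below shows only that simpler filter, computed per frequency bin from estimated speech and noise covariances (names and shapes are illustrative assumptions):

import numpy as np

def mwf(speech_stft, noise_stft, ref=0):
    """MWF weights (channels,) for one frequency bin.

    speech_stft, noise_stft: (channels, frames) complex STFT estimates,
    e.g. produced by DNN-based complex spectral mapping.
    """
    phi_s = speech_stft @ speech_stft.conj().T / speech_stft.shape[1]
    phi_n = noise_stft @ noise_stft.conj().T / noise_stft.shape[1]
    u = np.zeros(phi_s.shape[0])
    u[ref] = 1.0  # select the reference channel
    # MWF solution: w = (Phi_s + Phi_n)^{-1} Phi_s u
    return np.linalg.solve(phi_s + phi_n, phi_s @ u)

# Applied per bin as: enhanced[f, t] = w.conj() @ mixture_stft[f, :, t]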
SkiM: Skipping Memory LSTM for Low-Latency Real-Time Continuous Speech Separation
TLDR
This work proposes a simple yet efficient model named Skipping Memory (SkiM) for long-sequence modeling, which achieves on-par or even better separation performance than DPRNN while reducing computational cost by 75% compared to DPRNN.

References

Showing 1-10 of 55 references
ESPnet: End-to-End Speech Processing Toolkit
TLDR
The major architecture of this software platform, several important functionalities that differentiate ESPnet from other open-source ASR toolkits, and experimental results on major ASR benchmarks are explained.
Onssen: an open-source speech separation and enhancement library
TLDR
The functionality of the modules in onssen is described, and the algorithms implemented by onssen achieve the same performance as reported in the original papers.
Far-Field Location Guided Target Speech Extraction Using End-to-End Speech Recognition Objectives
  • A. Subramanian, Chao Weng, Dong Yu
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
TLDR
This paper proposes a method to jointly optimize a location-guided target speech extraction module together with a speech recognition module using only an ASR error-minimization criterion, and designs a system that can take both location and anchor speech as input at the same time.
Speech Enhancement Using End-to-End Speech Recognition Objectives
TLDR
This paper uses a recently developed multichannel end-to-end (ME2E) system, which integrates neural dereverberation, beamforming, and attention-based speech recognition within a single neural network, and investigates how a system optimized on the ASR objective improves speech enhancement quality on various signal-level metrics in addition to the ASR word error rate (WER) metric.
End-To-End Multi-Speaker Speech Recognition With Transformer
TLDR
This work replaces the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture and incorporates external dereverberation preprocessing based on the weighted prediction error (WPE), enabling the model to handle reverberant signals.
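WPE dereverberation is also available as a standalone package (nara_wpe); the sketch below follows its documented offline usage as I recall it, so array layouts and defaults should be verified against the package README:

import numpy as np
from nara_wpe.wpe import wpe
from nara_wpe.utils import stft, istft

y = np.random.randn(4, 64000)               # (channels, samples) stand-in signal
Y = stft(y, size=512, shift=128)             # -> (channels, frames, freqs)
Y = Y.transpose(2, 0, 1)                     # -> (freqs, channels, frames)
Z = wpe(Y, taps=10, delay=3, iterations=3)   # dereverberated STFT, same layout
z = istft(Z.transpose(1, 2, 0), size=512, shift=128)  # back to waveforms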
An Investigation of End-to-End Multichannel Speech Recognition for Reverberant and Mismatch Conditions
TLDR
This report uses a recently developed architecture for far-field ASR by composing neural extensions of dereverberation and beamforming modules with the S2S ASR module as a single differentiable neural network and also clearly defining the role of each subnetwork.
Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation
  • Yi Luo, N. Mesgarani
  • Computer Science
    IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2019
TLDR
A fully convolutional time-domain audio separation network (Conv-TasNet) is proposed: a deep learning framework for end-to-end time-domain speech separation that significantly outperforms previous time–frequency masking methods in separating two- and three-speaker mixtures.
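Conv-TasNet-style time-domain separators are typically trained with a scale-invariant SNR (SI-SNR) objective rather than a time–frequency mask target; a minimal sketch of SI-SNR with illustrative names:

import numpy as np

def si_snr_db(reference, estimate):
    # Zero-mean both signals, as in the standard SI-SNR definition.
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to get the scaled target.
    s_target = (estimate @ reference) / (reference @ reference) * reference
    e_noise = estimate - s_target
    return 10.0 * np.log10((s_target @ s_target) / (e_noise @ e_noise))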
MIMO-Speech: End-to-End Multi-Channel Multi-Speaker Speech Recognition
TLDR
A novel neural sequence-to-sequence (seq2seq) architecture is proposed, MIMO-Speech, which extends the original seq2seq to deal with multi-channel input and multi-channel output so that it can fully model multi-channel multi-speaker speech separation and recognition.
Continuous Speech Separation: Dataset and Analysis
  • Zhuo Chen, T. Yoshioka, Jinyu Li
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
TLDR
A new real-recording dataset, called LibriCSS, is derived from LibriSpeech by concatenating the corpus utterances to simulate conversations and capturing the audio replays with far-field microphones, which helps researchers develop systems that can be readily applied to real scenarios.
TasNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation
  • Yi Luo, N. Mesgarani
  • Computer Science
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
TLDR
Time-domain Audio Separation Network (TasNet) is proposed, which outperforms the current state-of-the-art causal and noncausal speech separation algorithms, reduces the computational cost of speech separation, and significantly reduces the minimum required latency of the output.