The 2020 ESPnet Update: New Features, Broadened Applications, Performance Improvements, and Future Plans

@inproceedings{Watanabe2021The2E,
  title={The 2020 ESPnet Update: New Features, Broadened Applications, Performance Improvements, and Future Plans},
  author={Shinji Watanabe and Florian Boyer and Xuankai Chang and Pengcheng Guo and Tomoki Hayashi and Yosuke Higuchi and Takaaki Hori and Wen-Chin Huang and Hirofumi Inaguma and Naoyuki Kamo and Shigeki Karita and Chenda Li and Jing Shi and Aswin Shanmugam Subramanian and Wangyou Zhang},
  booktitle={2021 IEEE Data Science and Learning Workshop (DSLW)},
  year={2021},
  pages={1--6}
}
This paper describes the recent development of ESPnet (https://github.com/espnet/espnet), an end-to-end speech processing toolkit. The project was initiated in December 2017, mainly to support end-to-end speech recognition experiments based on sequence-to-sequence modeling. It has since grown rapidly and now covers a wide range of speech processing applications: ESPnet also includes text-to-speech (TTS), voice conversion (VC), speech translation (ST), and speech enhancement (SE) with…

Citations

A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation
TLDR
A comparative study of various NAR modeling methods for end-to-end automatic speech recognition (ASR) provides interesting findings for developing an understanding of NAR ASR, such as the accuracy-speed trade-off and robustness against long-form utterances.
Human Listening and Live Captioning: Multi-Task Training for Speech Enhancement
TLDR
A multi-task training framework that makes monaural speech enhancement models harmless to ASR is proposed; it improves the word error rate for the SE output by 11.82% with little compromise in SE quality.
Domain Adaptation of low-resource Target-Domain models using well-trained ASR Conformer Models
TLDR
Domain adaptation for low-resource automatic speech recognition of target-domain data is investigated for the case where a well-trained ASR model trained on a large dataset is available, and it is shown that applying Spectral Augmentation to the proposed features provides a further improvement in target-domain performance.
Integrate Lattice-Free MMI into End-to-End Speech Recognition
TLDR
Novel algorithms are proposed in this work to integrate another widely used discriminative criterion, lattice-free maximum mutual information (LF-MMI), into E2E ASR systems, not only in the training stage but also in the decoding process.
Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction
TLDR
This work proposes to combine a reconstruction module with contrastive learning and perform multi-task continual pre-training on noisy data to improve the noise robustness of the learned representation; the reconstruction module is not required during inference.
The CPQD-Unicamp system for Blizzard Challenge 2021
TLDR
The CPQD-UNICAMP text-to-speech system for Blizzard Challenge 2021 consists of a bilingual linguistic front-end, an acoustic model based on Tacotron2, and a Parallel WaveGAN neural vocoder.
Investigating Self-supervised Pretraining Frameworks for Pathological Speech Recognition
TLDR
This work investigates self-supervised pretraining frameworks such as the wav2vec 2.0 and WavLM models using different setups and compares their performance with different supervised pretraining setups, using two types of pathological speech, namely Japanese electrolaryngeal and English dysarthric speech.
A Conformer Based Acoustic Model for Robust Automatic Speech Recognition
TLDR
This study addresses robust automatic speech recognition (ASR) by introducing a Conformer-based acoustic model that is 18.3% smaller in model size and reduces training time by 88.5%.
Magic dust for cross-lingual adaptation of monolingual wav2vec-2.0
TLDR
A key finding of this work is that the adapted monolingual wav2vec-2.0 achieves similar performance as the topline multilingual XLSR model, which is trained on fifty-three languages, on the target language ASR task.
A Preliminary Study of a Two-Stage Paradigm for Preserving Speaker Identity in Dysarthric Voice Conversion
TLDR
A novel two-stage approach for DVC is proposed that converts the speaker identity of the reference speech back to that of the patient while preserving the enhanced speech quality, and several design options are investigated.

References

Showing 1-10 of 77 references
Recent Developments on Espnet Toolkit Boosted By Conformer
TLDR
This paper shows the results for a wide range of end-to-end speech processing applications, such as automatic speech recognition (ASR), speech translation (ST), speech separation (SS), and text-to-speech (TTS).
ESPnet-SE: End-To-End Speech Enhancement and Separation Toolkit Designed for ASR Integration
TLDR
The design of the toolkit, several important functionalities, especially the speech recognition integration, which differentiates ESPnet-SE from other open source toolkits, and experimental results with major benchmark datasets are described.
ESPnet: End-to-End Speech Processing Toolkit
TLDR
A major architecture of this software platform, several important functionalities, which differentiate ESPnet from other open source ASR toolkits, and experimental results with major ASR benchmarks are explained.
Espnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit
TLDR
The experimental results show that the ESPnet-TTS models can achieve state-of-the-art performance comparable to the other latest toolkits, resulting in a mean opinion score (MOS) of 4.25 on the LJSpeech dataset.
End-To-End Multi-Speaker Speech Recognition With Transformer
TLDR
This work replaces the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture, and incorporates an external dereverberation preprocessing, the weighted prediction error (WPE), enabling the model to handle reverberated signals.
Close to Human Quality TTS with Transformer
TLDR
This paper introduces and adapts the multi-head attention mechanism to replace the RNN structures, as well as the original attention mechanism in Tacotron2, and achieves state-of-the-art performance close to human quality.
Espresso: A Fast End-to-End Neural Speech Recognition Toolkit
TLDR
Espresso achieves state-of-the-art ASR performance on the WSJ, LibriSpeech, and Switchboard data sets among other end-to-end systems without data augmentation, and is 4-11x faster for decoding than similar systems (e.g. ESPnet).
RETURNN as a Generic Flexible Neural Toolkit with Application to Translation and Speech Recognition
TLDR
It is shown that a layer-wise pretraining scheme for recurrent attention models gives over 1% absolute BLEU improvement and enables training deeper recurrent encoder networks.
The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS
TLDR
This paper revisits a naive approach for voice conversion by utilizing ESPnet, an open-source end-to-end speech processing toolkit, and the many well-configured pretrained models provided by the community, demonstrating the promising ability of seq2seq models to convert speaker identity.
FastSpeech: Fast, Robust and Controllable Text to Speech
TLDR
A novel feed-forward network based on Transformer, called FastSpeech, is proposed to generate mel-spectrograms in parallel for TTS; it speeds up mel-spectrogram generation by 270x and end-to-end speech synthesis by 38x.