ESPnet-ONNX: Bridging a Gap Between Research and Production

Authors: Masao Someki, Yosuke Higuchi, Tomoki Hayashi, Shinji Watanabe
Venue: 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)

In the field of deep learning, researchers often focus on inventing novel neural network models and improving benchmarks. In contrast, application developers are interested in making models suitable for actual products, which involves optimizing a model for faster inference and adapting a model to various platforms (e.g., C++ and Python). In this work, to fill the gap between the two, we establish an effective procedure for optimizing a PyTorch-based research-oriented model for deployment… 

Attention Is All You Need

This paper proposes the Transformer, a simple network architecture based solely on attention mechanisms that dispenses with recurrence and convolutions entirely. It generalizes well to other tasks, as shown by its successful application to English constituency parsing with both large and limited training data.

An Empirical Study of Challenges in Converting Deep Learning Models

The general message of the findings is that DL developers should be cautious about deploying converted models, which may 1) perform poorly after switching from one framework to another, 2) be difficult to deploy robustly, or 3) run slowly, leading to poor quality of deployed DL-based software, including DL-based software maintenance tasks such as bug prediction.

ESPnet-SLU: Advancing Spoken Language Understanding Through ESPnet

This work enhances the toolkit to provide implementations for various SLU benchmarks that enable researchers to seamlessly mix-and-match different ASR and NLU models, and provides pretrained models with intensively tuned hyper-parameters that can match or even outperform the current state-of-the-art performances.

A Comparative Study on Transformer vs RNN in Speech Applications

The Transformer, an emergent sequence-to-sequence model, achieves state-of-the-art performance in neural machine translation and other natural language processing applications. This study compares it with RNNs in speech applications and reports the surprising superiority of the Transformer in 13 of 15 ASR benchmarks.

Recent Developments on ESPnet Toolkit Boosted by Conformer

This paper shows results for a wide range of end-to-end speech processing applications, such as automatic speech recognition (ASR), speech translation (ST), speech separation (SS), and text-to-speech (TTS).

Layer Pruning on Demand with Intermediate CTC

This work proposes a training and pruning method for ASR based on connectionist temporal classification (CTC) that allows model depth to be reduced at run-time without any extra fine-tuning. It shows that a Transformer-CTC model can be pruned to various depths on demand, improving the real-time factor from 0.005 to 0.002 on GPU.
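
For readers unfamiliar with the metric, the real-time factor (RTF) is conventionally the processing time divided by the duration of the audio processed, so lower is better. The sketch below is only an illustration of that arithmetic; the function names are hypothetical and do not come from the cited paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def speedup(rtf_before: float, rtf_after: float) -> float:
    """Return how many times faster decoding becomes when RTF drops."""
    if rtf_after <= 0:
        raise ValueError("rtf_after must be positive")
    return rtf_before / rtf_after


if __name__ == "__main__":
    # Hypothetical timing: 10 s of audio decoded in 0.05 s.
    print(real_time_factor(0.05, 10.0))
    # The RTF figures quoted in the summary above.
    print(speedup(0.005, 0.002))
```

Under this definition, moving from an RTF of 0.005 to 0.002 corresponds to roughly a 2.5-fold reduction in processing time for the same audio.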

Are Sixteen Heads Really Better than One?

This work makes the surprising observation that, even if models have been trained using multiple heads, a large percentage of attention heads can in practice be removed at test time without significantly impacting performance.

WeNet: Production Oriented Streaming and Non-Streaming End-to-End Speech Recognition Toolkit

An open source speech recognition toolkit called WeNet is proposed, in which a new two-pass approach named U2 is implemented to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model.

Low-bit Shift Network for End-to-End Spoken Language Understanding

To mitigate the high computation, memory, and power requirements of inference with convolutional neural networks (CNNs), this work proposes the use of power-of-two quantization, which quantizes continuous parameters into low-bit power-of-two values.
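
As a rough illustration of what power-of-two quantization means, the sketch below rounds each weight to the nearest signed power of two in the log domain, within a clipped exponent range. It is one common generic form of the idea, not necessarily the scheme used in the cited work; the function name and the `min_exp` and `max_exp` parameters are assumptions made for this example.

```python
import math


def quantize_pow2(w: float, min_exp: int = -6, max_exp: int = 0) -> float:
    """Map w to sign(w) * 2**e with e in [min_exp, max_exp], or to 0.0.

    Rounding is done on log2(|w|); magnitudes below half of the smallest
    representable power of two are flushed to zero (an assumed convention).
    """
    if w == 0.0:
        return 0.0
    mag = abs(w)
    if mag < 2.0 ** (min_exp - 1):
        return 0.0
    exp = round(math.log2(mag))
    exp = max(min_exp, min(max_exp, exp))  # clip to the allowed exponent range
    return math.copysign(2.0 ** exp, w)


if __name__ == "__main__":
    weights = [0.3, -0.7, 5.0, 0.001, 0.0]
    print([quantize_pow2(w) for w in weights])
```

Because every nonzero quantized weight is a power of two, multiplying by it can be implemented as a bit shift, which is where the savings in computation and power come from.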

SUPERB: Speech processing Universal PERformance Benchmark

This work presents a simple framework that solves SUPERB tasks by learning task-specialized lightweight prediction heads on top of a frozen shared model, chosen for its re-usability. Results demonstrate that the framework is promising, as SSL representations show competitive generalizability and accessibility across SUPERB tasks.