Corpus ID: 238583665

A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation

Yosuke Higuchi, Nanxin Chen, Yuya Fujita, Hirofumi Inaguma, Tatsuya Komatsu, Jaesong Lee, Jumon Nozaki, Tianzi Wang, Shinji Watanabe
Non-autoregressive (NAR) models generate multiple outputs in a sequence simultaneously, which significantly reduces inference time at the cost of an accuracy drop compared to autoregressive (AR) baselines. Showing great potential for real-time applications, an increasing number of NAR models have been explored in different fields to mitigate the performance gap against AR models. In this work, we conduct a comparative study of various NAR modeling methods for end-to-end automatic speech recognition …
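The single-pass generation described above can be illustrated with CTC greedy decoding, the simplest NAR method: every frame's label is picked in one shot, with no left-to-right dependence on previously emitted tokens. A minimal sketch (toy vocabulary and logits are illustrative, not from the paper):

```python
import numpy as np

def ctc_greedy_decode(log_probs: np.ndarray, blank: int = 0) -> list:
    """Non-autoregressive decoding: each frame's label is chosen
    independently in a single pass, so the whole sequence is emitted
    at once rather than token by token.

    log_probs: (T, V) per-frame log-probabilities over the vocabulary.
    """
    # Best label per frame, computed for all frames simultaneously
    # (no loop over previously generated tokens, unlike an AR decoder).
    best = log_probs.argmax(axis=-1)
    # Standard CTC collapse: merge repeats, then drop blanks.
    out, prev = [], None
    for t in best:
        if t != prev and t != blank:
            out.append(int(t))
        prev = t
    return out

# Toy example: 6 frames, vocabulary {0: blank, 1: 'a', 2: 'b'}.
T, V = 6, 3
logits = np.full((T, V), -5.0)
for t, lab in enumerate([1, 1, 0, 2, 2, 0]):
    logits[t, lab] = 0.0
print(ctc_greedy_decode(logits))  # [1, 2]
```

An AR decoder would instead run T sequential steps, each conditioned on the tokens generated so far; this frame-parallel argmax is what buys NAR models their speed.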

Align-Denoise: Single-Pass Non-Autoregressive Speech Recognition
Deep autoregressive models have become comparable or superior to conventional systems for automatic speech recognition; however, at inference time they still suffer from …
Recent Developments on Espnet Toolkit Boosted By Conformer
This paper shows results for a wide range of end-to-end speech processing applications, such as automatic speech recognition (ASR), speech translation (ST), speech separation (SS), and text-to-speech (TTS).
State-of-the-Art Speech Recognition with Sequence-to-Sequence Models
  • C. Chiu, T. Sainath, +11 authors M. Bacchiani
  • Computer Science, Engineering
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
A variety of structural and optimization improvements to the Listen, Attend, and Spell model are explored, which significantly improve performance, and a multi-head attention architecture is introduced, which offers improvements over the commonly used single-head attention.
Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition
This work proposes a spike-triggered non-autoregressive transformer model for end-to-end speech recognition, which introduces a CTC module to predict the length of the target sequence and accelerate the convergence.
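The length-prediction idea can be sketched as follows: CTC posteriors are peaky, so each output token typically appears as one sharp non-blank spike, and counting spikes estimates the target length. This is a hedged illustration of the general mechanism, not the paper's exact criterion (the function name and threshold are assumptions):

```python
import numpy as np

def predict_length_from_spikes(posteriors: np.ndarray, blank: int = 0,
                               threshold: float = 0.5) -> int:
    """Estimate the target-sequence length from CTC spikes.

    posteriors: (T, V) per-frame probabilities from a CTC head.
    """
    nonblank = 1.0 - posteriors[:, blank]   # spike strength per frame
    above = nonblank > threshold
    # Count rising edges so a spike spread over adjacent frames counts once.
    rising = np.count_nonzero(above[1:] & ~above[:-1]) + int(above[0])
    return int(rising)

# Toy posteriors (T=5, V=2): two clear non-blank spikes -> length 2.
p = np.array([[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.2, 0.8], [0.9, 0.1]])
print(predict_length_from_spikes(p))  # 2
```

Knowing the length up front is what lets the non-autoregressive decoder emit all target positions in parallel.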
Attention-Based Models for Speech Recognition
The attention-mechanism is extended with features needed for speech recognition and a novel and generic method of adding location-awareness to the attention mechanism is proposed to alleviate the issue of high phoneme error rate.
Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition
  • Linhao Dong, Shuang Xu, Bo Xu
  • Computer Science
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
The Speech-Transformer is presented: a no-recurrence sequence-to-sequence model that relies entirely on attention mechanisms to learn positional dependencies and can be trained faster and more efficiently. A 2D-Attention mechanism is also introduced, which jointly attends to the time and frequency axes of the 2-dimensional speech inputs, providing more expressive representations for the Speech-Transformer.
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
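SpecAugment's two masking operations are simple enough to sketch directly: zero out random frequency bands and random time spans of a log-mel spectrogram. The mask-size defaults below are illustrative, not the paper's tuned policies, and the time-warping step from the original paper is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def spec_augment(spec: np.ndarray, F: int = 8, T: int = 20,
                 n_freq_masks: int = 2, n_time_masks: int = 2) -> np.ndarray:
    """Apply frequency and time masking to a (frames, bins) spectrogram.

    Masked regions are set to zero; a copy is returned so the input
    features are left untouched.
    """
    out = spec.copy()
    n_frames, n_bins = out.shape
    for _ in range(n_freq_masks):
        f = rng.integers(0, F + 1)              # mask width in bins
        f0 = rng.integers(0, max(1, n_bins - f))
        out[:, f0:f0 + f] = 0.0                 # zero a frequency band
    for _ in range(n_time_masks):
        t = rng.integers(0, T + 1)              # mask width in frames
        t0 = rng.integers(0, max(1, n_frames - t))
        out[t0:t0 + t, :] = 0.0                 # zero a time span
    return out
```

Because the masks are applied to the features rather than the waveform, the augmentation is cheap and can run on the fly inside the training loop.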
The 2020 ESPnet Update: New Features, Broadened Applications, Performance Improvements, and Future Plans
The recent development of ESPnet is described: an end-to-end speech processing toolkit that includes text-to-speech (TTS), voice conversion (VC), speech translation (ST), and speech enhancement (SE), with support for beamforming, speech separation, denoising, and dereverberation.
Align-Refine: Non-Autoregressive Speech Recognition via Iterative Realignment
Align-Refine is an end-to-end Transformer which iteratively realigns connectionist temporal classification (CTC) alignments and reaches an LM-free test-other WER of 9.0% in three iterations.
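The iterative-realignment control flow can be sketched in a few lines: start from an initial CTC alignment, then repeatedly pass the current alignment (plus the encoder output) through a refinement decoder. The interfaces below are hypothetical stand-ins, not Align-Refine's actual API:

```python
def align_refine(encoder_out, ctc_decode, refiner, n_iters: int = 3):
    """Sketch of iterative realignment.

    ctc_decode: produces the initial alignment from the encoder output.
    refiner:    maps (encoder output, current alignment) -> new alignment,
                updating every position in parallel at each iteration.
    """
    alignment = ctc_decode(encoder_out)
    for _ in range(n_iters):
        alignment = refiner(encoder_out, alignment)
    return alignment

# Dummy components to show the control flow (real models replaced by stubs):
dummy_decode = lambda enc: [0, 0, 0]
dummy_refine = lambda enc, ali: [a + 1 for a in ali]
print(align_refine(None, dummy_decode, dummy_refine, n_iters=3))  # [3, 3, 3]
```

Each iteration is itself non-autoregressive, so the total cost is a small constant number of parallel passes rather than one pass per output token.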
Audio augmentation for speech recognition
This paper investigates audio-level speech augmentation methods which directly process the raw signal, and presents results on 4 different LVCSR tasks with training data ranging from 100 hours to 1000 hours, to examine the effectiveness of audio augmentation in a variety of data scenarios.