
UniST: Unified End-to-end Model for Streaming and Non-streaming Speech Translation

  • Qianqian Dong, Yaoming Zhu, Mingxuan Wang, Lei Li
  • Published 15 September 2021
  • Computer Science, Engineering
  • ArXiv
This paper presents a unified end-to-end framework for both streaming and non-streaming speech translation. While training recipes for non-streaming speech translation are mature, recipes for streaming speech translation have yet to be established. In this work, we focus on developing a unified model (UniST) that supports streaming and non-streaming ST from the perspective of fundamental components, including the training objective, attention mechanism, and decoding policy. Experiments on…
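The abstract names the decoding policy as one of the components that must differ between streaming and non-streaming operation. The paper's own policy is not given here, so as a hedged illustration, the following sketch shows the classic wait-k schedule (a common baseline in simultaneous translation): read k source tokens first, then alternate one write per read until the source is exhausted. The function and its names are illustrative, not UniST's actual algorithm.

```python
def wait_k_schedule(num_source, num_target, k):
    """Return the read ("R") / write ("W") action sequence of a
    wait-k simultaneous decoding policy: first read k source tokens,
    then alternate one target write per source read; once the source
    is fully read, write the remaining target tokens."""
    actions = []
    read, written = 0, 0
    while written < num_target:
        # Read as long as we lag fewer than k tokens behind and
        # source tokens remain; otherwise emit a target token.
        if read < min(k + written, num_source):
            actions.append("R")
            read += 1
        else:
            actions.append("W")
            written += 1
    return actions

# With 5 source tokens, 5 target tokens, and k=2, the policy reads
# two tokens up front, then interleaves reads and writes:
print("".join(wait_k_schedule(5, 5, 2)))  # RRWRWRWRWW
```

A non-streaming model corresponds to the limiting case where k is at least the source length, so all reads happen before any write; a unified model can in principle train across several values of k to cover both regimes.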

Figures and Tables from this paper


Leveraging Weakly Supervised Data to Improve End-to-end Speech-to-text Translation
  • Ye Jia, Melvin Johnson, +6 authors Yonghui Wu
  • Computer Science, Engineering
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
It is demonstrated that a high quality end-to-end ST model can be trained using only weakly supervised datasets, and that synthetic data sourced from unlabeled monolingual text or speech can be used to improve performance.
End-to-End Automatic Speech Translation of Audiobooks
Experimental results show that it is possible to train compact and efficient end-to-end speech translation models in this setup; the authors hope that the speech translation baseline on this corpus will be challenged in the future.
Bridging the Modality Gap for Speech-to-Text Translation
This work decouples the speech translation encoder into three parts and introduces a shrink mechanism to match the length of the speech representation with that of the corresponding text transcription, achieving new state-of-the-art performance.
Listen, Understand and Translate: Triple Supervision Decouples End-to-end Speech-to-text Translation
Listen-Understand-Translate (LUT), a unified framework with triple supervision signals to decouple the end-to-end speech-to-text translation task, achieves state-of-the-art performance, outperforming previous methods.
Consecutive Decoding for Speech-to-text Translation
This work proposes COnSecutive Transcription and Translation (COSTT), an integral framework for speech-to-text translation that outperforms previous state-of-the-art methods.
Synchronous Speech Recognition and Speech-to-Text Translation with Interactive Decoding
A novel interactive attention mechanism is proposed that enables ASR and ST to run synchronously and interactively in a single model, outperforming strong baselines in speech translation quality while also achieving better speech recognition performance.
Data Augmentation for End-to-End Speech Translation: FBK@IWSLT ’19
This paper describes FBK’s submission to the end-to-end speech translation (ST) task at IWSLT 2019. The task consists of the “direct” translation (i.e. without an intermediate discrete representation)…
Curriculum Pre-training for End-to-End Speech Translation
This work proposes a curriculum pre-training method that includes an elementary course for transcription learning and two advanced courses for understanding the utterance and mapping words in two languages and shows that this method leads to significant improvements on En-De and En-Fr speech translation benchmarks.
Streaming Simultaneous Speech Translation with Augmented Memory Transformer
This paper proposes an end-to-end transformer-based sequence-to-sequence model, equipped with an augmented memory transformer encoder, which has shown great success on the streaming automatic speech recognition task with hybrid or transducer-based models.
Harnessing Indirect Training Data for End-to-End Automatic Speech Translation: Tricks of the Trade
Simple data augmentation by translating ASR transcripts proves most effective on the English--French augmented LibriSpeech dataset, closing the performance gap from 8.2 to 1.4 BLEU, compared to a very strong cascade that could directly utilize copious ASR and MT data.