Streaming Simultaneous Speech Translation with Augmented Memory Transformer

  • Xutai Ma, Yongqiang Wang, Mohammad Javad Dousti, Philipp Koehn, Juan Miguel Pino
  • Published 30 October 2020
  • Computer Science
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Transformer-based models have achieved state-of-the-art performance on speech translation tasks. However, the model architecture is not efficient enough for streaming scenarios since self-attention is computed over an entire input sequence and the computational cost grows quadratically with the length of the input sequence. Nevertheless, most of the previous work on simultaneous speech translation, the task of generating translations from partial audio input, ignores the time spent in…
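The quadratic growth mentioned in the abstract can be made concrete with a little arithmetic. The sketch below counts query-key pairs for full self-attention and, for contrast, for a simplified segment-based scheme in which each segment attends only to itself and to a bounded bank of one-vector summaries of earlier segments. This is an illustrative accounting only, not the paper's model: the function names and the `segment_len` and `max_memories` parameters are hypothetical.

```python
def full_attention_pairs(seq_len: int) -> int:
    """Query-key pairs scored when every frame attends to every frame."""
    return seq_len * seq_len


def segmented_attention_pairs(seq_len: int, segment_len: int, max_memories: int) -> int:
    """Query-key pairs when each segment attends to itself plus a bounded memory bank.

    Assumption (hypothetical simplification): each processed segment adds one
    memory vector, and at most `max_memories` of them are kept.
    """
    pairs = 0
    num_memories = 0
    for start in range(0, seq_len, segment_len):
        current = min(segment_len, seq_len - start)  # last segment may be shorter
        pairs += current * (current + num_memories)
        num_memories = min(num_memories + 1, max_memories)
    return pairs


if __name__ == "__main__":
    for n in (1000, 2000):
        print(n, full_attention_pairs(n), segmented_attention_pairs(n, 100, 3))
```

Under these assumptions, doubling the input length quadruples the full-attention count, while the segmented count roughly doubles, since the work per segment is bounded.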

Direct Simultaneous Speech-to-Text Translation Assisted by Synchronized Streaming ASR
This work proposes a new paradigm to use two separate, but synchronized, decoders on streaming ASR and direct speech-to-text translation (ST), respectively, and the intermediate results of ASR guide the decoding policy of ST.
Incremental Speech Synthesis For Speech-To-Speech Translation
This work focuses on improving the incremental synthesis performance of TTS models, proposes latency metrics tailored to S2ST applications, and investigates methods for latency reduction in this context.
RealTranS: End-to-End Simultaneous Speech Translation with Convolutional Weighted-Shrinking Transformer
This work proposes RealTranS, an end-to-end model for SST that gradually downsamples the input speech with interleaved convolution and unidirectional Transformer layers for acoustic modeling, and then maps speech features into text space with a weighted-shrinking operation and a semantic encoder.
Direct simultaneous speech to speech translation
We present the first direct simultaneous speech-to-speech translation (Simul-S2ST) model, with the ability to start generating translation in the target speech before consuming the full source speech…
Beyond Sentence-Level End-to-End Speech Translation: Context Helps
This work conducts extensive experiments using a simple concatenation-based context-aware ST model, paired with adaptive feature selection on speech encodings for computational efficiency, and demonstrates the effectiveness of context to E2E speech translation.
Automatic Simultaneous Translation Challenges, Recent Advances, and Future Directions
  • Liang Huang
  • 2021
Simultaneous translation (ST) outputs the translation simultaneously while reading the input sentence, which is an important component of simultaneous interpretation. In this paper, we describe our…
Contextualize Knowledge Bases with Transformer for End-to-end Task-Oriented Dialogue Systems
This work proposes Context-aware Memory Enhanced Transformer (CMET), which can effectively aggregate information from the dialogue history and knowledge bases to generate more accurate responses and can achieve superior performance over the state-of-the-art methods.
XMU’s Simultaneous Translation System at NAACL 2021
This paper describes our two systems submitted to the simultaneous translation evaluation at the 2nd automatic simultaneous translation workshop.
UniST: Unified End-to-end Model for Streaming and Non-streaming Speech Translation
  • Qianqian Dong, Yaoming Zhu, Mingxuan Wang, Lei Li
  • Computer Science, Engineering
  • ArXiv
  • 2021
Experiments on the most popular speech-to-text translation benchmark dataset, MuST-C, show that UniST achieves significant improvement for non-streaming ST, and a better-learned tradeoff for BLEU score and latency metrics for streaming ST, compared with end-to-end baselines and the cascaded models.


Streaming Transformer-based Acoustic Models Using Self-attention with Augmented Memory
This work proposes a novel augmented memory self-attention, which attends on a short segment of the input sequence and a bank of memories that stores the embedding information for all the processed segments.
SimulSpeech: End-to-End Simultaneous Speech to Text Translation
Experiments on MuST-C English-Spanish and English-German spoken language translation datasets show that SimulSpeech achieves reasonable BLEU scores and lower delay compared to full-sentence end-to-end speech to text translation (without simultaneous translation), and better performance than the two-stage cascaded simultaneous translation model in terms of BLEU scores and translation delay.
Lecture Translator - Speech translation framework for simultaneous lecture translation
This work presents a system that performs simultaneous speech translation of university lectures by translating a stream of audio in real time and with low latency, and that features several techniques beyond the basic speech translation task that make it fit for real-world use.
Attention is All you Need
This work proposes a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
Monotonic Infinite Lookback Attention for Simultaneous Machine Translation
This work presents the first simultaneous translation system to learn an adaptive schedule jointly with a neural machine translation (NMT) model that attends over all source tokens read thus far, and shows that MILk’s adaptive schedule allows it to arrive at latency-quality trade-offs that are favorable to those of a recently proposed wait-k strategy for many latency values.
MuST-C: a Multilingual Speech Translation Corpus
This work creates MuST-C, a multilingual speech translation corpus whose size and quality will facilitate the training of end-to-end systems for SLT from English into 8 languages, and presents an empirical verification of its quality along with SLT results computed with a state-of-the-art approach on each language direction.
Simple, lexicalized choice of translation timing for simultaneous speech translation
This work proposes a method that uses lexicalized information to perform translation unit segmentation considering the relationship between the source and target languages and confirms that the proposed method significantly reduces delay for Japanese-English and French-English translation.
Learning to Translate in Real-time with Neural Machine Translation
This work proposes a neural machine translation (NMT) framework for simultaneous translation in which an agent learns to decide when to translate through interaction with a pre-trained NMT environment.
Monotonic Multihead Attention
This paper proposes a new attention mechanism, Monotonic Multihead Attention (MMA), which extends the monotonic attention mechanism to multihead attention and introduces two novel and interpretable approaches for latency control that are specifically designed for multiple attention heads.
Simultaneous translation of lectures and speeches
It is concluded that machines can already deliver comprehensible simultaneous translation output and while machine performance is affected by recognition errors (and thus can be improved), human performance is limited by the cognitive challenge of performing the task in real time.