Corpus ID: 240354134

Visualization: The Missing Factor in Simultaneous Speech Translation

Sara Papi, Matteo Negri, Marco Turchi
Simultaneous speech translation (SimulST) is the task in which output generation has to be performed on partial, incremental speech input. In recent years, SimulST has become popular due to the spread of multilingual application scenarios, like international live conferences and streaming lectures, in which on-the-fly speech translation can facilitate users’ access to audio-visual content. In this paper, we analyze the characteristics of the SimulST systems developed so far, discussing their… 

Tables from this paper

Related papers
Over-Generation Cannot Be Rewarded: Length-Adaptive Average Lagging for Simultaneous Speech Translation
Proposes LAAL (Length-Adaptive Average Lagging), a modified version of the Average Lagging metric that takes the over-generation phenomenon into account and allows unbiased evaluation of both under- and over-generating systems.
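The difference between the two metrics can be illustrated with a minimal sketch, assuming the standard token-level Average Lagging definition in a text-to-text setting; function names and the toy numbers below are illustrative, not taken from the paper's code.

```python
# Hedged sketch of Average Lagging (AL) vs. Length-Adaptive Average
# Lagging (LAAL); delays are counted in source tokens here.

def lagging(delays, src_len, gamma):
    """delays[i] = source tokens read before emitting target token i+1."""
    # tau: index of the first target token emitted with the full source read
    tau = next((i + 1 for i, d in enumerate(delays) if d >= src_len),
               len(delays))
    return sum(delays[i] - i / gamma for i in range(tau)) / tau

def al(delays, src_len, ref_len):
    # AL as commonly implemented for SimulST: gamma from the reference only
    return lagging(delays, src_len, ref_len / src_len)

def laal(delays, src_len, ref_len, hyp_len):
    # LAAL: gamma from max(hypothesis, reference) length, so producing
    # extra tokens can no longer drive the lag score down
    return lagging(delays, src_len, max(hyp_len, ref_len) / src_len)

# Over-generating system: reads 1 source token, emits 2 target tokens,
# producing 20 tokens against a 10-token reference.
delays = [i // 2 + 1 for i in range(20)]
print(al(delays, 10, 10))          # negative: over-generation rewarded
print(laal(delays, 10, 10, 20))    # positive: bias removed
```

In the toy run, classic AL goes negative (the over-generating system looks faster than an ideal policy), while LAAL stays positive for the same delays.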

Simultaneous Speech Translation for Live Subtitling: from Delay to Display
Argues that simultaneous translation for readable live subtitles still faces challenges, the main one being poor translation quality, and proposes directions for steering future research.

SIMULEVAL: An Evaluation Toolkit for Simultaneous Translation
Presents SimulEval, an easy-to-use and general evaluation toolkit for both simultaneous text and speech translation, equipped with a visualization interface that gives a better understanding of a system's simultaneous decoding process.

Dynamic Transcription for Low-Latency Speech Translation
Presents a novel scheme that drastically reduces the latency of a large-scale speech translation system; within this scheme, the transcribed text and its translation can be updated when more context becomes available, even after they have been presented to the user.

Direct Simultaneous Speech-to-Text Translation Assisted by Synchronized Streaming ASR
Proposes a new paradigm that uses two separate but synchronized decoders, one for streaming ASR and one for direct speech-to-text translation (ST), with the intermediate ASR results guiding the decoding policy of ST.

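The idea of one decoder's output steering another's policy can be sketched as a toy wait-k-style loop; this is an illustrative assumption about the paradigm, not the paper's actual model, and the stand-in "translator" lambda is purely hypothetical.

```python
# Toy sketch of two synchronized decoders: a streaming ASR decoder and an
# ST decoder run in parallel, and the length of the ASR transcript acts as
# the policy that tells ST when it may emit its next token.

def synchronized_st(asr_stream, st_step, k=2):
    """Emit the i-th translation token only once ASR has produced at
    least i + k words (a wait-k-style policy driven by ASR output)."""
    asr_out, st_out = [], []
    for word in asr_stream:                 # incremental ASR results
        asr_out.append(word)
        while len(st_out) + k <= len(asr_out):
            st_out.append(st_step(asr_out, st_out))
    return st_out

# Stand-in "translator" that just uppercases the aligned ASR word.
demo = synchronized_st(["guten", "morgen", "welt", "!"],
                       lambda asr, st: asr[len(st)].upper())
print(demo)  # ['GUTEN', 'MORGEN', 'WELT']
```

A real system would additionally flush the remaining target tokens once the source utterance ends; the loop above only shows the streaming phase.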
Re-Translation Strategies for Long Form, Simultaneous, Spoken Language Translation
Investigates simultaneous machine translation of long-form speech content in a continuous speech-to-text scenario, generating translated captions for a live audio feed, and adopts a re-translation approach to simultaneous translation.

Re-translation versus Streaming for Simultaneous Translation
Finds re-translation to be as good as or better than state-of-the-art streaming systems, even when operating under constraints that allow very few revisions.

Towards the evaluation of automatic simultaneous speech translation from a communicative perspective
Describes an experiment that evaluates the quality of a real-time speech translation engine by comparing it to the performance of professional simultaneous interpreters, adopting a framework developed for the assessment of human interpreters to manually evaluate both human and machine performance.

RealTranS: End-to-End Simultaneous Speech Translation with Convolutional Weighted-Shrinking Transformer
Proposes RealTranS, an end-to-end model for simultaneous speech translation that gradually downsamples the input speech with interleaved convolution and unidirectional Transformer layers for acoustic modeling, then maps speech features into text space with a weighted-shrinking operation and a semantic encoder.

Impact of Encoding and Segmentation Strategies on End-to-End Simultaneous Speech Translation
Extends a previously proposed end-to-end online decoding strategy and shows that, while replacing BLSTM with ULSTM encoding degrades performance in offline mode, it actually improves both efficiency and performance in online mode.

DuTongChuan: Context-aware Translation Model for Simultaneous Interpreting
Presents a model that constantly reads streaming text from an automatic speech recognition system while simultaneously determining the boundaries of Information Units (IUs) one after another, achieving promising translation quality, especially in terms of surprisingly good discourse coherence.