Speech Segmentation Optimization using Segmented Bilingual Speech Corpus for End-to-end Speech Translation

Ryo Fukuda, Katsuhito Sudoh, Satoshi Nakamura
Speech segmentation, which splits long speech into short segments, is essential for speech translation (ST). Popular voice activity detection (VAD) tools such as WebRTC VAD have generally relied on pause-based segmentation. Unfortunately, pauses in speech do not necessarily match sentence boundaries, and sentences can be connected by a very short pause that is difficult for VAD to detect. In this study, we propose a speech segmentation method using a binary classification model trained using a segmented bilingual speech…
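
The limitation of pause-based segmentation described above can be illustrated with a minimal sketch. It is not the authors' method and does not use the WebRTC VAD API; it assumes frame-level speech/non-speech labels are already available, and the function name `pause_based_segments` and the parameter `min_pause_frames` are hypothetical names chosen for illustration.

```python
from typing import List, Tuple


def pause_based_segments(
    is_speech: List[bool], min_pause_frames: int
) -> List[Tuple[int, int]]:
    """Split frame-level speech labels into (start, end) segments, end exclusive.

    A boundary is placed only where a run of non-speech frames is at least
    `min_pause_frames` long (hypothetical threshold); shorter pauses are
    absorbed into the surrounding segment.
    """
    segments: List[Tuple[int, int]] = []
    start = None        # first frame of the segment being built
    last_speech = None  # index of the most recent speech frame
    for i, speech in enumerate(is_speech):
        if not speech:
            continue
        if start is None:
            start = i
        elif i - last_speech - 1 >= min_pause_frames:
            # The pause before this frame is long enough: close the segment.
            segments.append((start, last_speech + 1))
            start = i
        last_speech = i
    if start is not None:
        segments.append((start, last_speech + 1))
    return segments


# Toy input: a 1-frame pause (index 3) and a 4-frame pause (indices 6-9).
frames = [True, True, True, False, True, True,
          False, False, False, False, True, True]
coarse = pause_based_segments(frames, min_pause_frames=3)
fine = pause_based_segments(frames, min_pause_frames=1)
```

With the threshold set to 3 frames, the one-frame pause is ignored and the first two stretches of speech are merged into one segment; lowering the threshold to 1 separates them. If that short pause marked a sentence boundary, the coarser setting would miss it, which is the kind of mismatch the abstract points to.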

A semi-Markov model for speech segmentation with an utterance-break prior
A novel semi-Markov model is developed which allows the segmentation of audio streams into speech utterances which are optimised for the desired distribution of sentence lengths for the target domain.
Beyond Voice Activity Detection: Hybrid Audio Segmentation for Direct Speech Translation
This paper proposes enhanced hybrid solutions that produce better results without sacrificing latency in direct speech translation, reducing by at least 30% the gap between the traditional VAD-based approach and optimal manual segmentation.
Improving speech translation with automatic boundary prediction
This paper uses prosodic and lexical cues to determine sentence boundaries, successfully combines two complementary approaches to sentence boundary prediction, and introduces a new feature for segmentation prediction that directly considers the assumptions of the phrase translation model.
Direct Segmentation Models for Streaming Speech Translation
This work proposes novel segmentation models for streaming ST that incorporate not only textual, but also acoustic information to decide when the ASR output is split into a chunk.
ESPnet-ST: All-in-One Speech Translation Toolkit
ESPnet-ST is a new project inside the end-to-end speech processing toolkit ESPnet, which integrates or newly implements automatic speech recognition, machine translation, and text-to-speech functions for speech translation.
Segmentation Strategies for Streaming Speech Translation
The study presented in this work is a first effort at real-time speech translation of TED talks, a compendium of public talks with different speakers addressing a variety of topics. It demonstrates that a good segmentation is useful and that a novel conjunction-based segmentation strategy improves translation quality nearly as much as other strategies such as comma-based segmentation.
ESPnet-ST IWSLT 2021 Offline Speech Translation System
This paper describes the ESPnet-ST group’s IWSLT 2021 submission in the offline speech translation track. The system adopts the Conformer encoder and the Multi-Decoder architecture, which equips dedicated decoders for speech recognition and translation tasks in a unified encoder-decoder model and enables search in both source and target language spaces during inference.
Online Sentence Segmentation for Simultaneous Interpretation using Multi-Shifted Recurrent Neural Network
A multi-shifted RNN is proposed to address the trade-off between accuracy and latency, which is one of the key characteristics of the task.
End-to-End Speech Translation with Pre-trained Models and Adapters: UPC at IWSLT 2021
This paper describes the UPC Machine Translation group's submission to the IWSLT 2021 offline speech translation task: an end-to-end speech translation system that combines pre-trained models with coupling modules between the encoder and decoder and uses an efficient fine-tuning technique.
MuST-C: a Multilingual Speech Translation Corpus
This paper presents MuST-C, a multilingual speech translation corpus whose size and quality will facilitate the training of end-to-end systems for SLT from English into 8 languages, together with an empirical verification of its quality and SLT results computed with a state-of-the-art approach on each language direction.