Segmenting Subtitles for Correcting ASR Segmentation Errors

David Wan, Chris Kedzie, Faisal Ladhak, Elsbeth Turcan, Petra Galuščáková, Elena Zotkina, Zhengping Jiang, Peter Bell, Kathleen McKeown
Typical ASR systems segment the input audio into utterances using purely acoustic information, which may not resemble the sentence-like units that are expected by conventional machine translation (MT) systems for Spoken Language Translation. In this work, we propose a model for correcting the acoustic segmentation of ASR models for low-resource languages to improve performance on downstream tasks. We propose the use of subtitles as a proxy dataset for correcting ASR acoustic segmentation… 
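The core idea of using subtitles as a proxy dataset can be illustrated as follows: subtitle line breaks provide free segment-boundary labels, so unsegmented text can be paired with token-level boundary tags for training a segmentation model. This is a minimal sketch under that assumption; the function name and label scheme are illustrative, not the paper's actual code.

```python
# Hypothetical sketch: turn subtitle lines into a token-level
# boundary-tagging example, where label 1 marks a token that ends
# a subtitle segment and label 0 marks a segment-internal token.

def subtitles_to_tagging_example(subtitle_lines):
    """Flatten subtitle lines into (tokens, boundary_labels)."""
    tokens, labels = [], []
    for line in subtitle_lines:
        words = line.split()
        if not words:
            continue
        tokens.extend(words)
        # All tokens are segment-internal except the last one.
        labels.extend([0] * (len(words) - 1) + [1])
    return tokens, labels

tokens, labels = subtitles_to_tagging_example(
    ["hello there", "how are you today"]
)
```

A sequence-labeling model trained on such pairs can then re-segment raw ASR output into sentence-like units before translation.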


Speech Segmentation Optimization using Segmented Bilingual Speech Corpus for End-to-end Speech Translation
Experimental results reveal that the proposed speech segmentation method is more suitable for cascade and end-to-end ST systems than conventional segmentation methods, and the hybrid approach further improves the translation performance.


Subtitles to Segmentation: Improving Low-Resource Speech-to-Text Translation Pipelines
This work focuses on improving ASR output segmentation in the context of low-resource language speech-to-text translation, and incorporates part-of-speech (POS) tag and dependency label information (derived from the unsegmented ASR outputs) into its segmentation model.
Optimizing sentence segmentation for spoken language translation
This work improves upon the previous work on automatically segmenting the ASR output in a way that is optimized for translation and argues that it might be necessary for different stages of a Spoken Language Translation (SLT) system to define their own optimal units.
Segmentation Strategies for Streaming Speech Translation
The study presented in this work is a first effort at real-time speech translation of TED talks, a compendium of public talks by different speakers on a variety of topics. It demonstrates that good segmentation is useful and that a novel conjunction-based segmentation strategy improves translation quality nearly as much as other strategies such as comma-based segmentation.
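A conjunction-based strategy of this kind can be sketched very simply: split a long ASR hypothesis before coordinating conjunctions, subject to a minimum segment length. The conjunction list and threshold below are assumptions for illustration, not the paper's configuration.

```python
# Toy illustration of a conjunction-based segmentation strategy:
# start a new segment before a coordinating conjunction, but only
# if the current segment is already long enough.
CONJUNCTIONS = {"and", "but", "or", "so"}

def split_on_conjunctions(words, min_len=3):
    segments, current = [], []
    for w in words:
        if w.lower() in CONJUNCTIONS and len(current) >= min_len:
            segments.append(current)
            current = []
        current.append(w)
    if current:
        segments.append(current)
    return segments

segments = split_on_conjunctions("i went home and i slept well".split())
```

In a streaming setting such a rule is attractive because it needs no lookahead beyond the current word.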
Automatic linguistic segmentation of conversational speech
  • A. Stolcke, Elizabeth Shriberg
  • Linguistics, Computer Science
    Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96
  • 1996
A simple automatic segmenter of transcripts based on N-gram language modeling achieves 85% recall and 70% precision on linguistic boundary detection; the work also studies the relevance of several word-level features for segmentation performance.
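The N-gram approach can be illustrated with a toy bigram model: a boundary is hypothesized between two words when the model scores the sequence with an inserted sentence break higher than the unbroken sequence. The class name, add-one smoothing, and toy training data below are assumptions for the demo, not the original system.

```python
import math
from collections import Counter

# Illustrative bigram-LM segmenter in the spirit of N-gram boundary
# detection: compare P(w1 w2) against P(w1 </s>) * P(<s> w2).
class BigramSegmenter:
    def __init__(self, segmented_sentences):
        self.bigrams = Counter()
        self.unigrams = Counter()
        for sent in segmented_sentences:
            words = ["<s>"] + sent.split() + ["</s>"]
            for a, b in zip(words, words[1:]):
                self.bigrams[(a, b)] += 1
                self.unigrams[a] += 1

    def logp(self, a, b):
        # Add-one smoothing over the observed vocabulary.
        v = len(self.unigrams) + 1
        return math.log((self.bigrams[(a, b)] + 1) /
                        (self.unigrams[a] + v))

    def boundary_after(self, w1, w2):
        no_break = self.logp(w1, w2)
        with_break = self.logp(w1, "</s>") + self.logp("<s>", w2)
        return with_break > no_break

seg = BigramSegmenter(["hello there"] * 5 + ["how are you"] * 5)
```

With this toy training data, the model prefers a break between "there" and "how" (each ends/starts a training sentence) but not inside "hello there".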
Punctuation insertion for real-time spoken language translation
The successful integration of an attentional encoder-decoder-based segmentation and punctuation insertion model into a real-time spoken language translation system is shown; adopting the NMT-based punctuation model improves translation performance by 1.3 BLEU points while maintaining low latency.
Untranscribed Web Audio for Low Resource Speech Recognition
A method to force the base model to overgenerate possible transcriptions, relying on the ability of LF-MMI to deal with uncertainty, outperforms the standard semisupervised method and yields significant gains when adapting for mismatched bandwidth and domain.
KIT’s IWSLT 2020 SLT Translation System
KIT’s submissions to the IWSLT 2020 Speech Translation evaluation campaign are described; its simultaneous models are Transformer-based and can be efficiently trained to obtain low latency with minimal compromise in quality.
End-to-End Speech-Translation with Knowledge Distillation: FBK@IWSLT2020
FBK’s participation in the IWSLT 2020 offline speech translation (ST) task, which evaluates systems’ ability to translate audio of English TED talks into German text, is described, with strong results compared to recent work.
Sequence-to-Sequence Models Can Directly Translate Foreign Speech
A recurrent encoder-decoder deep neural network architecture that directly translates speech in one language into text in another, illustrating the power of attention-based models.