Cascaded Models with Cyclic Feedback for Direct Speech Translation

  • Tsz Kin Lam, Shigehiko Schamoni, Stefan Riezler
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Direct speech translation describes a scenario where only speech inputs and corresponding translations are available. Such data are notoriously limited. We present a technique that allows cascades of automatic speech recognition (ASR) and machine translation (MT) to exploit in-domain direct speech translation data in addition to out-of-domain MT and ASR data. After pre-training MT and ASR, we use a feed-back cycle where the downstream performance of the MT system is used as a signal to improve… 


Sample, Translate, Recombine: Leveraging Audio Alignments for Data Augmentation in End-to-end Speech Translation
A novel approach to data augmentation that leverages audio alignments, linguistic properties, and translation is presented; it delivers consistent improvements of up to 0.9 and 1.1 BLEU points on top of augmentation with knowledge distillation on five language pairs on CoVoST 2 and on two language pairs on Europarl-ST.
Non-Parametric Domain Adaptation for End-to-End Speech Translation
A novel non-parametric method that leverages a domain-specific text translation corpus to achieve domain adaptation for the E2E-ST system is proposed; when only in-domain text translation data is involved, this approach improves the baseline by 12.82 BLEU on average.
STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation
The Speech-TExt Manifold Mixup (STEMM) method effectively alleviates the cross-modal representation discrepancy, and achieves significant improvements over a strong baseline on eight translation directions.


Attention-Passing Models for Robust and Data-Efficient End-to-End Speech Translation
This paper demonstrates that direct speech translation models require more data to perform well, and proposes an attention-passing technique that alleviates error propagation issues in a previous formulation of a model with two attention stages and exploits auxiliary training data much more effectively than direct attentional models.
Harnessing Indirect Training Data for End-to-End Automatic Speech Translation: Tricks of the Trade
Simple data augmentation by translating ASR transcripts proves most effective on the English–French augmented LibriSpeech dataset, closing the performance gap from 8.2 to 1.4 BLEU, compared to a very strong cascade that could directly utilize copious ASR and MT data.
Sequence-to-Sequence Models Can Directly Translate Foreign Speech
A recurrent encoder-decoder deep neural network architecture is presented that directly translates speech in one language into text in another, illustrating the power of attention-based models.
Leveraging Weakly Supervised Data to Improve End-to-end Speech-to-text Translation
  • Ye Jia, Melvin Johnson, Yonghui Wu
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
It is demonstrated that a high quality end-to-end ST model can be trained using only weakly supervised datasets, and that synthetic data sourced from unlabeled monolingual text or speech can be used to improve performance.
End-to-End Automatic Speech Translation of Audiobooks
Experimental results show that it is possible to train compact and efficient end-to-end speech translation models in this setup; the authors hope that the speech translation baseline on this corpus will be challenged in the future.
Augmenting Translation Models with Simulated Acoustic Confusions for Improved Spoken Language Translation
This work proposes a novel technique for adapting text-based statistical machine translation to deal with input from automatic speech recognition in spoken language translation tasks, and finds consistent and significant improvements in translation quality.
ASR Error Correction and Domain Adaptation Using Machine Translation
This work proposes a simple technique to perform domain adaptation for ASR error correction via machine translation, and uses two off-the-shelf ASR systems: Google ASR (commercial) and the ASPIRE model (open-source).
A Comparative Study on End-to-End Speech to Text Translation
An overview of different end-to-end architectures, as well as the usage of an auxiliary connectionist temporal classification (CTC) loss for better convergence, is provided.
A Unified Approach in Speech-to-Speech Translation: Integrating Features of Speech recognition and Machine Translation
The experimental results show significant improvement over the baseline IBM Model 4 in all automatic translation evaluation metrics, including BLEU, NIST, multiple-reference word error rate, and its position-independent counterpart.
Why word error rate is not a good metric for speech recognizer training for the speech translation task?
  • Xiaodong He, L. Deng, A. Acero
  • Computer Science
    2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2011
It is suggested that the speech recognizer component of the full ST system should be optimized by translation metrics instead of the traditional WER; BLEU-oriented global optimization of ASR system parameters improves the translation quality by an absolute 1.5% BLEU score.