Multilingual Speech Translation with Unified Transformer: Huawei Noah’s Ark Lab at IWSLT 2021

Xingshan Zeng, Liangyou Li and Qun Liu
This paper describes the system submitted to the IWSLT 2021 Multilingual Speech Translation (MultiST) task from Huawei Noah’s Ark Lab. We use a unified transformer architecture for our MultiST model, so that data from different modalities (i.e., speech and text) and different tasks (i.e., Speech Recognition, Machine Translation, and Speech Translation) can be exploited to enhance the model’s ability. Specifically, speech and text inputs are first fed to different feature extractors to… 
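The unified design in the abstract can be sketched structurally. Everything below (function names, the frame-averaging front end, the stub encoder) is an illustrative assumption, not the authors' implementation: the point is only that modality-specific front ends map speech frames and text tokens into a common embedding space, after which one shared stack serves ASR, MT, and ST alike.

```python
# Illustrative sketch of a unified multi-modal/multi-task front end.
# NOT the paper's code: a real model uses conv layers and transformer
# blocks; here both paths just have to emit same-width vectors so a
# single shared stack can consume either modality.

def speech_extractor(frames):
    """Stand-in for a conv front end: average adjacent frames (2x downsample)."""
    return [[(a + b) / 2 for a, b in zip(frames[i], frames[i + 1])]
            for i in range(0, len(frames) - 1, 2)]

def text_extractor(tokens, vocab, dim=4):
    """Stand-in for an embedding table: a fixed one-hot vector per token."""
    return [[1.0 if j == vocab[t] % dim else 0.0 for j in range(dim)]
            for t in tokens]

def shared_encoder(embeddings):
    """Stub for the shared transformer stack used by ASR, MT, and ST."""
    return embeddings  # a real model applies self-attention layers here

# Both modalities land in the same space, so the shared stack is reusable:
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(shared_encoder(speech_extractor(frames)))
print(shared_encoder(text_extractor(["hi", "there"], {"hi": 0, "there": 1}, dim=2)))
```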


Multilingual Simultaneous Speech Translation
This work investigates multilingual models and different architectures for their ability to perform online speech translation, and shows that the approach generalizes to different architectures and leads to smaller translation quality losses after adapting to the online model.
Findings of the IWSLT 2021 Evaluation Campaign
This paper describes each shared task, its data and evaluation metrics, and reports results of the received submissions to the IWSLT 2021 evaluation campaign.


End-to-End Speech Translation with Knowledge Distillation
This paper proposes a knowledge distillation approach that improves an end-to-end speech translation model by transferring knowledge from a text translation model.
Fairseq S2T: Fast Speech-to-Text Modeling with Fairseq
State-of-the-art RNN-based and Transformer-based models, along with open-source detailed training recipes, are implemented and seamlessly integrated into S2T workflows for multi-task learning or transfer learning.
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
SentencePiece, a language-independent subword tokenizer and detokenizer designed for Neural-based text processing, finds that it is possible to achieve comparable accuracy to direct subword training from raw sentences.
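The language-independent idea behind SentencePiece can be illustrated with a toy segmenter: treat the text as a raw character stream, replace spaces with the meta symbol "▁", and split greedily against a subword vocabulary. This is only a sketch under a hand-picked toy vocabulary; the real library trains unigram or BPE models from raw sentences and is not used here.

```python
# Toy SentencePiece-style subword segmentation (NOT the real library).
# Spaces become the meta symbol "▁" so no language-specific pre-tokenizer
# is needed; a greedy longest-match over a fixed vocabulary then splits
# the character stream, with unknown characters falling back to themselves.

VOCAB = {"▁spe", "ech", "▁trans", "lation"}

def segment(text, vocab=VOCAB):
    """Greedy longest-match subword segmentation over a raw character stream."""
    stream = text.replace(" ", "▁")
    if not stream.startswith("▁"):
        stream = "▁" + stream          # mark the first word boundary too
    pieces, i = [], 0
    while i < len(stream):
        for j in range(len(stream), i, -1):   # try the longest candidate first
            if stream[i:j] in vocab:
                pieces.append(stream[i:j])
                i = j
                break
        else:
            pieces.append(stream[i])   # unknown character: emit as-is
            i += 1
    return pieces

print(segment("speech translation"))
```

Decoding is the inverse: concatenate the pieces and turn "▁" back into spaces, which is what makes the scheme lossless and detokenization trivial.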
Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition
The Conv-Transformer Transducer architecture achieves competitive performance on the LibriSpeech dataset (3.6% WER on test-clean) without external language models.
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
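Because SpecAugment operates directly on the feature inputs, its two core masking operations fit in a few lines. This is a minimal sketch under assumed toy parameters (`max_f`, `max_t` are illustrative names), not the paper's implementation, which also includes time warping and multiple masks:

```python
import random

def spec_augment(features, max_f=2, max_t=3, seed=0):
    """Sketch of SpecAugment's two masking ops on a [time][freq] log-mel
    matrix: zero out one random band of frequency channels and one random
    span of time frames, directly on the feature inputs."""
    rng = random.Random(seed)
    T, F = len(features), len(features[0])
    out = [row[:] for row in features]   # leave the input untouched
    # frequency mask: zero channels [f0, f0 + f)
    f = rng.randint(0, max_f)
    f0 = rng.randint(0, F - f)
    for row in out:
        for j in range(f0, f0 + f):
            row[j] = 0.0
    # time mask: zero frames [t0, t0 + t)
    t = rng.randint(0, max_t)
    t0 = rng.randint(0, T - t)
    for i in range(t0, t0 + t):
        out[i] = [0.0] * F
    return out
```

Because the masks are sampled per utterance per epoch, the model never sees exactly the same features twice, which is what gives the regularization effect the summaries above describe.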
Attention is All you Need
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as demonstrated by successful application to English constituency parsing with both large and limited training data.
Improving Sequence-To-Sequence Speech Recognition Training with On-The-Fly Data Augmentation
This paper examines the influence of three data augmentation methods, including a time perturbation in the frequency domain and sub-sequence sampling, on the performance of two S2S model architectures.
On Using SpecAugment for End-to-End Speech Translation
This work investigates a simple data augmentation technique, SpecAugment, for end-to-end speech translation, finding that it alleviates overfitting to some extent and leads to significant improvements under various data conditions, irrespective of the amount of training data.
Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks
This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby solving both problems of sequence learning and post-processing.
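The key to labelling unsegmented sequences is CTC's collapsing rule, which maps a per-frame labelling (including a blank symbol) to an output sequence by merging repeats and dropping blanks. The sketch below shows only this decoding rule, with an assumed "_" blank symbol; the paper's contribution is the differentiable loss that sums over all frame labellings collapsing to the target, which is not reproduced here.

```python
BLANK = "_"  # assumed blank symbol for this sketch

def ctc_collapse(frame_labels, blank=BLANK):
    """CTC collapsing rule: merge consecutive repeats, then drop blanks,
    so a per-frame labelling of unsegmented input yields an output string
    with no separate segmentation or post-processing step."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return "".join(out)

print(ctc_collapse(list("__hh_e__ll_lloo_")))
```

Note how the blank lets the rule distinguish a genuine double letter ("l_l" collapses to "ll") from one label stretched over several frames ("ll" collapses to "l").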
Audio augmentation for speech recognition
This paper investigates audio-level speech augmentation methods which directly process the raw signal, and presents results on 4 different LVCSR tasks with training data ranging from 100 hours to 1000 hours, to examine the effectiveness of audio augmentation in a variety of data scenarios.
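Processing the raw signal directly, as that paper does, amounts to resampling the waveform. Below is a minimal sketch of speed perturbation via linear-interpolation resampling; real pipelines use a proper resampler (e.g., sox) with anti-aliasing, which this toy omits, and the function name is an assumption.

```python
def speed_perturb(signal, factor):
    """Raw-signal speed perturbation sketch: resample the waveform by
    `factor` with linear interpolation. factor 0.9 slows the audio down
    (more samples), factor 1.1 speeds it up (fewer samples)."""
    n_out = int(len(signal) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor               # fractional read position
        lo = int(pos)
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        out.append((1 - frac) * signal[lo] + frac * signal[hi])
    return out
```

Training on copies of the corpus perturbed at factors such as 0.9, 1.0, and 1.1 is the standard recipe for this kind of audio-level augmentation.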