The Multilingual TEDx Corpus for Speech Recognition and Translation

@article{Salesky2021TheMT,
  title={The Multilingual TEDx Corpus for Speech Recognition and Translation},
  author={Elizabeth Salesky and Matthew Wiesner and Jacob Bremerman and Roldano Cattoni and Matteo Negri and Marco Turchi and Douglas W. Oard and Matt Post},
  journal={arXiv preprint arXiv:2102.01757},
  year={2021}
}
We present the Multilingual TEDx corpus, built to support speech recognition (ASR) and speech translation (ST) research across many non-English source languages. The corpus is a collection of audio recordings from TEDx talks in 8 source languages. We segment transcripts into sentences and align them to the source-language audio and target-language translations. The corpus is released along with open-sourced code enabling extension to new talks and languages as they become available. Our corpus…
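The segmentation-and-alignment step described in the abstract can be sketched in Python. The data layout below (per-sentence timing entries paired with line-aligned transcript and translation files) is an assumption for illustration only, not the corpus's official release format; the field names `offset` and `duration` are hypothetical.

```python
# Illustrative sketch: pairing per-sentence segment timings with
# line-aligned transcript and translation files. The format is an
# assumption, not the official mTEDx release layout.

segments = [
    {"offset": 0.00, "duration": 3.42},   # seconds into the talk audio
    {"offset": 3.42, "duration": 5.10},
]
transcript = ["Hola a todos.", "Gracias por venir."]        # source language
translation = ["Hello everyone.", "Thank you for coming."]  # target language

def align(segments, transcript, translation):
    """Zip timing entries with the parallel sentence files: the i-th
    segment, transcript line, and translation line belong together."""
    assert len(segments) == len(transcript) == len(translation)
    return [
        {"offset": seg["offset"], "duration": seg["duration"],
         "src": src, "tgt": tgt}
        for seg, src, tgt in zip(segments, transcript, translation)
    ]

pairs = align(segments, transcript, translation)
```

Each resulting entry carries everything an ASR example (audio span + `src`) or an ST example (audio span + `tgt`) needs, which is why sentence-level alignment is the natural unit for both tasks.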

Tables from this paper

Citations

Multilingual Speech Translation with Unified Transformer: Huawei Noah’s Ark Lab at IWSLT 2021
This paper describes the system submitted to the IWSLT 2021 Multilingual Speech Translation (MultiST) task, which achieves significantly better results than bilingual baselines on supervised language pairs and yields reasonable results on zero-shot language pairs.
Highland Puebla Nahuatl Speech Translation Corpus for Endangered Language Documentation
An open-access speech translation corpus of Highland Puebla Nahuatl (glottocode high1278), an endangered language spoken in central Mexico, is presented, and it is observed that state-of-the-art end-to-end ST models can outperform a cascaded ST (ASR > MT) pipeline when translating endangered-language documentation materials.
Pay Better Attention to Attention: Head Selection in Multilingual and Multi-Domain Sequence Modeling
This paper finds that non-selective attention sharing is sub-optimal for achieving good generalization across all languages and domains, and proposes attention sharing strategies to facilitate parameter sharing and specialization in multilingual and multi-domain sequence modeling.
Edinburgh’s End-to-End Multilingual Speech Translation System for IWSLT 2021
Describes Edinburgh's submissions to the IWSLT 2021 multilingual speech translation (ST) task: an end-to-end multilingual ST model based on the Transformer, integrating techniques including adaptive speech feature selection, language-specific modeling, multi-task learning, deep and big Transformer configurations, sparsified linear attention, and root mean square layer normalization.
Multilingual Speech Translation KIT @ IWSLT2021
The main approach is to develop both cascade and end-to-end systems and eventually combine them to achieve the best possible results for this extremely low-resource setting.
FST: the FAIR Speech Translation System for the IWSLT21 Multilingual Shared Task
This paper describes an end-to-end multilingual speech translation system submitted to the IWSLT 2021 evaluation campaign for the Multilingual Speech Translation shared task; experimental results show the system outperforms the reported systems by a large margin.
LeBenchmark: A Reproducible Framework for Assessing Self-Supervised Representation Learning from Speech
Proposes LeBenchmark, a reproducible framework for assessing self-supervised learning (SSL) from speech; experiments show that SSL is beneficial for most but not all tasks, confirming the need for exhaustive and reliable benchmarks to evaluate its real impact.
FINDINGS OF THE IWSLT 2021 EVALUATION CAMPAIGN
This paper describes each shared task, its data and evaluation metrics, and reports results of the received submissions of the IWSLT 2021 evaluation campaign.
Maastricht University’s Multilingual Speech Translation System for IWSLT 2021
Maastricht University's participation in the IWSLT 2021 multilingual speech translation track is described, with an end-to-end model that performs both speech transcription and translation and an ensembling technique that consistently improves the quality of transcripts and translations.
ON-TRAC’ systems for the IWSLT 2021 low-resource speech translation and multilingual speech translation shared tasks
This paper describes the ON-TRAC Consortium translation systems developed for two challenge tracks featured in the Evaluation Campaign of IWSLT 2021: low-resource speech translation and multilingual speech translation.

References

Showing 1–10 of 51 references
CoVoST 2: A Massively Multilingual Speech-to-Text Translation Corpus
CoVoST 2 is released: a large-scale multilingual speech translation corpus covering translations from 21 languages into English and from English into 15 languages, representing the largest open dataset available to date in terms of total volume and language coverage.
Augmenting Librispeech with French Translations: A Multimodal Corpus for Direct Speech Translation Evaluation
This paper augments an existing monolingual corpus, LibriSpeech (derived from read audiobooks from the LibriVox project), with French translations, and shows that the automatic alignment scores are reasonably correlated with human judgments of bilingual alignment quality.
MuST-C: a Multilingual Speech Translation Corpus
MuST-C is created: a multilingual speech translation corpus whose size and quality will facilitate the training of end-to-end systems for SLT from English into 8 languages, together with an empirical verification of its quality and SLT results computed with a state-of-the-art approach for each language direction.
Improved speech-to-text translation with the Fisher and Callhome Spanish-English speech translation corpus
The Fisher and Callhome Spanish-English Speech Translation Corpus is introduced, supplementing existing LDC audio and transcripts with ASR 1-best, lattice, and oracle output produced by the Kaldi recognition system and English translations obtained on Amazon's Mechanical Turk.
Europarl-ST: A Multilingual Corpus for Speech Translation of Parliamentary Debates
A novel multilingual SLT corpus is presented, containing paired audio-text samples for SLT from and into 6 European languages (30 translation directions in total), compiled from the debates held in the European Parliament between 2008 and 2012.
MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible
This article proposes to add multilingual links between speech segments in different languages, and shares a large and clean dataset of 8,130 parallel spoken utterances across 8 languages, named MaSS (Multilingual corpus of Sentence-aligned Spoken utterances).
MuST-C: A multilingual corpus for end-to-end speech translation
MuST-C, a large and freely available Multilingual Speech Translation Corpus built from English TED Talks, is presented, describing the corpus creation methodology and discussing the outcomes of empirical and manual quality evaluations.
An Attentional Model for Speech Translation Without Transcription
On the more challenging speech-to-word alignment task, the model nearly matches GIZA++'s performance on gold transcriptions, but without recourse to transcriptions or to a lexicon.
Multilingual End-to-End Speech Translation
It is experimentally confirmed that multilingual end-to-end ST models significantly outperform bilingual ones in both scenarios, and the generalization of multilingual training is also evaluated in a transfer learning scenario to a very low-resource language pair.
Exploring Phoneme-Level Speech Representations for End-to-End Speech Translation
This work shows that a naive method to create compressed phoneme-like speech representations is far more effective and efficient for translation than traditional frame-level speech features.