Corpus ID: 238856828

SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing

@inproceedings{Ao2022SpeechT5UE,
  title={SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing},
  author={Junyi Ao and Rui Wang and Long Zhou and Shujie Liu and Shuo Ren and Yu Wu and Tom Ko and Qing Li and Yu Zhang and Zhihua Wei and Yao Qian and Jinyu Li and Furu Wei},
  booktitle={ACL},
  year={2022}
}
Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the… 
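As a rough sketch of the layout the abstract describes (a shared encoder-decoder backbone surrounded by six modality-specific pre/post-nets), the following minimal PyTorch example shows how inputs might be routed. The module names, dimensions, and the choice of nn.Transformer as the backbone are illustrative assumptions only, not the authors' released implementation.

```python
# Minimal structural sketch of the framework described above.
# All names and dimensions are illustrative assumptions, not the official code.
import torch
import torch.nn as nn

class SpeechT5Sketch(nn.Module):
    def __init__(self, d_model=768, nhead=12, vocab_size=10000, n_mels=80):
        super().__init__()
        # Shared encoder-decoder backbone used for every task and modality.
        self.backbone = nn.Transformer(d_model=d_model, nhead=nhead, batch_first=True)
        # Six modality-specific nets: encoder pre-nets, decoder pre-nets,
        # and decoder post-nets, one each for speech and text.
        self.speech_encoder_prenet = nn.Linear(n_mels, d_model)       # log-mel frames -> hidden
        self.text_encoder_prenet = nn.Embedding(vocab_size, d_model)  # token ids -> hidden
        self.speech_decoder_prenet = nn.Linear(n_mels, d_model)
        self.text_decoder_prenet = nn.Embedding(vocab_size, d_model)
        self.speech_decoder_postnet = nn.Linear(d_model, n_mels)      # hidden -> spectrogram frames
        self.text_decoder_postnet = nn.Linear(d_model, vocab_size)    # hidden -> token logits

    def forward(self, src, tgt, src_modality="speech", tgt_modality="text"):
        # Route each side through its own pre-net, run the shared backbone,
        # then map decoder states to the output modality with a post-net.
        enc_pre = self.speech_encoder_prenet if src_modality == "speech" else self.text_encoder_prenet
        dec_pre = self.speech_decoder_prenet if tgt_modality == "speech" else self.text_decoder_prenet
        post = self.speech_decoder_postnet if tgt_modality == "speech" else self.text_decoder_postnet
        hidden = self.backbone(enc_pre(src), dec_pre(tgt))
        return post(hidden)

# Example: ASR-style routing (speech in, text out) with dummy tensors.
model = SpeechT5Sketch()
speech = torch.randn(2, 100, 80)                # (batch, frames, mel bins)
text = torch.randint(0, 10000, (2, 20))         # (batch, tokens)
logits = model(speech, text, "speech", "text")  # (2, 20, vocab_size)
```

Only the pre-net and post-net selected at call time change between tasks; the backbone weights are shared across modalities, mirroring the description above.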

Citations

Unified Speech-Text Pre-training for Speech Translation and Recognition
TLDR
Experiments show the proposed method can effectively fuse speech and text information into one model, achieving a 1.7-2.3 BLEU improvement over the state of the art on the MuST-C speech translation dataset and WERs comparable to wav2vec 2.0 on the LibriSpeech speech recognition task.
LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT
TLDR
A Transformer-based supernet nested with thousands of weight-sharing subnets, together with a two-stage distillation strategy that leverages the contextualized latent representations from HuBERT to find the desired architectures automatically by pruning structured parameters.
The YiTrans Speech Translation System for IWSLT 2022 Offline Shared Task
TLDR
Experimental results show that the YiTrans system obtains significant improvements over the strong baseline on three translation directions, achieving +5.2 BLEU over last year's best end-to-end system on tst2021 English-German.
Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data
TLDR
Two pre-training tasks for the encoder-decoder network are introduced, both using acoustic units (pseudo codes) derived from an offline clustering model: one predicts the pseudo codes via masked language modeling on the encoder output, as in HuBERT, while the other lets the decoder learn to reconstruct pseudo codes autoregressively instead of generating textual scripts.
A Complementary Joint Training Approach Using Unpaired Speech and Text for Low-Resource Automatic Speech Recognition
TLDR
This work proposes a complementary joint training (CJT) method that trains a model alternately on two kinds of data pairs; label masking for pseudo-labels and gradient restriction for synthesized audio are further proposed to cope with deviations from real data.
Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation
TLDR
Self-supervised pre-training with unlabeled speech data and data augmentation consistently improve direct speech-to-speech translation models compared with multitask learning, with BLEU gains of 4.3-12.0.
MAESTRO: Matched Speech Text Representations through Modality Matching
TLDR
Maestro is a novel algorithm to learn unified representations from both speech and text modalities simultaneously that can transfer to diverse downstream tasks such as Automated Speech Recognition (ASR) and Speech Translation (ST).
Meta Learning for Natural Language Processing: A Survey
TLDR
The goal with this survey paper is to offer researchers pointers to relevant meta-learning works in NLP and attract more attention from the NLP community to drive future innovation.
Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages
TLDR
Wav2Seq is introduced, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data, and shows comparable performance to highly optimized recent methods on automatic speech recognition (ASR).
data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
TLDR
Data2vec is a framework that uses the same learning method for either speech, NLP or computer vision to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup using a standard Transformer architecture.

References

Showing 1-10 of 84 references
Pretraining Techniques for Sequence-to-Sequence Voice Conversion
TLDR
It is argued that VC models initialized with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
fairseq: A Fast, Extensible Toolkit for Sequence Modeling
TLDR
Fairseq is an open-source sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks and supports distributed training across multiple GPUs and machines.
Attention is All you Need
TLDR
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
TLDR
The Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning, which utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss.
SUPERB: Speech processing Universal PERformance Benchmark
TLDR
A simple framework to solve SUPERB tasks by learning task-specialized lightweight prediction heads on top of the frozen shared model for better re-usability; results demonstrate that the framework is promising, as SSL representations show competitive generalizability and accessibility across SUPERB tasks.
SpeechNet: A Universal Modularized Model for Speech Processing Tasks
TLDR
A universal modularized model, SpeechNet, which contains the five basic modules for speech processing; the code and experimental settings will be released to facilitate research on modularized universal models and multi-task learning of speech processing tasks.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
TLDR
This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech
TLDR
Experimental results show that neural end-to-end TTS models trained on the LibriTTS corpus achieved mean opinion scores above 4.0 for naturalness for five out of six evaluation speakers.