Corpus ID: 238856828

SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language Processing

Junyi Ao, Rui Wang, Long Zhou, Shujie Liu, Shuo Ren, Yu Wu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei
Motivated by the success of T5 (Text-to-Text Transfer Transformer) in pre-training natural language processing models, we propose a unified-modal SpeechT5 framework that explores encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the speech/text input through the pre-nets, the shared encoder-decoder network models the…


PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech Recognition
This work proposes Prune-Adjust-Re-Prune (PARP), which discovers and fine-tunes subnetworks for much better ASR performance while requiring only a single downstream fine-tuning run, and demonstrates the computational advantage and performance gain of PARP over baseline pruning methods.
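The prune-adjust-re-prune loop can be sketched as magnitude pruning followed by a fine-tuning update that is allowed to revive pruned weights, followed by a second pruning pass. This is a minimal NumPy sketch, not PARP's actual implementation; the function names and the single-step `finetune_update` argument are illustrative assumptions.

```python
import numpy as np

def magnitude_prune_mask(weights, sparsity):
    """Return a binary mask keeping the largest-magnitude weights;
    `sparsity` is the fraction of weights to zero out."""
    flat = np.abs(weights).ravel()
    k = int(len(flat) * sparsity)
    if k == 0:
        return np.ones_like(weights)
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    return (np.abs(weights) > threshold).astype(weights.dtype)

def parp_step(weights, sparsity, finetune_update):
    """One PARP-style iteration: Prune, Adjust, Re-Prune.
    The adjust step adds the fine-tuning update to ALL weights, so
    previously pruned weights may re-enter the subnetwork."""
    mask = magnitude_prune_mask(weights, sparsity)              # Prune
    adjusted = weights * mask + finetune_update                 # Adjust
    return adjusted * magnitude_prune_mask(adjusted, sparsity)  # Re-Prune
```

The key design point the summary highlights is that, unlike standard iterative pruning, the adjust step does not freeze the pruning mask, so only one downstream fine-tuning run is needed.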
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing
A new pretrained model, WavLM, is proposed, to solve full-stack downstream speech tasks and achieves state-of-the-art performance on the SUPERB benchmark, and brings significant improvements for various speech processing tasks on their representative benchmarks.
SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding
A novel semi-supervised learning framework, SPLAT, jointly pre-trains the speech and language modules and improves the previous state-of-the-art performance on the Spoken SQuAD dataset by more than 10%.
Speech-Language Pre-Training for End-to-End Spoken Language Understanding
  • Yao Qian, Ximo Bian, +4 authors Michael Zeng
  • Computer Science, Engineering
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
The proposed unified speech-language pre-trained model (SLP) is continually enhanced on limited labeled data from a target domain using a conditional masked language model (MLM) objective, and can thus effectively generate a sequence of intent, slot type, and slot value for a given input speech at inference time.
Pre-Trained Text Embeddings for Enhanced Text-to-Speech Synthesis
It is hypothesized that the text embeddings contain information about the semantics of the phrase and the importance of each word, which should help TTS systems produce more natural prosody and pronunciation.
W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training
Motivated by the success of masked language modeling (MLM) in pre-training natural language processing models, we propose w2v-BERT, which explores MLM for self-supervised speech representation learning.
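The core MLM-style ingredient for speech is span masking over frame sequences: random contiguous spans of frames are hidden and the model is trained to predict them. A minimal sketch of such span masking, assuming zeros stand in for a learned mask embedding; the function name, default probabilities, and span length are illustrative, not w2v-BERT's exact configuration.

```python
import numpy as np

def mask_spans(features, mask_prob=0.065, span_len=10, rng=None):
    """Mask contiguous spans of a (T, D) frame sequence.
    Each frame is chosen as a span start with probability `mask_prob`,
    and the following `span_len` frames are replaced (here by zeros,
    standing in for a learned mask embedding)."""
    rng = np.random.default_rng(0) if rng is None else rng
    T = features.shape[0]
    starts = rng.random(T) < mask_prob
    mask = np.zeros(T, dtype=bool)
    for t in np.flatnonzero(starts):
        mask[t:t + span_len] = True      # spans may overlap and merge
    masked = features.copy()
    masked[mask] = 0.0
    return masked, mask
```

Training then asks the network to reconstruct (or classify discretized targets for) exactly the frames where `mask` is true, analogous to predicting masked tokens in BERT.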
Pretraining Techniques for Sequence-to-Sequence Voice Conversion
It is argued that VC models initialized with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
Machine Speech Chain
To the best of the authors' knowledge, this is the first deep learning framework that integrates human speech perception and production behaviors, and it significantly improves performance over separate systems trained only with labeled data.
Neural Speech Synthesis with Transformer Network
This paper introduces and adapts the multi-head attention mechanism to replace the RNN structures, as well as the original attention mechanism, in Tacotron 2, and achieves state-of-the-art performance with quality close to human speech.
Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech
The proposed Speech2Vec model, a novel deep neural network architecture for learning fixed-length vector representations of audio segments excised from a speech corpus, is based on an RNN encoder-decoder framework and borrows the skip-gram or continuous bag-of-words methodology for training.
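The skip-gram borrowing amounts to pairing each audio segment with its neighbours within a context window, just as word2vec pairs each word with nearby words. A minimal sketch over segment indices, assuming segments are already excised and ordered; the function name and signature are illustrative.

```python
def skipgram_pairs(num_segments, window=2):
    """Generate (center, context) index pairs over an ordered sequence
    of audio segments, skip-gram style: each segment is paired with
    every neighbour within `window` positions."""
    pairs = []
    for i in range(num_segments):
        for j in range(max(0, i - window), min(num_segments, i + window + 1)):
            if j != i:
                pairs.append((i, j))
    return pairs
```

In a Speech2Vec-style setup, the encoder consumes the center segment and the decoder is trained to reconstruct the acoustic features of each paired context segment.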
Hybrid CTC/Attention Architecture for End-to-End Speech Recognition
The proposed hybrid CTC/attention end-to-end ASR is applied to two large-scale ASR benchmarks and exhibits performance comparable to conventional DNN/HMM ASR systems, owing to the advantages of both multi-objective learning and joint decoding without linguistic resources.
SpeechNet: A Universal Modularized Model for Speech Processing Tasks
A universal modularized model, SpeechNet, which contains five basic modules for speech processing; the code and experimental settings will be released to facilitate research on modularized universal models and multi-task learning for speech processing tasks.