On Generative Spoken Language Modeling from Raw Audio

Authors: Kushal Lakhotia, Evgeny Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu Nguyen, Jade Copet, Alexei Baevski, Adel Ben Mohamed, Emmanuel Dupoux
Journal: Transactions of the Association for Computational Linguistics
Abstract: We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder…

Are discrete units necessary for Spoken Language Modeling?

This work shows that discretization is indeed essential for good results in spoken language modeling, and that it removes linguistically irrelevant information from the continuous features, which helps improve language modeling performance.

The Zero Resource Speech Challenge 2021: Spoken Language Modelling

This work provides a pipeline baseline system consisting of an encoder based on contrastive predictive coding (CPC), a quantizer, and a standard language model (BERT or LSTM), and discusses the main results.

The Interspeech Zero Resource Speech Challenge 2021: Spoken language modelling

This work provides a pipeline baseline system consisting of an encoder based on contrastive predictive coding (CPC), a quantizer (k-means), and a standard language model (BERT or LSTM), and discusses the main results.

Text-Free Prosody-Aware Generative Spoken Language Modeling

Experimental results show that the pGSLM can utilize prosody to improve both prosody and content modeling, and also generate natural, meaningful, and coherent speech given a spoken prompt.

Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units

A novel LSTM-based generative speech LM that is inspired by the CBOW model and built on linguistic units including syllables and phonemes is proposed, which offers better acoustic consistency across utterances in the dataset.

An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech Processing Tasks

Experimental results show that the prompt tuning technique achieves competitive performance in speech classification tasks with fewer trainable parameters than fine-tuning specialized downstream models.

Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages

Wav2Seq is introduced, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data, and shows comparable performance to highly optimized recent methods on automatic speech recognition (ASR).

Direct Speech-to-Speech Translation With Discrete Units

A direct speech-to-speech translation model that translates speech from one language to speech in another language without relying on intermediate text generation is presented and is comparable to models that predict spectrograms and are trained with text supervision.

SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing

Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.

Textless Speech-to-Speech Translation on Real Data

To the authors' knowledge, this work is the first to establish a textless S2ST technique that can be trained with real-world data and works for multiple language pairs. It finetunes a pre-trained speech encoder with paired audio from multiple speakers and a single reference speaker to reduce the variations due to accents.



The Zero Resource Speech Benchmark 2021: Metrics and baselines for unsupervised spoken language modeling

We introduce a new unsupervised task, spoken language modeling: the learning of linguistic representations from raw audio signals without any labels, along with the Zero Resource Speech Benchmark

Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining

Experiments on ATIS show that the SLU framework with speech as input can perform on par with those using oracle text as input in semantics understanding, even though environmental noise is present and a limited amount of labeled semantics data is available for training.

Text-Free Image-to-Speech Synthesis Using Learned Segmental Units

This paper connects the image captioning module and the speech synthesis module with a set of discrete, sub-word speech units that are discovered with a self-supervised visual grounding task, and finds that the representation must satisfy several important properties to serve as drop-in replacements for text.

Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks

It is shown that decoupled speaker conditioning intrinsically improves discrete acoustic representations, yielding competitive synthesis quality compared to the challenge baseline.

The Zero Resource Speech Challenge 2019: TTS without T

We present the Zero Resource Speech Challenge 2019, which proposes to build a speech synthesizer without any text or phonetic labels: hence, TTS without T (text-to-speech without text). We provide

Deep Contextualized Acoustic Representations for Semi-Supervised Speech Recognition

This work first exploits a large amount of unlabeled audio data via representation learning, reconstructing a temporal slice of filterbank features from past and future context frames; the resulting representations are then used to train a CTC-based end-to-end ASR system with a smaller amount of labeled audio data.

Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion

An unsupervised end-to-end training scheme is proposed in which discrete subword units are discovered from speech without using any labels; the approach offers strong voice conversion (VC) results because it eliminates speaker identity while preserving the content of the speech.

A Nonparametric Bayesian Approach to Acoustic Model Discovery

An unsupervised model is presented that simultaneously segments the speech, discovers a proper set of sub-word units and learns a Hidden Markov Model for each induced acoustic unit and outperforms a language-mismatched acoustic model.

WaveNet: A Generative Model for Raw Audio

WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.

Vector-quantized neural networks for acoustic unit discovery in the ZeroSpeech 2020 challenge

Two neural models are proposed to tackle the challenge of discrete representations of speech that separate phonetic content from speaker-specific details, using vector quantization to map continuous features to a finite set of codes.
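
The idea of using vector quantization to map continuous features to a finite set of codes can be illustrated with a minimal sketch. It is not taken from any of the papers listed here: the toy two-dimensional codebook, the example frames, and the function names `squared_distance` and `quantize` are all hypothetical, and a real system would learn the codebook (for example with k-means or inside a neural network) from high-dimensional speech features.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each continuous frame to the index of its nearest codebook entry."""
    codes = []
    for frame in frames:
        nearest = min(
            range(len(codebook)),
            key=lambda k: squared_distance(frame, codebook[k]),
        )
        codes.append(nearest)
    return codes


# Hypothetical codebook with three entries (assumed, not learned from data).
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]

# Hypothetical continuous feature frames, one vector per time step.
frames = [(0.1, -0.2), (0.9, 1.2), (1.1, 0.8), (3.7, 0.3)]

codes = quantize(frames, codebook)
print(codes)  # sequence of discrete unit indices, one per frame
```

The resulting sequence of integer indices plays the role of the discrete "pseudo-text" units on which a language model can then be trained.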