Generative Spoken Dialogue Language Modeling

Tu Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Mamdouh Elkahky, Paden Tomasello, Robin Algayres, Benoît Sagot, Abdel-rahman Mohamed, Emmanuel Dupoux
Transactions of the Association for Computational Linguistics

Abstract: We introduce dGSLM, the first “textless” model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention, trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech, laughter, and other paralinguistic signals in the two channels simultaneously and…
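The "dual-tower transformer with cross-attention" can be illustrated with a minimal NumPy sketch: one tower per audio channel, where each layer mixes self-attention over its own channel with cross-attention into the other channel's states. All names, shapes, and weights below are illustrative assumptions; the actual dGSLM model additionally uses discrete HuBERT units, causal masking, multiple heads, and learned projections per layer.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def tower_layer(own, other, W):
    # self-attention over the tower's own channel
    h = own + attention(own @ W["q"], own @ W["k"], own @ W["v"])
    # cross-attention: queries from this channel, keys/values from the other
    h = h + attention(h @ W["cq"], other @ W["ck"], other @ W["cv"])
    return h

rng = np.random.default_rng(0)
d = 16  # toy embedding dimension
W = {name: rng.normal(scale=0.1, size=(d, d))
     for name in ["q", "k", "v", "cq", "ck", "cv"]}

# two channels of unit embeddings (one per speaker), T=8 steps each
a = rng.normal(size=(8, d))
b = rng.normal(size=(8, d))

# one dual-tower layer: each tower attends to itself and to the other channel
a_out, b_out = tower_layer(a, b, W), tower_layer(b, a, W)
print(a_out.shape, b_out.shape)  # (8, 16) (8, 16)
```

Sharing the cross-attention weights across towers (as above) is one design choice; because each tower reads the other channel's states at every layer, the model can coordinate turn-taking and overlapping signals such as laughter across the two speakers.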

Self-Supervised Language Learning From Raw Audio: Lessons From the Zero Resource Speech Challenge

An overview of the six editions of the Zero Resource Speech Challenge series since 2015 is presented, the lessons learned are discussed, and the areas which need more work or give puzzling results are outlined.

Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision

SPEAR-TTS is introduced, a multi-speaker text-to-speech (TTS) system that can be trained with minimal supervision and achieves a character error rate that is competitive with state-of-the-art methods using only 15 minutes of parallel data.

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

A multi-modal AI system named AudioGPT is proposed, which complements LLMs with 1) foundation models to process complex audio information and solve numerous understanding and generation tasks; and 2) the input/output interface (ASR, TTS) to support spoken dialogue.

SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities

SpeechGPT is proposed, a large language model with intrinsic cross-modal conversational abilities, capable of perceiving and generating multi-modal content and highlighting the potential of handling multiple modalities with one model.

AudioGen: Textually Guided Audio Generation

This work proposes AudioGen, an auto-regressive generative model that generates audio samples conditioned on text inputs and outperforms evaluated baselines on both objective and subjective metrics.

Speaking Style Conversion With Discrete Self-Supervised Units

This study introduces a method for converting not only the timbre, but also prosodic information (i.e., rhythm and pitch changes) to those of the target speaker through a pretrained, self-supervised, model for encoding speech to discrete units.

textless-lib: a Library for Textless Spoken Language Processing

This paper introduces textless-lib, a PyTorch-based library aimed at facilitating research in the textless setting; it describes the building blocks the library provides and demonstrates its usability through three different use-case examples.

Audio Language Modeling using Perceptually-Guided Discrete Representations

The quality of samples generated by the method is evaluated on AudioSet, the largest dataset for general audio to date, and is shown to be superior to that of the evaluated baseline audio encoders.

Self-Supervised Speech Representation Learning: A Review

This review presents approaches for self-supervised speech representation learning and their connection to other research areas, and reviews recent efforts on benchmarking learned representations to extend the application beyond speech recognition.

Analysing Discrete Self Supervised Speech Representation for Spoken Language Modeling

Amitay Sicherman, Yossi Adi. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
This work presents an in-depth analysis of discrete self-supervised speech representations (units) through the lens of Generative Spoken Language Modeling (GSLM) and proposes a new, unsupervised metric to measure unit redundancies.

On Generative Spoken Language Modeling from Raw Audio

Generative Spoken Language Modeling is introduced: the task of learning the acoustic and linguistic characteristics of a language from raw audio, along with a set of metrics to automatically evaluate the learned representations at the acoustic and linguistic levels for both encoding and generation.

DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation

It is shown that conversational systems that leverage DialoGPT generate more relevant, contentful and context-consistent responses than strong baseline systems.

Text-Free Prosody-Aware Generative Spoken Language Modeling

Experimental results show that the pGSLM can utilize prosody to improve both prosody and content modeling, and also generate natural, meaningful, and coherent speech given a spoken prompt.

Generative Deep Neural Networks for Dialogue: A Short Review

Recently proposed models based on generative encoder-decoder neural network architectures are reviewed and it is shown that these models have better ability to incorporate long-term dialogue history, to model uncertainty and ambiguity in dialogue, and to generate responses with high-level compositional structure.

TurnGPT: a Transformer-based Language Model for Predicting Turn-taking in Spoken Dialog

This paper introduces TurnGPT, a transformer-based language model for predicting turn-shifts in spoken dialog and explores the model’s potential in not only detecting, but also projecting, turn-completions.

Towards a General, Continuous Model of Turn-taking in Spoken Dialogue using LSTM Recurrent Neural Networks

A predictive, continuous model of turn-taking using Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNN) trained on human-human dialogue data to predict upcoming speech activity in a future time window is presented.

A Neural Conversational Model

A simple approach to conversational modeling is presented which uses the recently proposed sequence-to-sequence framework and is able to extract knowledge from both a domain-specific dataset and a large, noisy, general-domain dataset of movie subtitles.

Neural Dialogue Context Online End-of-Turn Detection

This paper proposes a fully neural-network-based, dialogue-context online end-of-turn detection method that can utilize long-range interactive information extracted from both speakers’ utterances and…

AudioLM: a Language Modeling Approach to Audio Generation

This work uses the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis.