Speaker disentanglement in video-to-speech conversion

Dan Oneaţă, Adriana Stan, Horia Cucu
2021 29th European Signal Processing Conference (EUSIPCO)
The task of video-to-speech aims to translate silent video of lip movement into its corresponding audio signal. Previous approaches to this task are generally limited to the case of a single speaker, but a method that accounts for multiple speakers is desirable because it makes it possible to (i) leverage datasets with multiple speakers or few samples per speaker; and (ii) control speaker identity at inference time. In this paper, we introduce a new video-to-speech architecture and explore ways of extending it…


VCVTS: Multi-Speaker Video-to-Speech Synthesis Via Cross-Modal Knowledge Transfer from Voice Conversion

Though significant progress has been made for speaker-dependent Video-to-Speech (VTS) synthesis, little attention is devoted to multi-speaker VTS that can map silent video to speech, while allowing

SVTS: Scalable Video-to-Speech Synthesis

This work introduces a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder, which converts the mel-frequency spectrograms into waveform audio.



Lipper: Speaker Independent Speech Synthesis Using Multi-View Lipreading

Lipreading is the process of understanding and interpreting speech by observing a speaker’s lip movements. Lipper is a vocabulary- and language-agnostic, speaker-independent, near real-time model that handles a variety of speaker poses.

Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis

This work proposes a novel approach with key design choices to achieve accurate, natural lip to speech synthesis in such unconstrained scenarios for the first time and shows that its method is four times more intelligible than previous works in this space.

Disentangling Correlated Speaker and Noise for Speech Synthesis via Data Augmentation and Adversarial Factorization

Experimental results demonstrate that the proposed method can disentangle speaker and noise attributes even if they are correlated in the training data, and can be used to consistently synthesize clean speech for all speakers.

Video-Driven Speech Reconstruction using Generative Adversarial Networks

This paper presents an end-to-end temporal model capable of directly synthesising audio from silent video, without needing to transform to-and-from intermediate features, based on GANs.

Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed

Initial results confirm the usefulness of exploiting multiple camera views in building an efficient speech reading and reconstruction system and show the optimal placement of cameras, which would lead to the maximum intelligibility of speech.

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.

Deep Voice 2: Multi-Speaker Neural Text-to-Speech

It is shown that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker, while achieving high audio quality synthesis and preserving the speaker identities almost perfectly.

Utterance-level Aggregation for Speaker Recognition in the Wild

This paper proposes a powerful speaker recognition deep network, using a ‘thin-ResNet’ trunk architecture, and a dictionary-based NetVLAD or GhostVLAD layer to aggregate features across time, that can be trained end-to-end.

Vid2speech: Speech reconstruction from silent video

  • A. Ephrat, Shmuel Peleg
  • Computer Science
  • 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2017
It is shown that, by leveraging the automatic feature learning capabilities of a CNN, the model can obtain state-of-the-art word intelligibility on the GRID dataset and shows promising results for learning out-of-vocabulary (OOV) words.

Lip2Audspec: Speech Reconstruction from Silent Lip Movements Video

A deep neural network, trained jointly on different speakers, is able to extract individual speaker characteristics and gives promising results in reconstructing intelligible speech with superior word recognition accuracy.