Speaker disentanglement in video-to-speech conversion
@article{Onea2021SpeakerDI,
  title   = {Speaker disentanglement in video-to-speech conversion},
  author  = {Dan Oneaţă and Adriana Stan and Horia Cucu},
  journal = {2021 29th European Signal Processing Conference (EUSIPCO)},
  year    = {2021},
  pages   = {46-50}
}
The task of video-to-speech aims to translate silent video of lip movements into the corresponding audio signal. Previous approaches to this task are generally limited to the case of a single speaker, but a method that accounts for multiple speakers is desirable, as it makes it possible to (i) leverage datasets with multiple speakers or few samples per speaker; and (ii) control speaker identity at inference time. In this paper, we introduce a new video-to-speech architecture and explore ways of extending it…
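To make the speaker-control idea concrete, here is a minimal PyTorch sketch of a speaker-conditioned video-to-speech model. All module names and dimensions are illustrative assumptions, not the paper's actual architecture: a video encoder extracts content features from lip frames, and a speaker embedding is concatenated at every time step, which is what allows swapping the speaker at inference time.

```python
import torch
import torch.nn as nn

class SpeakerConditionedVTS(nn.Module):
    """Hypothetical model: lip frames + speaker embedding -> mel spectrogram."""
    def __init__(self, n_mels=80, d_content=256, d_speaker=64):
        super().__init__()
        # 3D convolution over (time, height, width) of the lip crops.
        self.video_encoder = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool space, keep time
        )
        self.proj = nn.Linear(64, d_content)
        # The decoder sees content plus speaker identity at every time step.
        self.decoder = nn.GRU(d_content + d_speaker, 256, batch_first=True)
        self.to_mel = nn.Linear(256, n_mels)

    def forward(self, frames, speaker_emb):
        # frames: (B, 3, T, H, W); speaker_emb: (B, d_speaker)
        h = self.video_encoder(frames).squeeze(-1).squeeze(-1)  # (B, 64, T)
        h = self.proj(h.transpose(1, 2))                        # (B, T, d_content)
        spk = speaker_emb.unsqueeze(1).expand(-1, h.size(1), -1)
        out, _ = self.decoder(torch.cat([h, spk], dim=-1))
        return self.to_mel(out)                                 # (B, T, n_mels)

model = SpeakerConditionedVTS()
mel = model(torch.randn(2, 3, 25, 48, 48), torch.randn(2, 64))
print(mel.shape)  # torch.Size([2, 25, 80])
```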
2 Citations
VCVTS: Multi-Speaker Video-to-Speech Synthesis Via Cross-Modal Knowledge Transfer from Voice Conversion
- ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2022
Though significant progress has been made on speaker-dependent Video-to-Speech (VTS) synthesis, little attention has been devoted to multi-speaker VTS, which can map silent video to speech while allowing…
SVTS: Scalable Video-to-Speech Synthesis
- INTERSPEECH
- 2022
This work introduces a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder, which converts the mel-frequency spectrograms into waveform audio.
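The two-stage decomposition can be sketched as follows; the function and stand-in components are hypothetical, not SVTS's actual code. The point of the design is that the spectrogram predictor and the vocoder are independent, so a pre-trained vocoder can be reused or swapped without retraining the predictor.

```python
import torch

def synthesize(frames, predictor, vocoder):
    """frames: (B, 3, T, H, W) silent lip video.

    predictor: any model mapping video to mel spectrograms (B, T', n_mels).
    vocoder:   any pre-trained mel-to-waveform model (B, samples), kept frozen.
    """
    with torch.no_grad():  # both components are fixed at inference time
        mel = predictor(frames)
        return vocoder(mel)

# Dummy stand-ins so the sketch runs end to end (hop size of 256 assumed):
predictor = lambda v: torch.randn(v.shape[0], 50, 80)
vocoder = lambda m: torch.randn(m.shape[0], m.shape[1] * 256)
print(synthesize(torch.randn(1, 3, 25, 48, 48), predictor, vocoder).shape)
```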
References
(showing 1–10 of 24 references)
Lipper: Speaker Independent Speech Synthesis Using Multi-View Lipreading
- AAAI
- 2019
Lipreading is the process of understanding and interpreting speech by observing a speaker’s lip movements. Lipper is a vocabulary- and language-agnostic, speaker-independent, near real-time model that handles a variety of speaker poses.
Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis
- 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
This work proposes a novel approach with key design choices to achieve accurate, natural lip-to-speech synthesis in unconstrained scenarios for the first time, and shows that its method is four times more intelligible than previous works in this space.
Disentangling Correlated Speaker and Noise for Speech Synthesis via Data Augmentation and Adversarial Factorization
- ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
Experimental results demonstrate that the proposed method can disentangle speaker and noise attributes even if they are correlated in the training data, and can be used to consistently synthesize clean speech for all speakers.
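Adversarial factorization of this kind is commonly implemented with a gradient-reversal layer; the sketch below shows that general mechanism (it is an assumption that this matches the paper's exact setup). The attribute classifier trains normally, while the reversed gradient pushes the shared encoder to discard the attribute it is trying to predict.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips (and scales) gradients backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage: encoder features -> gradient reversal -> adversarial classifier.
feats = torch.randn(4, 256, requires_grad=True)   # shared encoder output
clf = torch.nn.Linear(256, 10)                    # e.g. a noise-condition classifier
loss = clf(grad_reverse(feats)).sum()
loss.backward()                                   # feats.grad is sign-flipped
```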
Video-Driven Speech Reconstruction using Generative Adversarial Networks
- INTERSPEECH
- 2019
This paper presents an end-to-end temporal model, based on GANs, capable of directly synthesising audio from silent video without transforming to and from intermediate features.
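A generic adversarial training step for such a video-to-waveform generator might look like the following; the loss and stand-in modules are illustrative assumptions, not the paper's actual critics, which operate on raw waveforms with specific architectures.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, video, real_wav, opt_g, opt_d):
    fake_wav = G(video)

    # Discriminator update: real waveforms -> 1, generated -> 0.
    opt_d.zero_grad()
    real_logits = D(real_wav)
    fake_logits = D(fake_wav.detach())
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    d_loss.backward()
    opt_d.step()

    # Generator update: produce waveforms the discriminator scores as real.
    opt_g.zero_grad()
    g_logits = D(fake_wav)
    g_loss = F.binary_cross_entropy_with_logits(g_logits, torch.ones_like(g_logits))
    g_loss.backward()
    opt_g.step()

# Trivial stand-ins: G maps flattened "video" features to a waveform.
G, D = torch.nn.Linear(100, 16000), torch.nn.Linear(16000, 1)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
gan_step(G, D, torch.randn(2, 100), torch.randn(2, 16000), opt_g, opt_d)
```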
Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed
- ACM Multimedia
- 2018
Initial results confirm the usefulness of exploiting multiple camera views in building an efficient speech reading and reconstruction system, and show the optimal placement of cameras for maximum speech intelligibility.
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
- NeurIPS
- 2018
It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
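The transfer recipe can be summarized in a few lines; the encoder below is a stand-in, not the paper's verification network. The key detail is the unit-norm "d-vector" convention: because speaker embeddings live on a hypersphere, sampling a random unit vector yields a plausible novel voice.

```python
import torch
import torch.nn.functional as F

d = 256
speaker_encoder = torch.nn.GRU(80, d, batch_first=True)  # stand-in, pre-trained
for p in speaker_encoder.parameters():
    p.requires_grad = False          # frozen while the synthesizer trains

def embed(ref_mel):
    """ref_mel: (B, T, 80) reference audio -> unit-norm speaker 'd-vector'."""
    _, h = speaker_encoder(ref_mel)
    return F.normalize(h[-1], dim=-1)                    # (B, d)

spk = embed(torch.randn(2, 120, 80))                     # condition the TTS on this
novel = F.normalize(torch.randn(1, d), dim=-1)           # random point on the sphere
```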
Deep Voice 2: Multi-Speaker Neural Text-to-Speech
- NIPS
- 2017
It is shown that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker, while achieving high audio quality synthesis and preserving the speaker identities almost perfectly.
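In contrast to the transferred embeddings above, this line of work learns a lookup table of per-speaker vectors jointly with the synthesizer. A minimal sketch of that mechanism, with illustrative names and sizes rather than Deep Voice 2's full architecture:

```python
import torch
import torch.nn as nn

n_speakers, d_spk, d_hidden = 108, 16, 256
speaker_table = nn.Embedding(n_speakers, d_spk)  # one trainable vector per speaker
site_proj = nn.Linear(d_spk, d_hidden)           # projection for one injection site

speaker_ids = torch.tensor([3, 7])               # batch of speaker indices
spk = torch.tanh(site_proj(speaker_table(speaker_ids)))  # (2, d_hidden)
# `spk` would then bias or scale hidden states throughout the TTS network.
```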
Utterance-level Aggregation for Speaker Recognition in the Wild
- ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
This paper proposes a powerful speaker recognition deep network that can be trained end-to-end, using a ‘thin-ResNet’ trunk architecture and a dictionary-based NetVLAD or GhostVLAD layer to aggregate features across time.
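A minimal NetVLAD-style temporal aggregation layer looks like the following (a sketch of the general technique; the GhostVLAD variant additionally uses "ghost" clusters that absorb noisy frames and are dropped after assignment). Each frame feature is softly assigned to learned clusters, and the residuals to the cluster centers are summed over time to form a fixed-size utterance embedding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    def __init__(self, dim=512, n_clusters=8):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_clusters, dim) * 0.1)
        self.assign = nn.Linear(dim, n_clusters)

    def forward(self, x):                    # x: (B, T, dim) frame features
        a = F.softmax(self.assign(x), dim=-1)        # soft assignments (B, T, K)
        resid = x.unsqueeze(2) - self.centers        # residuals (B, T, K, dim)
        v = (a.unsqueeze(-1) * resid).sum(dim=1)     # aggregate over time (B, K, dim)
        v = F.normalize(v, dim=-1)                   # intra-normalization
        return F.normalize(v.flatten(1), dim=-1)     # utterance embedding (B, K*dim)

emb = NetVLAD()(torch.randn(2, 100, 512))
print(emb.shape)  # torch.Size([2, 4096])
```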
Vid2speech: Speech reconstruction from silent video
- 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2017
It is shown that by leveraging the automatic feature learning capabilities of a CNN, the model can obtain state-of-the-art word intelligibility on the GRID dataset and shows promising results for learning out-of-vocabulary (OOV) words.
Lip2Audspec: Speech Reconstruction from Silent Lip Movements Video
- 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2018
A deep neural network trained jointly on different speakers is able to extract individual speaker characteristics and gives promising results in reconstructing intelligible speech with superior word recognition accuracy.