Wav2Pix: Speech-conditioned Face Generation Using Generative Adversarial Networks
@article{Duarte2019Wav2PixSF,
  title   = {Wav2Pix: Speech-conditioned Face Generation Using Generative Adversarial Networks},
  author  = {Amanda Duarte and Francisco Roldan and Miquel Tubau and Janna Escur and Santiago Pascual and Amaia Salvador and Eva Mohedano and Kevin McGuinness and Jordi Torres and Xavier Gir{\'o}-i-Nieto},
  journal = {ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year    = {2019},
  pages   = {8633--8637}
}
Speech is a rich biometric signal that contains information about the identity, gender and emotional state of the speaker. Our model is trained in a self-supervised manner by exploiting the audio and visual signals naturally aligned in videos. To enable training from video data, we present a novel dataset collected for this work, containing high-quality videos of youtubers with notable expressiveness in both the speech and visual signals.
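The core idea summarized above is a conditional GAN whose generator is driven by a speech embedding. A minimal sketch of that conditioning scheme follows; the layer sizes, two-layer MLP, and all variable names are illustrative assumptions, not the Wav2Pix architecture itself.

```python
import numpy as np

def speech_conditioned_generator(speech_emb, noise, W1, b1, W2, b2):
    """Minimal sketch of speech-conditioned generation: concatenate a
    speech embedding with a noise vector, then map it through a tiny
    two-layer MLP to a flattened 'image'. Shapes are illustrative."""
    z = np.concatenate([speech_emb, noise])  # conditioning by concatenation
    h = np.tanh(W1 @ z + b1)                 # hidden layer
    img = np.tanh(W2 @ h + b2)               # outputs in [-1, 1], GAN convention
    return img

rng = np.random.default_rng(0)
emb_dim, noise_dim, hid, out = 128, 100, 256, 64 * 64
W1 = rng.normal(0, 0.02, (hid, emb_dim + noise_dim))
b1 = np.zeros(hid)
W2 = rng.normal(0, 0.02, (out, hid))
b2 = np.zeros(out)

speech_emb = rng.normal(size=emb_dim)  # stand-in for a speech encoder's output
noise = rng.normal(size=noise_dim)
fake_face = speech_conditioned_generator(speech_emb, noise, W1, b1, W2, b2)
print(fake_face.shape)  # (4096,) -- a flattened 64x64 image
```

In the actual paper the generator is adversarially trained against a discriminator on audio-video pairs; this sketch only shows how the speech embedding enters the generator.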
46 Citations
Speech2Face: Learning the Face Behind a Voice
- Computer Science2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2019
This paper designs and trains a deep neural network to reconstruct a facial image of a person from a short audio recording of that person speaking, and evaluates and numerically quantifies how these Speech2Face reconstructions, obtained directly from audio, resemble the true face images of the speakers.
Attention-based Residual Speech Portrait Model for Speech to Face Generation
- Computer ScienceArXiv
- 2020
Evaluation on AVSpeech dataset shows that the proposed AR-SPM model accelerates the convergence of training, outperforms the state-of-the-art in terms of quality of the generated face, and achieves superior recognition accuracy of gender and age compared with the ground truth.
Speech2Talking-Face: Inferring and Driving a Face with Synchronized Audio-Visual Representation
- Computer ScienceIJCAI
- 2021
This work aims at not only inferring the face of a person but also animating it in a style-based generative framework and encourages better speech-identity correlation learning while generating vivid faces whose identities are consistent with given speech samples.
Ear2Face: Deep Biometric Modality Mapping
- Computer ScienceArXiv
- 2020
An end-to-end deep neural network model that learns a mapping between biometric modalities is presented, achieving very high cross-modality person identification performance, e.g. 90.9% Rank-10 identification accuracy on the FERET dataset.
Facial expression GAN for voice-driven face generation
- Computer ScienceVis. Comput.
- 2022
This paper proposes a novel facial expression GAN (FE-GAN) that takes emotion and expression into account in face generation; it not only outperforms previous models in terms of FID and IS values but also generates more realistic face images.
Sound-to-Imagination: Unsupervised Crossmodal Translation Using Deep Dense Network Architecture
- Computer ScienceArXiv
- 2021
Though the specified S2I translation problem is quite challenging, the translator model generalized well enough that, on average, more than 14% of the images translated from unknown sounds were interpretable and semantically coherent.
Controlled AutoEncoders to Generate Faces from Voices
- Computer ScienceISVC
- 2020
This paper proposes a framework to morph a target face in response to a given voice such that facial features are implicitly guided by learned voice-face correlations, and evaluates the framework on the VoxCeleb and VGGFace datasets through human-subject studies and face retrieval.
Text2FaceGAN: Face Generation from Fine Grained Textual Descriptions
- Computer Science2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM)
- 2019
This paper generates captions for images in the CelebA dataset with an algorithm that automatically converts a list of attributes into a set of captions, and models the highly multi-modal problem of text-to-face generation as learning the conditional distribution of faces in the same latent space.
Efficient, end-to-end and self-supervised methods for speech processing and generation
- Computer Science
- 2020
This thesis proposes the use of recent pseudo-recurrent structures, such as self-attention models and quasi-recurrent networks, to build acoustic models for text-to-speech, and proposes a problem-agnostic speech encoder, named PASE, a fully convolutional network that yields compact representations from speech waveforms.
Sound-to-Imagination: An Exploratory Study on Unsupervised Crossmodal Translation Using Diverse Audiovisual Data
- Computer Science
- 2021
Despite the complexity of the specified S2I translation task, the model generalized well enough that, on average, more than 14% of the images translated from unknown sounds were interpretable and semantically coherent.
References
SEGAN: Speech Enhancement Generative Adversarial Network
- Computer ScienceINTERSPEECH
- 2017
This work proposes the use of generative adversarial networks for speech enhancement; it operates at the waveform level, trains the model end-to-end, and incorporates 28 speakers and 40 different noise conditions into the same model, so that model parameters are shared across them.
Improved Speech Reconstruction from Silent Video
- Computer Science2017 IEEE International Conference on Computer Vision Workshops (ICCVW)
- 2017
This paper presents an end-to-end model based on a convolutional neural network for generating an intelligible and natural-sounding acoustic speech signal from silent video frames of a speaking person and shows promising results towards reconstructing speech from an unconstrained dictionary.
End-to-End Speech-Driven Facial Animation with Temporal GANs
- Computer ScienceBMVC
- 2018
This work presents a system for generating videos of a talking head from a still image of a person and an audio clip containing speech; it does not rely on any handcrafted intermediate features and is the first method capable of generating subject-independent realistic videos directly from raw audio.
VoxCeleb: A Large-Scale Speaker Identification Dataset
- Computer ScienceINTERSPEECH
- 2017
This paper proposes a fully automated pipeline based on computer vision techniques to create a large-scale, text-independent speaker identification dataset collected 'in the wild', and shows that a CNN-based architecture obtains the best performance for both identification and verification.
You said that?
- Computer ScienceBMVC
- 2017
An encoder-decoder CNN model is proposed that uses a joint embedding of the face and audio to generate synthesised talking-face video frames; results of re-dubbing videos using speech from a different person are also shown.
Deep Cross-Modal Audio-Visual Generation
- Computer ScienceACM Multimedia
- 2017
This paper uses conditional generative adversarial networks to achieve cross-modal audio-visual generation of musical performances and demonstrates that the model has the ability to generate one modality from the other modality, i.e., visual/audio, to a good extent.
Generative Adversarial Text to Image Synthesis
- Computer ScienceICML
- 2016
A novel deep architecture and GAN formulation is developed to effectively bridge advances in text and image modeling, translating visual concepts from characters to pixels.
Audio-driven facial animation by joint end-to-end learning of pose and emotion
- Computer ScienceACM Trans. Graph.
- 2017
This work presents a machine learning technique for driving 3D facial animation by audio input in real time and with low latency, and simultaneously discovers a compact, latent code that disambiguates the variations in facial expression that cannot be explained by the audio alone.
VGGFace2: A Dataset for Recognising Faces across Pose and Age
- Computer Science2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018)
- 2018
A new large-scale face dataset named VGGFace2 is introduced, containing 3.31 million images of 9,131 subjects, with an average of 362.6 images per subject; the automated and manual filtering stages that ensure high accuracy for the images of each identity are also described.
Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
- Computer ScienceICLR
- 2016
This work introduces a class of CNNs called deep convolutional generative adversarial networks (DCGANs), that have certain architectural constraints, and demonstrates that they are a strong candidate for unsupervised learning.
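The DCGAN entry above refers to architectural constraints such as an all-convolutional generator that upsamples with fractionally-strided (transposed) convolutions. The upsampling path can be illustrated by computing feature-map sizes layer by layer; the kernel, stride, and padding settings below are illustrative of a typical DCGAN-style generator, not taken from the paper verbatim.

```python
def deconv_out(size, kernel=4, stride=2, padding=1):
    """Output spatial size of a transposed (fractionally-strided)
    convolution: out = stride * (in - 1) + kernel - 2 * padding."""
    return stride * (size - 1) + kernel - 2 * padding

# A DCGAN-style generator upsamples a small seed feature map to an
# image through a stack of stride-2 transposed convolutions.
size = 4
sizes = [size]
for _ in range(4):  # four upsampling layers (illustrative depth)
    size = deconv_out(size)
    sizes.append(size)
print(sizes)  # [4, 8, 16, 32, 64]
```

Each stride-2 layer doubles the spatial resolution, which is why four such layers take a 4x4 seed to a 64x64 output.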