Wav2Pix: Speech-conditioned Face Generation Using Generative Adversarial Networks

@article{Duarte2019Wav2PixSF,
  title={Wav2Pix: Speech-conditioned Face Generation Using Generative Adversarial Networks},
  author={Amanda Duarte and Francisco Roldan and Miquel Tubau and Janna Escur and Santiago Pascual and Amaia Salvador and Eva Mohedano and Kevin McGuinness and Jordi Torres and Xavier Gir{\'o}-i-Nieto},
  journal={ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2019},
  pages={8633-8637}
}
Speech is a rich biometric signal that contains information about the identity, gender and emotional state of the speaker. […] Our model is trained in a self-supervised manner by exploiting the audio and visual signals naturally aligned in videos. To enable training from video data, we present a novel dataset collected for this work, consisting of high-quality videos of youtubers with notable expressiveness in both the speech and visual signals.
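
A minimal sketch of the conditioning scheme the abstract describes, in PyTorch: a speech encoder embeds the raw waveform, the generator maps that embedding to a face image, and the discriminator scores image-speech pairs. This is an illustration, not the authors' implementation; every module, layer size and name below is an assumption chosen only to make the data flow concrete.

import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    # Strided 1-D convolutions map a raw waveform to a compact embedding.
    def __init__(self, emb_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, 31, stride=4, padding=15), nn.LeakyReLU(0.2),
            nn.Conv1d(16, 32, 31, stride=4, padding=15), nn.LeakyReLU(0.2),
            nn.Conv1d(32, 64, 31, stride=4, padding=15), nn.LeakyReLU(0.2),
        )
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Linear(64, emb_dim)

    def forward(self, wav):                        # wav: (B, 1, T) raw audio
        return self.fc(self.pool(self.conv(wav)).squeeze(-1))

class Generator(nn.Module):
    # Transposed convolutions upsample the speech embedding to a 64x64 face.
    def __init__(self, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(emb_dim, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(True),
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, emb):                        # emb: (B, emb_dim)
        return self.net(emb[:, :, None, None])     # -> (B, 3, 64, 64)

class Discriminator(nn.Module):
    # Scores an image as real/fake conditioned on the speech embedding.
    def __init__(self, emb_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 4, 2, 1), nn.LeakyReLU(0.2),
        )                                          # 64x64 image -> 8x8 feature map
        self.out = nn.Linear(256 * 8 * 8 + emb_dim, 1)

    def forward(self, img, emb):
        return self.out(torch.cat([self.conv(img).flatten(1), emb], dim=1))

# Self-supervised pairing: the waveform and the face frame come from the
# same video clip, so no manual annotation is needed.
enc, G, D = SpeechEncoder(), Generator(), Discriminator()
wav = torch.randn(4, 1, 16384)                     # four 1 s clips at 16 kHz (assumed)
emb = enc(wav)
score = D(G(emb), emb)                             # real/fake score for generated faces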

Citations

Speech2Face: Learning the Face Behind a Voice
TLDR
This paper designs and trains a deep neural network to reconstruct a facial image of a person from a short audio recording of that person speaking, and evaluates and numerically quantifies how these Speech2Face reconstructions, obtained directly from audio, resemble the true face images of the speakers.
Attention-based Residual Speech Portrait Model for Speech to Face Generation
TLDR
Evaluation on the AVSpeech dataset shows that the proposed AR-SPM model accelerates the convergence of training, outperforms the state of the art in terms of the quality of the generated faces, and achieves superior recognition accuracy of gender and age compared with the ground truth.
Speech2Talking-Face: Inferring and Driving a Face with Synchronized Audio-Visual Representation
TLDR
This work aims not only to infer the face of a person but also to animate it in a style-based generative framework, encouraging better speech-identity correlation learning while generating vivid faces whose identities are consistent with the given speech samples.
Ear2Face: Deep Biometric Modality Mapping
TLDR
An end-to-end deep neural network model that learns a mapping between the biometric modalities is presented, achieving a very high cross-modality person identification performance, for example, reaching 90.9% Rank-10 identification accuracy on the FERET dataset.
Facial expression GAN for voice-driven face generation
TLDR
This paper proposes a novel facial expression GAN (FE-GAN) that takes emotions and expressions into account in face generation; it not only outperforms previous models in terms of FID and IS values but also generates more realistic face images.
Sound-to-Imagination: Unsupervised Crossmodal Translation Using Deep Dense Network Architecture
TLDR
Though the specified S2I translation problem is quite challenging, the author was able to generalize the translator model enough that, on average, more than 14% of the images translated from unknown sounds were interpretable and semantically coherent.
Controlled AutoEncoders to Generate Faces from Voices
TLDR
This paper proposes a framework to morph a target face in response to a given voice such that facial features are implicitly guided by learned voice-face correlation, and evaluates the framework on the VoxCeleb and VGGFace datasets through human subjects and face retrieval.
Text2FaceGAN: Face Generation from Fine Grained Textual Descriptions
TLDR
This paper generates captions for images in the CelebA dataset by creating an algorithm to automatically convert a list of attributes to a set of captions, and models the highly multi-modal problem of text-to-face generation as learning the conditional distribution of faces in the same latent space.
Efficient, end-to-end and self-supervised methods for speech processing and generation
TLDR
This thesis proposes the use of recent pseudo-recurrent structures, such as self-attention models and quasi-recurrent networks, to build acoustic models for text-to-speech, and introduces a problem-agnostic speech encoder, named PASE, which is a fully convolutional network that yields compact representations from speech waveforms.
Sound-to-Imagination: An Exploratory Study on Unsupervised Crossmodal Translation Using Diverse Audiovisual Data
TLDR
Despite the complexity of the specified S2I translation task, the model generalized well enough that, on average, more than 14% of the images translated from unknown sounds were interpretable and semantically coherent.

References

SHOWING 1-10 OF 32 REFERENCES
SEGAN: Speech Enhancement Generative Adversarial Network
TLDR
This work proposes the use of generative adversarial networks for speech enhancement, operating at the waveform level and training the model end-to-end; 28 speakers and 40 different noise conditions are incorporated into the same model, so that model parameters are shared across them.
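
As a toy illustration of the waveform-level, end-to-end processing described above: a 1-D convolutional encoder-decoder with a skip connection, the basic shape of an enhancement generator. The layer counts and kernel sizes below are illustrative assumptions; the actual SEGAN generator is far deeper.

import torch
import torch.nn as nn

class WaveEncDec(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: strided 1-D convolutions compress the noisy waveform.
        self.enc1 = nn.Conv1d(1, 16, 32, stride=2, padding=15)
        self.enc2 = nn.Conv1d(16, 32, 32, stride=2, padding=15)
        # Decoder: transposed convolutions reconstruct the clean waveform;
        # the skip connection passes fine temporal detail past the bottleneck.
        self.dec2 = nn.ConvTranspose1d(32, 16, 32, stride=2, padding=15)
        self.dec1 = nn.ConvTranspose1d(32, 1, 32, stride=2, padding=15)

    def forward(self, noisy):                      # noisy: (B, 1, T), T even
        h1 = torch.relu(self.enc1(noisy))          # (B, 16, T/2)
        h2 = torch.relu(self.enc2(h1))             # (B, 32, T/4)
        d2 = torch.relu(self.dec2(h2))             # (B, 16, T/2)
        d1 = self.dec1(torch.cat([d2, h1], dim=1)) # skip connection
        return torch.tanh(d1)                      # enhanced waveform (B, 1, T)

enhanced = WaveEncDec()(torch.randn(2, 1, 16384))  # e.g. 1 s of 16 kHz audio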
Improved Speech Reconstruction from Silent Video
TLDR
This paper presents an end-to-end model based on a convolutional neural network for generating an intelligible and natural-sounding acoustic speech signal from silent video frames of a speaking person and shows promising results towards reconstructing speech from an unconstrained dictionary.
End-to-End Speech-Driven Facial Animation with Temporal GANs
TLDR
This work presents a system for generating videos of a talking head from a still image of a person and an audio clip containing speech; it does not rely on any handcrafted intermediate features and is the first method capable of generating subject-independent realistic videos directly from raw audio.
VoxCeleb: A Large-Scale Speaker Identification Dataset
TLDR
This paper proposes a fully automated pipeline based on computer vision techniques to create a large-scale, text-independent speaker identification dataset collected 'in the wild', and shows that a CNN-based architecture obtains the best performance for both identification and verification.
You said that?
TLDR
An encoder-decoder CNN model is proposed that uses a joint embedding of the face and audio to generate synthesised talking-face video frames, and results of re-dubbing videos using speech from a different person are shown.
Deep Cross-Modal Audio-Visual Generation
TLDR
This paper uses conditional generative adversarial networks to achieve cross-modal audio-visual generation of musical performances, and demonstrates that the model can generate one modality from the other (visual from audio and vice versa) to a good extent.
Generative Adversarial Text to Image Synthesis
TLDR
A novel deep architecture and GAN formulation are developed to effectively bridge advances in text and image modeling, translating visual concepts from characters to pixels.
Audio-driven facial animation by joint end-to-end learning of pose and emotion
TLDR
This work presents a machine learning technique for driving 3D facial animation by audio input in real time and with low latency, and simultaneously discovers a compact, latent code that disambiguates the variations in facial expression that cannot be explained by the audio alone.
VGGFace2: A Dataset for Recognising Faces across Pose and Age
TLDR
A new large-scale face dataset named VGGFace2 is introduced, containing 3.31 million images of 9131 subjects with an average of 362.6 images per subject; the automated and manual filtering stages used to ensure high accuracy for the images of each identity are also described.
Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
TLDR
This work introduces a class of CNNs called deep convolutional generative adversarial networks (DCGANs) that have certain architectural constraints, and demonstrates that they are a strong candidate for unsupervised learning.
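
The architectural constraints mentioned above are concrete: replace pooling with strided (and fractionally strided) convolutions, use batch normalization, remove fully connected hidden layers, and use ReLU activations in the generator with a Tanh output. A minimal generator following those guidelines (sizes are illustrative):

import torch.nn as nn

dcgan_generator = nn.Sequential(
    # Project a 100-d latent code to a 4x4 map with a convolution, not an FC layer.
    nn.ConvTranspose2d(100, 512, 4, 1, 0, bias=False), nn.BatchNorm2d(512), nn.ReLU(True),
    nn.ConvTranspose2d(512, 256, 4, 2, 1, bias=False), nn.BatchNorm2d(256), nn.ReLU(True),  # 8x8
    nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False), nn.BatchNorm2d(128), nn.ReLU(True),  # 16x16
    nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False), nn.BatchNorm2d(64), nn.ReLU(True),    # 32x32
    nn.ConvTranspose2d(64, 3, 4, 2, 1, bias=False), nn.Tanh(),                              # 64x64 RGB
)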