Synthesizing Obama

  • Supasorn Suwajanakorn, Steven M. Seitz, Ira Kemelmacher-Shlizerman
  • ACM Transactions on Graphics (TOG)
  • Pages 1–13
Given audio of President Barack Obama, we synthesize a high quality video of him speaking with accurate lip sync, composited into a target video clip. Trained on many hours of his weekly address footage, a recurrent neural network learns the mapping from raw audio features to mouth shapes. Given the mouth shape at each time instant, we synthesize high quality mouth texture, and composite it with proper 3D pose matching to change what he appears to be saying in a target video to match the input… 
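The audio-to-mouth-shape stage described above can be sketched as a small recurrent model. This is an illustrative sketch, not the authors' code: the feature dimensions, hidden size, and the choice of an LSTM over per-frame audio features (e.g. MFCCs) regressing mouth-shape coefficients (e.g. PCA weights of lip landmarks) are all assumptions.

```python
import torch
import torch.nn as nn

class AudioToMouth(nn.Module):
    """Hypothetical sketch: an LSTM mapping per-frame audio features
    (e.g. MFCCs) to mouth-shape coefficients such as PCA weights of
    lip landmarks. All sizes are illustrative assumptions."""
    def __init__(self, n_audio_feats=28, hidden=60, n_mouth_params=18):
        super().__init__()
        self.lstm = nn.LSTM(n_audio_feats, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_mouth_params)

    def forward(self, audio):       # audio: (batch, time, n_audio_feats)
        h, _ = self.lstm(audio)
        return self.head(h)         # (batch, time, n_mouth_params)

model = AudioToMouth()
clip = torch.randn(2, 100, 28)      # 2 clips of 100 audio frames each
mouth = model(clip)
print(mouth.shape)                  # torch.Size([2, 100, 18])
```

In the full pipeline, the predicted mouth-shape sequence would then drive texture synthesis and compositing into the target video; those stages are not sketched here.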

ObamaNet: Photo-realistic lip-sync from text

The first architecture to generate both audio and synchronized photo-realistic lip-sync video from arbitrary new text is presented; the authors claim it is the first such system composed entirely of trainable neural modules.

Photorealistic Lip Sync with Adversarial Temporal Convolutional Networks

This paper presents a novel lip-sync solution for producing a high-quality and photorealistic talking head from speech and proposes an image-to-image translation-based approach for generating high-resolution photoreal face appearance from synthetic facial maps.

Neural Voice Puppetry: Audio-driven Facial Reenactment

This work presents Neural Voice Puppetry, a novel approach for audio-driven facial video synthesis that generalizes across different people, allowing it to synthesize videos of a target actor with the voice of any unknown source actor or even synthetic voices that can be generated utilizing standard text-to-speech approaches.

Fine-grained talking face generation with video reinterpretation

This work proposes a coarse-to-fine, tree-like architecture for synthesizing realistic talking-face frames directly from audio clips; the generated fine-grained talking-face videos are not only synchronized with the input audio but also preserve visual details from the input face images.

Everybody’s Talkin’: Let Me Talk as You Want

A method that edits target portrait footage by taking an audio sequence as input to synthesize a photo-realistic video; the approach is end-to-end learnable and robust to voice variations in the source audio.

Speech-Driven Facial Reenactment Using Conditional Generative Adversarial Networks

A novel approach to generating photo-realistic images of a face with accurate lip sync from an audio input, using a recurrent neural network and the power of conditional generative adversarial networks to produce a highly realistic face conditioned on a set of landmarks.

You Said That?: Synthesising Talking Faces from Audio

An encoder–decoder convolutional neural network model is developed that uses a joint embedding of the face and audio to generate synthesised talking-face video frames, and methods are proposed to re-dub videos by visually blending the generated face into the source video frames using a multi-stream CNN model.
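The joint-embedding idea above can be sketched as two encoders whose outputs are concatenated and decoded into a frame. This is a minimal sketch under assumed shapes (a 64×64 grayscale identity face, a window of 13 MFCCs over 35 audio frames); the layer sizes and names are hypothetical, not the paper's architecture.

```python
import torch
import torch.nn as nn

class JointEmbeddingTalker(nn.Module):
    """Hypothetical sketch: a still-face encoder and an audio encoder
    produce embeddings that are concatenated and decoded into a
    talking-face frame. All shapes and sizes are assumptions."""
    def __init__(self, emb=128):
        super().__init__()
        self.face_enc = nn.Sequential(       # 1x64x64 grayscale face
            nn.Conv2d(1, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(32 * 16 * 16, emb))
        self.audio_enc = nn.Sequential(      # 13 MFCCs x 35 frames
            nn.Flatten(), nn.Linear(13 * 35, emb), nn.ReLU())
        self.dec = nn.Sequential(
            nn.Linear(2 * emb, 32 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, face, audio):
        z = torch.cat([self.face_enc(face), self.audio_enc(audio)], dim=1)
        return self.dec(z)                   # frame in [0, 1]

net = JointEmbeddingTalker()
frame = net(torch.randn(2, 1, 64, 64), torch.randn(2, 13, 35))
print(frame.shape)                           # torch.Size([2, 1, 64, 64])
```

The design choice worth noting is the single shared bottleneck: conditioning the decoder on both identity (face) and content (audio) embeddings is what lets one network generate frames of the given person saying the given audio.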

HeadGAN: Video-and-Audio-Driven Talking Head Synthesis

HeadGAN is proposed, a novel reenactment approach that conditions synthesis on 3D face representations, which can be extracted from any driving video and adapted to the facial geometry of any source.

Live Speech Portraits

This work presents a live system that generates personalized photorealistic talking-head animation only driven by audio signals at over 30 fps and demonstrates the superiority of the method over state-of-the-art techniques.


This work presents a method that generates expressive talking-head videos from a single facial image with audio as the only input, and first disentangles the content and speaker information in the input audio signal.



Trainable videorealistic speech animation

  • T. Ezzat, G. Geiger, T. Poggio
  • Sixth IEEE International Conference on Automatic Face and Gesture Recognition, 2004. Proceedings.
  • 2004
This work describes how to create, with machine learning techniques, a generative, videorealistic speech animation module that looks like a video camera recording of the subject.

VDub: Modifying Face Video of Actors for Plausible Visual Alignment to a Dubbed Audio Track

This paper builds on high-quality monocular capture of 3D facial performance, lighting, and albedo of the dubbing and target actors, and uses audio analysis in combination with a space-time retrieval method to synthesize a new photo-realistically rendered and highly detailed 3D shape model of the mouth region to replace the target performance.

A deep bidirectional LSTM approach for video-realistic talking head

Experimental results show that the proposed deep bidirectional LSTM (DBLSTM) approach outperforms the existing HMM-based approach in both objective and subjective evaluations.

Synthesizing photo-real talking head via trajectory-guided sample selection

This system renders a smooth and natural video of articulators in sync with given speech signals, and won first place in the Audio-Visual match contest of the LIPS2009 Challenge, as perceptually evaluated by recruited human subjects.

Expressive Visual Text-to-Speech Using Active Appearance Models

This paper presents a complete system for expressive visual text-to-speech (VTTS), which is capable of producing expressive output, in the form of a 'talking head', given an input text and a set of…

What Makes Tom Hanks Look Like Tom Hanks

This work reconstructs a controllable model of a person from a large photo collection that captures his or her persona, i.e., physical appearance and behavior, and shows the ability to drive or puppeteer the captured person B using any other video of a different person A.

Video Rewrite: driving visual speech with audio

Video Rewrite is the first facial-animation system to automate all the labeling and assembly tasks required to resync existing footage to a new soundtrack.

A new language independent, photo-realistic talking head driven by voice only

Subjective experiments show that lip motions thus rendered for 15 non-English languages are highly synchronized with the audio input and photo-realistic to human eyes perceptually.

Photo-real talking head with deep bidirectional LSTM

This paper proposes using a deep bidirectional LSTM (BLSTM) for audio/visual modeling in the authors' photo-real talking-head system; on their datasets, the best network consists of two BLSTM layers sitting on top of one feed-forward layer.
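The reported topology (one feed-forward layer followed by two bidirectional LSTM layers, regressing visual parameters from audio features) can be sketched as below. This is a hedged sketch of that topology only: the feature dimensions, hidden sizes, and class names are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class BLSTMTalkingHead(nn.Module):
    """Hypothetical sketch of the reported best topology: one
    feed-forward layer, then two bidirectional LSTM layers, regressing
    visual parameters (e.g. lip PCA coefficients) from audio features.
    All layer sizes are illustrative assumptions."""
    def __init__(self, n_audio=39, ff=256, hidden=128, n_visual=32):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(n_audio, ff), nn.ReLU())
        self.blstm = nn.LSTM(ff, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_visual)  # 2x for both directions

    def forward(self, x):               # x: (batch, time, n_audio)
        h, _ = self.blstm(self.ff(x))
        return self.out(h)              # (batch, time, n_visual)

net = BLSTMTalkingHead()
y = net(torch.randn(1, 50, 39))         # one utterance, 50 audio frames
print(y.shape)                          # torch.Size([1, 50, 32])
```

The bidirectional layers mean each output frame can depend on both past and future audio context, which is why such models suit offline rendering rather than live, low-latency use.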

Talking heads synthesis from audio with deep neural networks

The proposed method uses lower-level audio features than phonemes, enabling the synthesis of talking heads with expressions, whereas existing approaches that use phonemes as audio features can synthesize only neutral-expression talking heads.