Audio-Driven Emotional Video Portraits

@inproceedings{ji2021audio,
  title={Audio-Driven Emotional Video Portraits},
  author={Xinya Ji and Hang Zhou and Kaisiyuan Wang and Wayne Wu and Chen Change Loy and Xun Cao and Feng Xu},
  booktitle={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021}
}
  • Published 15 April 2021
Despite previous success in generating audio-driven talking heads, most prior studies focus on the correlation between speech content and mouth shape. Facial emotion, one of the most important features of natural human faces, is largely neglected in these methods. In this work, we present Emotional Video Portraits (EVP), a system for synthesizing high-quality video portraits with vivid emotional dynamics driven by audio. Specifically, we propose the Cross-Reconstructed…


FaceFormer: Speech-Driven 3D Facial Animation with Transformers
FaceFormer, a Transformer-based autoregressive model, is proposed that encodes long-term audio context and autoregressively predicts a sequence of animated 3D face meshes. It devises two biased attention mechanisms well suited to this task: biased cross-modal multi-head (MH) attention, and biased causal MH self-attention with a periodic positional encoding strategy.
Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation
This work presents a live system that generates personalized photorealistic talking-head animation only driven by audio signals at over 30 fps and synthesizes high-fidelity personalized facial details, e.g., wrinkles, teeth.
Neural Emotion Director: Speech-preserving semantic control of facial expressions in "in-the-wild" videos
This method is the first to be capable of controlling the actor’s facial expressions by even using as a sole input the semantic labels of the manipulated emotions, while at the same time preserving the speech-related lip movements.
Neural Relighting and Expression Transfer On Video Portraits
A neural relighting and expression transfer technique to transfer the head pose and facial expressions from a source performer to a portrait video of a target performer while enabling dynamic relighting.
One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning
An Audio-Visual Correlation Transformer (AVCT) is developed that infers talking motions, represented by keypoint-based dense motion fields, from input audio, and can inherently generalize to audio spoken by other identities.
Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis
This paper systematically investigates talking styles with the collected Ted-HD dataset and constructs style codes as several statistics of 3D morphable model (3DMM) parameters, and devise a latent-style-fusion (LSF) model to synthesize stylized talking faces by imitating talking styles from the style codes.
Relightable Neural Video Portrait
A relightable neural video portrait scheme for simultaneous relighting and reenactment that transfers the head pose and facial expressions from a source actor to a portrait video of a target actor.
AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis
Experimental results demonstrate that the novel framework can produce high-fidelity and natural results, and support free adjustment of audio signals, viewing directions, and background images.
Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation
This paper proposes a clean yet effective framework to generate talking faces whose poses are controllable by other videos; it offers multiple advanced capabilities, including extreme-view robustness and talking-face frontalization.
MEAD: A Large-Scale Audio-Visual Dataset for Emotional Talking-Face Generation
The Multi-view Emotional Audio-visual Dataset (MEAD) is built: a talking-face video corpus featuring 60 actors and actresses speaking with eight different emotions at three intensity levels. It could benefit a number of research fields, including conditional generation, cross-modal understanding, and expression recognition.
Realistic Speech-Driven Facial Animation with GANs
This work presents an end-to-end system that generates videos of a talking head, using only a still image of a person and an audio clip containing speech, without relying on handcrafted intermediate features.
Audio-driven facial animation by joint end-to-end learning of pose and emotion
This work presents a machine learning technique for driving 3D facial animation by audio input in real time and with low latency, and simultaneously discovers a compact, latent code that disambiguates the variations in facial expression that cannot be explained by the audio alone.
MakeItTalk: Speaker-Aware Talking Head Animation
A method that generates expressive talking heads from a single facial image with audio as the only input. It synthesizes photorealistic videos of entire talking heads with a full range of motion, and can also animate artistic paintings, sketches, 2D cartoon characters, Japanese manga, and stylized caricatures in a single unified framework.
Animating Face using Disentangled Audio Representations
This work proposes an explicit audio representation learning framework that disentangles audio sequences into factors such as phonetic content, emotional tone, and background noise. It demonstrates that, when conditioned on the disentangled content representation, the model's generated mouth movement is significantly more accurate than previous approaches in the presence of noise and emotional variations.
Deep video portraits
The first method to transfer the full 3D head position, head rotation, facial expression, eye gaze, and eye blinking from a source actor to a portrait video of a target actor using only an input video.
Hierarchical Cross-Modal Talking Face Generation With Dynamic Pixel-Wise Loss
A cascade GAN approach to generate talking face video, which is robust to different face shapes, view angles, facial characteristics, and noisy audio conditions, and compared to a direct audio-to-image approach, this approach avoids fitting spurious correlations between audiovisual signals that are irrelevant to the speech content.
Neural Voice Puppetry: Audio-driven Facial Reenactment
This work presents Neural Voice Puppetry, a novel approach for audio-driven facial video synthesis that generalizes across different people, allowing it to synthesize videos of a target actor with the voice of any unknown source actor or even synthetic voices that can be generated utilizing standard text-to-speech approaches.
Arbitrary Talking Face Generation via Attentional Audio-Visual Coherence Learning
A novel arbitrary talking-face generation framework is proposed that discovers audio-visual coherence via an Asymmetric Mutual Information Estimator (AMIE), together with a Dynamic Attention (DA) block that selectively focuses on the lip area of the input image during training to further enhance lip synchronization.
Talking Face Generation by Adversarially Disentangled Audio-Visual Representation
This work finds that a talking-face sequence is actually a composition of both subject-related information and speech-related information, and learns a disentangled audio-visual representation, which has the advantage that both audio and video can serve as inputs for generation.