Audio-Driven Emotional Video Portraits

@article{Ji2021AudioDrivenEV,
  title={Audio-Driven Emotional Video Portraits},
  author={Xinya Ji and Hang Zhou and Kaisiyuan Wang and Wayne Wu and Chen Change Loy and Xun Cao and Feng Xu},
  journal={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021},
  pages={14075-14084}
}
  • Xinya Ji, Hang Zhou, Kaisiyuan Wang, Wayne Wu, Chen Change Loy, Xun Cao, Feng Xu
  • Published 15 April 2021
  • Computer Science
  • 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Despite previous success in generating audio-driven talking heads, most prior studies focus on the correlation between speech content and mouth shape. Facial emotion, one of the most important features of natural human faces, is largely neglected by these methods. In this work, we present Emotional Video Portraits (EVP), a system for synthesizing high-quality video portraits with vivid emotional dynamics driven by audio. Specifically, we propose the Cross-Reconstructed…
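The abstract is cut off at the name of the disentanglement module, but the idea it points to, factoring the driving audio into a speech-content component and an emotion component via cross-reconstruction, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation; the feature dimensions, the MLP encoders, and the pairing scheme are all assumptions.

```python
# Hedged sketch: two encoders factor pooled audio features into a content code
# and an emotion code; a decoder must rebuild a clip whose content comes from
# one input and whose emotion comes from another.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

content_enc = mlp(80, 128)   # intended to capture "what is said"
emotion_enc = mlp(80, 128)   # intended to capture "how it is said"
decoder     = mlp(256, 80)   # rebuilds audio features from the two codes

def cross_recon_loss(x_c1e1, x_c2e2, x_c1e2):
    """x_c1e1: features with content 1 / emotion 1; x_c2e2: content 2 / emotion 2;
    x_c1e2: ground-truth features with content 1 / emotion 2 (available when the
    same sentences are recorded under several emotions)."""
    z = torch.cat([content_enc(x_c1e1), emotion_enc(x_c2e2)], dim=-1)
    return F.mse_loss(decoder(z), x_c1e2)

# Toy usage with random 80-dim pooled audio features.
a, b, target = (torch.randn(4, 80) for _ in range(3))
cross_recon_loss(a, b, target).backward()
```

Swapping the roles of the two inputs gives the symmetric reconstruction term; the learned emotion code is what would allow emotional dynamics to be controlled independently of the spoken content.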

Citations

FaceFormer: Speech-Driven 3D Facial Animation with Transformers
TLDR
A Transformer-based autoregressive model, FaceFormer, is proposed that encodes long-term audio context and autoregressively predicts a sequence of animated 3D face meshes; two biased attention mechanisms well suited to this task are devised, namely the biased cross-modal multi-head (MH) attention and the biased causal MH self-attention with a periodic positional encoding strategy.
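As a rough, hedged illustration of the "periodic positional encoding" mentioned in the summary above (not FaceFormer's exact formulation), the standard sinusoidal encoding can be wrapped around a fixed period so that the positional signal repeats every p frames; the period and dimensions below are arbitrary choices.

```python
# Illustrative periodic sinusoidal positional encoding: positions wrap every
# `period` frames before the usual sine/cosine encoding is applied.
import numpy as np

def periodic_positional_encoding(num_frames, d_model, period=30):
    pos = np.arange(num_frames)[:, None] % period      # wrap frame indices every `period` frames
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    enc = np.zeros((num_frames, d_model))
    enc[:, 0::2] = np.sin(angle[:, 0::2])
    enc[:, 1::2] = np.cos(angle[:, 1::2])
    return enc

print(periodic_positional_encoding(num_frames=120, d_model=64).shape)  # (120, 64)
```

In the paper this encoding is combined with biased causal self-attention over past motion frames; the snippet only shows the positional wrapping itself.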
Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation
TLDR
This work presents a live system that generates personalized photorealistic talking-head animation only driven by audio signals at over 30 fps and synthesizes high-fidelity personalized facial details, e.g., wrinkles, teeth.
Neural Emotion Director: Speech-preserving semantic control of facial expressions in "in-the-wild" videos
TLDR
This method is the first capable of controlling the actor’s facial expressions using only the semantic labels of the manipulated emotions as input, while at the same time preserving the speech-related lip movements.
Neural Relighting and Expression Transfer On Video Portraits
TLDR
A neural relighting and expression transfer technique is presented that transfers the head pose and facial expressions from a source performer to a portrait video of a target performer while enabling dynamic relighting.
One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning
TLDR
An Audio-Visual Correlation Transformer (AVCT) is developed that aims to infer talking motions, represented by keypoint-based dense motion fields, from input audio, and can inherently generalize to audio spoken by other identities.
Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis
TLDR
This paper systematically investigates talking styles with the collected Ted-HD dataset, constructs style codes as several statistics of 3D morphable model (3DMM) parameters, and devises a latent-style-fusion (LSF) model to synthesize stylized talking faces by imitating talking styles from the style codes.
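To make the phrase "style codes as several statistics of 3DMM parameters" concrete, a per-clip style vector could be assembled from simple statistics of the expression-parameter sequence, as in the hedged sketch below; the particular statistics and dimensions are illustrative assumptions, not necessarily the ones used in the paper.

```python
# Hedged sketch: a fixed-length style code built from per-clip statistics of
# a 3DMM expression-parameter sequence.
import numpy as np

def style_code(expr_params):
    """expr_params: (num_frames, num_expression_dims) 3DMM expression sequence for one clip."""
    mean = expr_params.mean(axis=0)
    std = expr_params.std(axis=0)
    delta = np.abs(np.diff(expr_params, axis=0)).mean(axis=0)  # average frame-to-frame change
    return np.concatenate([mean, std, delta])

clip = np.random.randn(150, 64)   # toy stand-in: 150 frames of 64-dim expression parameters
print(style_code(clip).shape)     # (192,)
```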
Relightable Neural Video Portrait
Figure 1 (caption excerpt): We introduce a relightable neural video portrait scheme for simultaneous relighting and reenactment that transfers the head pose and facial expressions from a source actor to a portrait…
AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis
TLDR
Experimental results demonstrate that the novel framework can produce high-fidelity and natural results, and support free adjustment of audio signals, viewing directions, and background images.
Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation
TLDR
This paper proposes a clean yet effective framework to generate pose-controllable talking faces, where the head pose can be specified by another video, and supports multiple advanced capabilities including extreme view robustness and talking face frontalization.
...

References

SHOWING 1-10 OF 52 REFERENCES
MEAD: A Large-Scale Audio-Visual Dataset for Emotional Talking-Face Generation
TLDR
The Multi-view Emotional Audio-visual Dataset (MEAD) is built: a talking-face video corpus featuring 60 actors and actresses speaking with eight different emotions at three intensity levels, which could benefit a number of research fields including conditional generation, cross-modal understanding, and expression recognition.
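For a sense of how such a corpus (60 actors, eight emotions, three intensity levels) might be traversed in practice, the sketch below indexes clips by actor, emotion, and intensity; the directory layout, emotion names, and file extension are hypothetical rather than MEAD's actual on-disk structure.

```python
# Hedged sketch of indexing an emotion/intensity-organized talking-face corpus.
from itertools import product
from pathlib import Path

# Assumed label set and naming; the real corpus layout may differ.
EMOTIONS = ["neutral", "angry", "contempt", "disgusted",
            "fear", "happy", "sad", "surprised"]
INTENSITIES = ["level_1", "level_2", "level_3"]

def index_corpus(root):
    """Yield (actor, emotion, intensity, clip_path) for every clip found under root."""
    for actor_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for emotion, intensity in product(EMOTIONS, INTENSITIES):
            for clip in sorted((actor_dir / emotion / intensity).glob("*.mp4")):
                yield actor_dir.name, emotion, intensity, clip
```

A training-time sampler that balances emotions and intensity levels can then be built on top of this index.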
Realistic Speech-Driven Facial Animation with GANs
TLDR
This work presents an end-to-end system that generates videos of a talking head, using only a still image of a person and an audio clip containing speech, without relying on handcrafted intermediate features.
Audio-driven facial animation by joint end-to-end learning of pose and emotion
TLDR
This work presents a machine learning technique for driving 3D facial animation by audio input in real time and with low latency, and simultaneously discovers a compact, latent code that disambiguates the variations in facial expression that cannot be explained by the audio alone.
MakeItTalk: Speaker-Aware Talking Head Animation
TLDR
A method that generates expressive talking heads from a single facial image with audio as the only input, able to synthesize photorealistic videos of entire talking heads with a full range of motion and also to animate artistic paintings, sketches, 2D cartoon characters, Japanese manga, and stylized caricatures in a single unified framework.
Animating Face using Disentangled Audio Representations
TLDR
This work proposes an explicit audio representation learning framework that disentangles audio sequences into factors such as phonetic content, emotional tone, and background noise, and demonstrates that, when conditioned on the disentangled content representation, the generated mouth movement is significantly more accurate than previous approaches in the presence of noise and emotional variations.
Deep video portraits
TLDR
The first method to transfer the full 3D head position, head rotation, facial expression, eye gaze, and eye blinking from a source actor to a portrait video of a target actor using only an input video is presented.
Hierarchical Cross-Modal Talking Face Generation With Dynamic Pixel-Wise Loss
TLDR
A cascade GAN approach is proposed to generate talking-face video that is robust to different face shapes, view angles, facial characteristics, and noisy audio conditions; compared to a direct audio-to-image approach, it avoids fitting spurious correlations between audiovisual signals that are irrelevant to the speech content.
Neural Voice Puppetry: Audio-driven Facial Reenactment
TLDR
This work presents Neural Voice Puppetry, a novel approach for audio-driven facial video synthesis that generalizes across different people, allowing it to synthesize videos of a target actor with the voice of any unknown source actor or even synthetic voices that can be generated utilizing standard text-to-speech approaches.
Arbitrary Talking Face Generation via Attentional Audio-Visual Coherence Learning
TLDR
A novel arbitrary talking face generation framework is proposed that discovers audio-visual coherence via an Asymmetric Mutual Information Estimator (AMIE), together with a Dynamic Attention (DA) block that selectively focuses on the lip area of the input image during training to further enhance lip synchronization.
Talking Face Generation by Adversarially Disentangled Audio-Visual Representation
TLDR
This work finds that the talking face sequence is actually a composition of both subject-related information and speech-related information, and learns a disentangled audio-visual representation, which has the advantage that both audio and video can serve as inputs for generation.
...