Audio-Driven Emotional Video Portraits
@article{Ji2021AudioDrivenEV, title={Audio-Driven Emotional Video Portraits}, author={Xinya Ji and Hang Zhou and Kaisiyuan Wang and Wayne Wu and Chen Change Loy and Xun Cao and Feng Xu}, journal={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, year={2021}, pages={14075-14084} }
Despite previous success in generating audio-driven talking heads, most prior studies focus on the correlation between speech content and mouth shape. Facial emotion, one of the most important features of natural human faces, is largely neglected in their methods. In this work, we present Emotional Video Portraits (EVP), a system for synthesizing high-quality video portraits with vivid emotional dynamics driven by audio. Specifically, we propose the Cross-Reconstructed…
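The abstract is truncated above, but the named Cross-Reconstructed technique appears to refer to disentangling a speech clip into a decoupled emotion code and content code by swapping codes between paired clips and reconstructing the corresponding targets. Below is a minimal, hypothetical sketch of that cross-reconstruction idea in PyTorch; the module names and architecture are placeholders, not the authors' actual design.

```python
# Hypothetical sketch of cross-reconstruction-style disentanglement (PyTorch).
# Encoders, decoder, and loss are placeholders, not the EVP architecture.
import torch
import torch.nn as nn

class Disentangler(nn.Module):
    def __init__(self, feat_dim=80, code_dim=128):
        super().__init__()
        # One placeholder encoder per decoupled space.
        self.content_enc = nn.GRU(feat_dim, code_dim, batch_first=True)
        self.emotion_enc = nn.GRU(feat_dim, code_dim, batch_first=True)
        self.decoder = nn.Linear(2 * code_dim, feat_dim)

    def encode(self, x):                 # x: (B, T, feat_dim)
        c, _ = self.content_enc(x)       # frame-level content codes (B, T, code_dim)
        _, e = self.emotion_enc(x)       # clip-level emotion code
        return c, e[-1]                  # (B, T, code_dim), (B, code_dim)

    def decode(self, c, e):
        e = e.unsqueeze(1).expand(-1, c.size(1), -1)
        return self.decoder(torch.cat([c, e], dim=-1))

def cross_reconstruction_loss(model, x_a, x_b, x_ab, x_ba):
    # x_ab / x_ba: ground-truth clips pairing the content of one clip with
    # the emotion of the other; swapping codes should reconstruct them.
    c_a, e_a = model.encode(x_a)
    c_b, e_b = model.encode(x_b)
    loss = nn.functional.l1_loss(model.decode(c_a, e_b), x_ab)
    return loss + nn.functional.l1_loss(model.decode(c_b, e_a), x_ba)
```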
29 Citations
FaceFormer: Speech-Driven 3D Facial Animation with Transformers
- Computer Science · ArXiv
- 2021
A Transformer-based autoregressive model, FaceFormer, is proposed that encodes long-term audio context and autoregressively predicts a sequence of animated 3D face meshes. Two biased attention mechanisms suited to this task are devised: biased cross-modal multi-head (MH) attention and biased causal MH self-attention with a periodic positional encoding strategy.
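The two mechanisms named in this summary can be illustrated briefly. Below is a simplified, hypothetical sketch of a periodic positional encoding and an ALiBi-style biased causal attention mask in PyTorch; the exact bias shape and period used in FaceFormer may differ.

```python
# Simplified sketch of a periodic positional encoding plus a biased causal
# attention mask. Bias form and period are illustrative assumptions only.
import torch

def periodic_positional_encoding(t_len, d_model, period=25):
    # Standard sinusoidal encoding, but positions repeat every `period`
    # frames (assumes even d_model).
    pos = torch.arange(t_len) % period
    i = torch.arange(0, d_model, 2)
    angle = pos[:, None] / torch.pow(10000.0, i[None, :] / d_model)
    pe = torch.zeros(t_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe                                      # (t_len, d_model)

def biased_causal_mask(t_len, period=25):
    # Additive bias that decays with the period-quantized distance between
    # query and key frames; future frames are masked out entirely.
    q = torch.arange(t_len)[:, None]
    k = torch.arange(t_len)[None, :]
    bias = -((q - k) // period).clamp(min=0).float()
    return bias.masked_fill(k > q, float("-inf"))  # add to attention logits

# Usage: add the mask to raw attention scores before the softmax.
scores = torch.randn(8, 8) + biased_causal_mask(8)
attn = torch.softmax(scores, dim=-1)
```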
Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation
- Computer Science · ArXiv
- 2021
This work presents a live system that generates personalized photorealistic talking-head animation driven only by audio signals at over 30 fps and synthesizes high-fidelity personalized facial details, e.g., wrinkles and teeth.
Neural Emotion Director: Speech-preserving semantic control of facial expressions in "in-the-wild" videos
- Computer Science · ArXiv
- 2021
This method is the first capable of controlling the actor's facial expressions even when using only the semantic labels of the manipulated emotions as input, while at the same time preserving the speech-related lip movements.
Live Speech Portraits
- Computer Science · ACM Transactions on Graphics
- 2021
This work presents a live system that generates personalized photorealistic talking-head animation driven only by audio signals at over 30 fps and demonstrates the superiority of the method over state-of-the-art techniques.
Neural Relighting and Expression Transfer On Video Portraits
- Computer Science
- 2021
A neural relighting and expression transfer technique to transfer the head pose and facial expressions from a source performer to a portrait video of a target performer while enabling dynamic relighting.
One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning
- Computer Science · AAAI
- 2022
An Audio-Visual Correlation Transformer (AVCT) is developed that infers talking motions, represented as keypoint-based dense motion fields, from input audio and inherently generalizes to audio spoken by other identities.
Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis
- Computer Science · ACM Multimedia
- 2021
This paper systematically investigates talking styles with the collected Ted-HD dataset, constructs style codes as several statistics of 3D morphable model (3DMM) parameters, and devises a latent-style-fusion (LSF) model to synthesize stylized talking faces by imitating talking styles from the style codes.
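The style-code construction described here is a statistic computed over time. The sketch below shows one plausible instance, assuming mean and standard deviation over per-frame 3DMM expression coefficients; which parameters and statistics the paper actually uses are not specified in this summary.

```python
# Hypothetical style code as temporal statistics of 3DMM expression
# parameters. Mean/std are assumed; the paper's exact statistics may differ.
import numpy as np

def style_code(expr_params):
    # expr_params: (T, D) sequence of per-frame 3DMM expression coefficients.
    mean = expr_params.mean(axis=0)
    std = expr_params.std(axis=0)
    return np.concatenate([mean, std])        # (2 * D,) style code

code = style_code(np.random.randn(100, 64))   # e.g. 100 frames, 64 coefficients
```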
Relightable Neural Video Portrait
- Art · ArXiv
- 2021
A relightable neural video portrait scheme for simultaneous relighting and reenactment that transfers the head pose and facial expressions from a source actor to a portrait…
AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis
- Computer Science · 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2021
Experimental results demonstrate that the novel framework can produce high-fidelity and natural results, and support free adjustment of audio signals, viewing directions, and background images.
Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation
- Computer Science · 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2021
This paper proposes a clean yet effective framework to generate talking faces whose poses can be controlled by other videos; it also has multiple advanced capabilities, including extreme view robustness and talking-face frontalization.
References
Showing 1-10 of 52 references
MEAD: A Large-Scale Audio-Visual Dataset for Emotional Talking-Face Generation
- Computer Science · ECCV
- 2020
The Multi-view Emotional Audio-visual Dataset (MEAD) is built, a talking-face video corpus featuring 60 actors and actresses talking with eight different emotions at three different intensity levels, which could benefit a number of research fields including conditional generation, cross-modal understanding, and expression recognition.
Realistic Speech-Driven Facial Animation with GANs
- Computer Science · International Journal of Computer Vision
- 2019
This work presents an end-to-end system that generates videos of a talking head, using only a still image of a person and an audio clip containing speech, without relying on handcrafted intermediate features.
Audio-driven facial animation by joint end-to-end learning of pose and emotion
- Computer Science · ACM Trans. Graph.
- 2017
This work presents a machine learning technique for driving 3D facial animation by audio input in real time and with low latency, and simultaneously discovers a compact, latent code that disambiguates the variations in facial expression that cannot be explained by the audio alone.
MakeItTalk: Speaker-Aware Talking Head Animation
- Computer Science · ACM Trans. Graph.
- 2020
A method that generates expressive talking heads from a single facial image with audio as the only input; it is able to synthesize photorealistic videos of entire talking heads with a full range of motion, and can also animate artistic paintings, sketches, 2D cartoon characters, Japanese manga, and stylized caricatures in a single unified framework.
Animating Face using Disentangled Audio Representations
- Computer Science · 2020 IEEE Winter Conference on Applications of Computer Vision (WACV)
- 2020
This work proposes an explicit audio representation learning framework that disentangles audio sequences into factors such as phonetic content, emotional tone, and background noise, and demonstrates that, when conditioned on the disentangled content representation, the mouth movements generated by the model are significantly more accurate than those of previous approaches in the presence of noise and emotional variations.
Deep video portraits
- Computer Science · ACM Trans. Graph.
- 2018
The first method to transfer the full 3D head position, head rotation, face expression, eye gaze, and eye blinking from a source actor to a portrait video of a target actor using only an input video is presented.
Hierarchical Cross-Modal Talking Face Generation With Dynamic Pixel-Wise Loss
- Computer Science · 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2019
A cascaded GAN approach to generating talking-face video that is robust to different face shapes, view angles, facial characteristics, and noisy audio conditions; compared to a direct audio-to-image approach, it avoids fitting spurious correlations between audiovisual signals that are irrelevant to the speech content.
Neural Voice Puppetry: Audio-driven Facial Reenactment
- Computer Science · ECCV
- 2020
This work presents Neural Voice Puppetry, a novel approach for audio-driven facial video synthesis that generalizes across different people, allowing it to synthesize videos of a target actor with the voice of any unknown source actor or even synthetic voices that can be generated utilizing standard text-to-speech approaches.
Arbitrary Talking Face Generation via Attentional Audio-Visual Coherence Learning
- Computer Science · IJCAI
- 2020
A novel arbitrary talking face generation framework is proposed that discovers audio-visual coherence via the proposed Asymmetric Mutual Information Estimator (AMIE) and a Dynamic Attention (DA) block that selectively focuses on the lip area of the input image during the training stage to further enhance lip synchronization.
Talking Face Generation by Adversarially Disentangled Audio-Visual Representation
- Computer Science · AAAI
- 2019
This work finds that the talking face sequence is actually a composition of both subject-related information and speech-related information, and learns a disentangled audio-visual representation, which has the advantage that both audio and video can serve as inputs for generation.