AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis

@inproceedings{guo2021adnerf,
  title={AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis},
  author={Yudong Guo and Keyu Chen and Sen Liang and Yongjin Liu and Hujun Bao and Juyong Zhang},
  booktitle={2021 IEEE/CVF International Conference on Computer Vision (ICCV)},
}
Generating a high-fidelity talking head video that matches an input audio sequence is a challenging problem that has received considerable attention recently. In this paper, we address this problem with the aid of neural scene representation networks. Our method is fundamentally different from existing methods that rely on intermediate representations, such as 2D landmarks or 3D face models, to bridge the gap between audio input and video output. Specifically, the feature of the input audio signal is…
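The core idea the abstract describes — feeding an audio feature directly into an implicit radiance field instead of going through landmarks or a 3D face model — can be illustrated with a toy sketch. This is not the paper's implementation: the audio feature dimension, network sizes, and random (untrained) weights below are all placeholder assumptions; a real model would condition on learned per-frame speech features (e.g. DeepSpeech-style) and be trained per subject.

```python
import numpy as np

def positional_encoding(x, num_freqs=4):
    """Map 3D coordinates to sin/cos features, as in NeRF."""
    freqs = 2.0 ** np.arange(num_freqs)           # (F,)
    angles = x[..., None] * freqs                 # (..., 3, F)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)         # (..., 3 * 2 * F)

class AudioConditionedField:
    """Toy audio-conditioned radiance field:
    (3D sample point, audio feature) -> (rgb, density).
    Weights are random here; a trained model would learn them per subject."""
    def __init__(self, audio_dim=16, hidden=64, num_freqs=4, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = 3 * 2 * num_freqs + audio_dim
        self.num_freqs = num_freqs
        self.w1 = rng.normal(scale=0.1, size=(in_dim, hidden))
        self.w2 = rng.normal(scale=0.1, size=(hidden, 4))  # 3 color channels + density

    def __call__(self, xyz, audio_feat):
        # Concatenate the encoded position with the (broadcast) audio feature.
        h = np.concatenate(
            [positional_encoding(xyz, self.num_freqs),
             np.broadcast_to(audio_feat, (*xyz.shape[:-1], audio_feat.shape[-1]))],
            axis=-1)
        h = np.maximum(h @ self.w1, 0.0)            # ReLU hidden layer
        out = h @ self.w2
        rgb = 1.0 / (1.0 + np.exp(-out[..., :3]))   # sigmoid -> colors in [0, 1]
        sigma = np.maximum(out[..., 3:], 0.0)       # non-negative density
        return rgb, sigma

field = AudioConditionedField()
xyz = np.random.default_rng(1).uniform(-1, 1, size=(1024, 3))  # sampled ray points
audio = np.zeros(16)                                           # placeholder audio feature
rgb, sigma = field(xyz, audio)
print(rgb.shape, sigma.shape)  # (1024, 3) (1024, 1)
```

Because the audio feature enters the field as an extra conditioning input, changing the audio changes the predicted color and density at each sample point, which is what lets the rendered head follow the speech without any intermediate landmark or 3DMM stage.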


Learning Dynamic Facial Radiance Fields for Few-Shot Talking Head Synthesis

Unlike existing NeRF-based methods, which directly encode the 3D geometry and appearance of a specific person into the network, DFRF conditions the face radiance field on 2D appearance images to learn a face prior, which can be flexibly adapted to a new identity with only a few reference images.

DialogueNeRF: Towards Realistic Avatar Face-to-face Conversation Video Generation

A new framework is proposed that utilizes a series of conversation signals, e.g., audio, head pose, and expression, to synthesize face-to-face conversation videos between human avatars, with all the interlocutors modeled within the same network.

Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation

Semantic-aware Speaking Portrait NeRF (SSPNeRF) creates delicate audio-driven portraits using a single unified NeRF through two semantic-aware modules, and renders more realistic video portraits than previous methods.

DFA-NeRF: Personalized Talking Head Generation via Disentangled Face Attributes Neural Rendering

This work proposes a novel framework based on a neural radiance field that takes lip movement features and personalized attributes as two disentangled conditions, where lip movements are predicted directly from the audio inputs to achieve lip-synchronized generation.

Multimodal Image Synthesis and Editing: A Survey

This survey comprehensively contextualizes the recent advances in multimodal image synthesis and editing and formulates taxonomies according to data modality and model architecture.

Deep Person Generation: A Survey from the Perspective of Face, Pose and Cloth Synthesis

The scope of person generation is summarized, and recent progress and technical trends in deep person generation are systematically reviewed, covering three major tasks: talking-head generation (face), pose-guided person generation (pose), and garment-oriented person generation (cloth).

SelfNeRF: Fast Training NeRF for Human from Monocular Self-rotating Video

In this paper, we propose SelfNeRF, an efficient neural radiance field based novel view synthesis method for human performance. Given monocular self-rotating videos of human performers, SelfNeRF can…

NeRF: Neural Radiance Field in 3D Vision, A Comprehensive Review

Neural Radiance Field (NeRF), a novel view synthesis technique with implicit scene representation, has taken the field of Computer Vision by storm. As a novel view synthesis and 3D reconstruction method,…

3DMM-RF: Convolutional Radiance Fields for 3D Face Modeling

This work presents a facial 3D Morphable Model that can accurately model a subject's identity, pose, and expression and render it under arbitrary illumination, and introduces a style-based generative network that synthesizes in one pass all and only the required rendering samples of a neural radiance field.

Explicitly Controllable 3D-Aware Portrait Generation

This work proposes a network that generates 3D-aware portraits while being controllable according to semantic parameters regarding pose, identity, expression and illumination, and demonstrates generalization ability to real images as well as out-of-domain data.

Audio-driven Talking Face Video Generation with Natural Head Pose

A deep neural network model is proposed that takes an audio signal A of a source person and a very short video V of a target person as input, and outputs a synthesized high-quality talking face video with natural head pose, expression, and lip synchronization, outperforming the state-of-the-art methods.

Realistic Speech-Driven Facial Animation with GANs

This work presents an end-to-end system that generates videos of a talking head, using only a still image of a person and an audio clip containing speech, without relying on handcrafted intermediate features.

Hierarchical Cross-Modal Talking Face Generation With Dynamic Pixel-Wise Loss

A cascade GAN approach to generate talking face video, which is robust to different face shapes, view angles, facial characteristics, and noisy audio conditions, and compared to a direct audio-to-image approach, this approach avoids fitting spurious correlations between audiovisual signals that are irrelevant to the speech content.

One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing

A neural talking-head video synthesis model is proposed that learns to synthesize a talking-head video using a source image containing the target person's appearance and a driving video that dictates the motion in the output.

Neural Voice Puppetry: Audio-driven Facial Reenactment

This work presents Neural Voice Puppetry, a novel approach for audio-driven facial video synthesis that generalizes across different people, allowing it to synthesize videos of a target actor with the voice of any unknown source actor or even synthetic voices that can be generated utilizing standard text-to-speech approaches.

Audio-driven facial animation by joint end-to-end learning of pose and emotion

This work presents a machine learning technique for driving 3D facial animation by audio input in real time and with low latency, and simultaneously discovers a compact, latent code that disambiguates the variations in facial expression that cannot be explained by the audio alone.

VisemeNet: Audio-Driven Animator-Centric Speech Animation

A novel deep-learning based approach to producing animator-centric speech motion curves that drive a JALI or standard FACS-based production face-rig, directly from input audio, that integrates seamlessly into existing animation pipelines.

D-NeRF: Neural Radiance Fields for Dynamic Scenes

D-NeRF is introduced, a method that extends neural radiance fields to the dynamic domain, allowing it to reconstruct and render novel images of objects under rigid and non-rigid motions from a single camera moving around the scene.

Capture, Learning, and Synthesis of 3D Speaking Styles

A unique 4D face dataset with about 29 minutes of 4D scans captured at 60 fps and synchronized audio from 12 speakers is introduced and VOCA (Voice Operated Character Animation) is learned, the only realistic 3D facial animation model that is readily applicable to unseen subjects without retargeting.

Learning Compositional Radiance Fields of Dynamic Human Heads

This work proposes a novel compositional 3D representation that combines the best of previous methods to produce both higher-resolution and faster results and shows that the learned dynamic radiance field can be used to synthesize novel unseen expressions based on a global animation code.