Production-level facial performance capture using deep convolutional neural networks

  • Samuli Laine, Tero Karras, Timo Aila, Antti Herva, Shunsuke Saito, Ronald Yu, Hao Li, Jaakko Lehtinen
  • Published 21 September 2016
  • Computer Science
  • Proceedings of the ACM SIGGRAPH / Eurographics Symposium on Computer Animation
We present a real-time deep learning framework for video-based facial performance capture---the dense 3D tracking of an actor's face given a monocular video. Our pipeline begins with accurately capturing a subject using a high-end production facial capture pipeline based on multi-view stereo tracking and artist-enhanced animations. With 5--10 minutes of captured footage, we train a convolutional neural network to produce high-quality output, including self-occluded regions, from a monocular… 
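
The abstract frames capture as supervised regression: each monocular video frame maps to a dense vector of 3D vertex positions, learned from a few minutes of captured footage. As a rough illustration of that data flow only (not the paper's actual network or dimensions, which are all assumptions here), the sketch below uses a linear least-squares model in place of the CNN:

```python
import numpy as np

# Minimal sketch of the frame -> dense-vertex-positions regression that the
# abstract describes. A real system trains a deep CNN on video frames; here
# a linear least-squares fit stands in purely to show the supervised setup.
# All sizes are illustrative assumptions, not the paper's dimensions.

rng = np.random.default_rng(0)

n_frames = 200        # frames sampled from the captured footage
frame_dim = 64        # flattened, downsampled frame (assumption)
n_vertices = 50       # tracked mesh vertices (real rigs use thousands)

# Synthetic "footage": input frames X and target vertex positions Y,
# each target a flat vector of n_vertices * 3 coordinates.
W_true = rng.normal(size=(frame_dim, n_vertices * 3))
X = rng.normal(size=(n_frames, frame_dim))
Y = X @ W_true + 0.01 * rng.normal(size=(n_frames, n_vertices * 3))

# Fit the regressor: least squares stands in for CNN training with an
# L2 loss on vertex positions.
W_fit, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Predict dense vertex positions for a new, unseen frame.
new_frame = rng.normal(size=(1, frame_dim))
pred_vertices = (new_frame @ W_fit).reshape(n_vertices, 3)
print(pred_vertices.shape)  # (50, 3)
```

The per-frame, feed-forward structure is what makes the real-time claim plausible: once trained, inference is a single forward pass per video frame, with no per-frame optimization.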

Related papers

High-Quality Real Time Facial Capture Based on Single Camera

A real-time deep learning framework for video-based facial expression capture that can drastically reduce the amount of labor involved in the development of modern narrative-driven video games or films involving realistic digital doubles of actors and potentially hours of animated dialogue per character.

Real-time 3D neural facial animation from binocular video

The system's ability to precisely capture subtle facial motions in unconstrained scenarios is demonstrated, in comparison to competing methods, on a diverse collection of identities, expressions, and real-world environments.

User‐Guided Lip Correction for Facial Performance Capture

A novel user‐guided approach to correcting common lip-shape errors in traditional capture systems: the user manually corrects a small number of problematic frames, and the system then learns the types of corrections desired and automatically corrects the entire performance.

Capture, Learning, and Synthesis of 3D Speaking Styles

A unique 4D face dataset with about 29 minutes of 4D scans captured at 60 fps and synchronized audio from 12 speakers is introduced and VOCA (Voice Operated Character Animation) is learned, the only realistic 3D facial animation model that is readily applicable to unseen subjects without retargeting.

Real-Time 3D Facial Tracking via Cascaded Compositional Learning

The experimental results indicate that a model trained purely on synthetic facial imagery generalizes poorly to unconstrained real-world data, and that including synthetic faces in training benefits tracking in certain scenarios but degrades the tracking model's generalization ability.

Learning Dense Facial Correspondences in Unconstrained Images

This work presents a minimalist but effective neural network that computes dense facial correspondences in highly unconstrained RGB images and demonstrates successful per-frame processing under extreme pose variations, occlusions, and lighting conditions.

Modality Dropout for Improved Performance-driven Talking Faces

This work uses subjective testing to demonstrate the improvement of audiovisual-driven animation over the equivalent video-only approach, and the improvement in the animation of speech-related facial movements after introducing modality dropout.

Dynamic facial asset and rig generation from a single scan

This work proposes a framework for the automatic generation of high-quality dynamic facial models, including rigs which can be readily deployed for artists to polish, and demonstrates a highly robust and effective framework on a wide range of subjects.

Self-supervised CNN for Unconstrained 3D Facial Performance Capture from an RGB-D Camera

A novel method for real-time 3D facial performance capture with consumer-level RGB-D sensors that is robust to severe occlusion, fast motion, large rotation, exaggerated facial expressions, and diverse lighting, along with new ways of augmenting the training data set.

Fast and deep facial deformations

This paper presents a method using convolutional neural networks for approximating the mesh deformations of characters' faces that runs up to 17 times faster than the original facial rig while still maintaining a high level of fidelity to the original rig.

Real-Time Facial Segmentation and Performance Capture from RGB Input

A state-of-the-art regression-based facial tracking framework with segmented face images as training is adopted, and accurate and uninterrupted facial performance capture is demonstrated in the presence of extreme occlusion and even side views.

Video-audio driven real-time facial animation

A real-time facial tracking and animation system based on a Kinect sensor with video and audio input that efficiently fuses visual and acoustic information for 3D facial performance capture and generates more accurate 3D mouth motions than other approaches that are based on audio or video input only.

High-fidelity facial and speech animation for VR HMDs

This work introduces a novel system for HMD users to control a digital avatar in real-time while producing plausible speech animation and emotional expressions and demonstrates the quality of the system on a variety of subjects and evaluates its performance against state-of-the-art real-time facial tracking techniques.

Automatic acquisition of high-fidelity facial performances using monocular videos

A facial performance capture system that automatically captures high-fidelity facial performances from uncontrolled monocular videos, using per-pixel shading cues to add fine-scale surface details, such as emerging or disappearing wrinkles and folds, to the large-scale facial deformation and improve the accuracy of facial reconstruction.

Realtime facial animation with on-the-fly correctives

It is demonstrated that using an adaptive PCA model not only improves the fitting accuracy for tracking but also increases the expressiveness of the retargeted character.

Driving High-Resolution Facial Scans with Video Performance Capture

A process for rendering a realistic facial performance with control of viewpoint and illumination that optimally combines the weighted triangulation constraints, along with a shape-regularization term, into a consistent 3D geometry solution over the entire performance that is drift-free by construction.

Face2Face: Real-Time Face Capture and Reenactment of RGB Videos

A novel approach for real-time facial reenactment of a monocular target video sequence (e.g., a YouTube video) that addresses the under-constrained problem of facial identity recovery from monocular video via non-rigid model-based bundling and re-renders the manipulated output video in a photo-realistic fashion.

Real-time facial animation on mobile devices

Realtime performance-based facial animation

A novel face tracking algorithm that combines geometry and texture registration with pre-recorded animation priors in a single optimization is introduced that demonstrates that compelling 3D facial dynamics can be reconstructed in realtime without the use of face markers, intrusive lighting, or complex scanning hardware.

Displaced dynamic expression regression for real-time facial tracking and animation

This work presents a fully automatic approach to real-time facial tracking and animation with a single video camera that learns a generic regressor from public image datasets to infer accurate 2D facial landmarks as well as the 3D facial shape from 2D video frames.