Corpus ID: 237592991

Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation

@article{Lu2021LiveSP,
  title={Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation},
  author={Yuanxun Lu and Jinxiang Chai and Xun Cao},
  journal={ArXiv},
  year={2021},
  volume={abs/2109.10595}
}
  • Yuanxun Lu, Jinxiang Chai, Xun Cao
  • Published 22 September 2021
  • Computer Science
  • ArXiv
We present what is, to the best of our knowledge, the first live system that generates personalized photorealistic talking-head animation driven only by audio signals at over 30 fps. Our system contains three stages. The first stage is a deep neural network that extracts deep audio features, along with a manifold projection that maps the features to the target person’s speech space. In the second stage, we learn facial dynamics and motions from the projected audio features. The predicted motions include…
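
The abstract mentions a manifold projection of audio features onto the target person's speech space but gives no details here. As a rough illustration only, one common way to realize such a projection is a locally linear reconstruction: find the nearest stored target features and rebuild the input as a weighted combination of them. The sketch below assumes that approach; the function name `project_to_manifold`, the parameters `k` and `reg`, and the use of Euclidean distance are all hypothetical choices, not the authors' confirmed method.

```python
import numpy as np


def project_to_manifold(feature, target_features, k=3, reg=1e-3):
    """Hypothetical sketch: rebuild `feature` from its k nearest target features.

    feature:         (d,) audio feature from an arbitrary speaker
    target_features: (n, d) database of the target person's features
    Returns a (d,) vector in the affine hull of the k nearest neighbours.
    """
    feature = np.asarray(feature, dtype=float)
    target_features = np.asarray(target_features, dtype=float)

    # Pick the k closest stored features (Euclidean distance is an assumption).
    dists = np.linalg.norm(target_features - feature, axis=1)
    idx = np.argsort(dists)[:k]
    neighbors = target_features[idx]                      # (k, d)

    # Locally linear weights: minimize ||feature - w @ neighbors||^2
    # subject to sum(w) == 1, via the regularized local Gram matrix.
    diffs = neighbors - feature                           # (k, d)
    gram = diffs @ diffs.T                                # (k, k)
    gram += reg * max(np.trace(gram), 1e-12) * np.eye(len(idx))
    weights = np.linalg.solve(gram, np.ones(len(idx)))
    weights /= weights.sum()

    return weights @ neighbors


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    database = rng.normal(size=(100, 8))   # stand-in for target speech features
    query = rng.normal(size=8)             # stand-in for a new audio feature
    print(project_to_manifold(query, database, k=5).shape)
```

The sum-to-one constraint keeps the output an affine combination of features the target actually produced, which is the intuition behind mapping an unseen voice into a known speaker's feature space. The regularization term only guards against a singular Gram matrix when neighbours are nearly collinear.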
