Corpus ID: 237592991

Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation

@article{Lu2021LiveSP,
  title={Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation},
  author={Yuanxun Lu and Jinxiang Chai and Xun Cao},
  journal={ArXiv},
  year={2021},
  volume={abs/2109.10595}
}
To the best of our knowledge, we present the first live system that generates personalized photorealistic talking-head animation driven only by audio signals, at over 30 fps. Our system consists of three stages. The first stage is a deep neural network that extracts deep audio features, together with a manifold projection that maps the features into the target person’s speech space. In the second stage, we learn facial dynamics and motions from the projected audio features. The predicted motions include…
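The staged design above lends itself to a modular implementation. Below is a minimal PyTorch sketch of the first two stages; every class name, dimension, and the softmax-weighted nearest-neighbor projection are illustrative assumptions rather than the authors' code (the paper describes an LLE-style reconstruction for the manifold projection).

import torch
import torch.nn as nn

class AudioFeatureExtractor(nn.Module):
    """Stage 1a: extract deep features from audio frames."""
    def __init__(self, in_dim=80, feat_dim=256):
        super().__init__()
        self.net = nn.LSTM(in_dim, feat_dim, num_layers=2, batch_first=True)

    def forward(self, mel):                  # mel: (B, T, 80) log-mel frames
        feats, _ = self.net(mel)
        return feats                         # (B, T, 256)

def manifold_projection(feats, bank, k=10):
    """Stage 1b: pull generic features toward the target person's speech
    space by reconstructing each frame from its k nearest neighbors in a
    precomputed per-person feature bank (bank: (N, 256)). Softmax weights
    are a simplification of the paper's LLE-style reconstruction."""
    d = torch.cdist(feats, bank.unsqueeze(0).expand(feats.shape[0], -1, -1))
    idx = d.topk(k, largest=False, dim=-1).indices        # (B, T, k)
    w = torch.softmax(-d.gather(-1, idx), dim=-1)         # (B, T, k)
    return (w.unsqueeze(-1) * bank[idx]).sum(dim=-2)      # (B, T, 256)

class MotionPredictor(nn.Module):
    """Stage 2: map projected audio features to facial dynamics and motion
    (here, 3D mouth-region landmark displacements as one plausible output)."""
    def __init__(self, feat_dim=256, n_landmarks=25):
        super().__init__()
        self.head = nn.Linear(feat_dim, n_landmarks * 3)

    def forward(self, feats):
        return self.head(feats)              # (B, T, 25 * 3)

# Stage 3 (not sketched): a conditional image-to-image renderer turns the
# predicted motion representation into photorealistic video frames.

A precomputed per-person feature bank keeps the projection a cheap lookup-and-blend step, which matters for a live pipeline that must sustain over 30 fps.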
Live Speech Portraits
  • Yuanxun Lu, Jinxiang Chai, Xun Cao
  • ACM Transactions on Graphics
  • 2021
DFA-NeRF: Personalized Talking Head Generation via Disentangled Face Attributes Neural Rendering
TLDR: This work proposes a novel framework based on a neural radiance field that takes lip-movement features and personalized attributes as two disentangled conditions, where the lip movements are predicted directly from the audio input to achieve lip-synchronized generation.
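Conditioning a radiance field on two separate codes, as this summary describes, can be sketched in a few lines. The module below is a hedged illustration only; the layer sizes, positional-encoding width, and single-MLP layout are assumptions, not DFA-NeRF's architecture.

import torch
import torch.nn as nn

class ConditionalNeRF(nn.Module):
    """Radiance-field MLP conditioned on two disentangled per-frame codes:
    an audio-predicted lip-motion feature and a personalized-attribute code."""
    def __init__(self, pos_dim=63, lip_dim=64, attr_dim=32, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(pos_dim + lip_dim + attr_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),       # (r, g, b, density) per sample point
        )

    def forward(self, x_enc, lip_code, attr_code):
        # x_enc: (R, 63) positionally encoded sample points along the rays
        # lip_code: (1, 64), attr_code: (1, 32) -- one pair per frame
        cond = torch.cat([lip_code, attr_code], dim=-1).expand(x_enc.shape[0], -1)
        return self.mlp(torch.cat([x_enc, cond], dim=-1))   # (R, 4)

# Example: 4096 ray samples rendered under one frame's conditions.
# out = ConditionalNeRF()(torch.randn(4096, 63), torch.randn(1, 64), torch.randn(1, 32))

Keeping the two codes in separate inputs is what makes the disentanglement usable: lip motion can be re-driven from new audio while the attribute code stays fixed.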

References

Showing 1–10 of 78 references
Deep video portraits
TLDR: Presents the first method to transfer the full 3D head position, head rotation, face expression, eye gaze, and eye blinking from a source actor to a portrait video of a target actor using only an input video.
Realistic Speech-Driven Facial Animation with GANs
TLDR: This work presents an end-to-end system that generates videos of a talking head, using only a still image of a person and an audio clip containing speech, without relying on handcrafted intermediate features.
End-to-End Speech-Driven Facial Animation with Temporal GANs
TLDR: This work presents a system for generating videos of a talking head from a still image of a person and an audio clip containing speech; it relies on no handcrafted intermediate features and is the first method capable of generating subject-independent realistic videos directly from raw audio.
Predicting head pose from speech
TLDR: Develops algorithms for content-driven speech animation: models that learn visual actions from data, without semantic labelling, to predict realistic speech animation from recorded audio.
Talking-head Generation with Rhythmic Head Motion
TLDR: This work proposes a 3D-aware generative network along with a hybrid embedding module and a non-linear composition module that achieves controllable, photo-realistic, and temporally coherent talking-head videos with natural head movements.
VisemeNet: Audio-Driven Animator-Centric Speech Animation
TLDR: Presents a novel deep-learning-based approach that produces animator-centric speech motion curves directly from input audio, driving a JALI or standard FACS-based production face rig and integrating seamlessly into existing animation pipelines.
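Motion curves for a rig are just per-control time series, so the summary above maps naturally onto a sequence-to-sequence regressor. The sketch below is a hypothetical stand-in, not VisemeNet's actual architecture; the feature size, viseme count, and bidirectional LSTM are assumptions.

import torch
import torch.nn as nn

class VisemeCurveNet(nn.Module):
    """Map per-frame audio features to per-frame activation curves for a
    fixed set of viseme / rig controls."""
    def __init__(self, audio_dim=26, n_visemes=20, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim, hidden, num_layers=3,
                           batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_visemes)

    def forward(self, audio):               # audio: (B, T, 26), e.g. MFCCs
        h, _ = self.rnn(audio)
        # Sigmoid keeps each curve in [0, 1], a typical range for rig controls.
        return torch.sigmoid(self.out(h))   # (B, T, n_visemes) motion curves

Outputting curves rather than pixels is what makes such a method animator-centric: the predictions remain editable controls inside an existing production pipeline.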
Audio-driven facial animation by joint end-to-end learning of pose and emotion
TLDR: This work presents a machine learning technique for driving 3D facial animation by audio input in real time and with low latency, which simultaneously discovers a compact latent code that disambiguates the variations in facial expression that cannot be explained by the audio alone.
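The learned latent code mentioned above can be realized as a trainable per-frame embedding optimized jointly with the network. The sketch below illustrates that one idea under assumed dimensions; it is not the cited paper's convolutional architecture.

import torch
import torch.nn as nn

class AudioToFace(nn.Module):
    """Regress 3D face vertices from audio features plus a per-training-frame
    latent 'emotion' vector that absorbs variation the audio cannot explain."""
    def __init__(self, audio_dim=32, emo_dim=16, n_vertices=5023, n_frames=10000):
        super().__init__()
        self.emotion = nn.Embedding(n_frames, emo_dim)  # one code per frame
        self.net = nn.Sequential(
            nn.Linear(audio_dim + emo_dim, 256), nn.ReLU(),
            nn.Linear(256, n_vertices * 3),
        )

    def forward(self, audio_feat, frame_idx):
        # audio_feat: (B, 32); frame_idx: (B,) training-frame indices
        z = self.emotion(frame_idx)                         # (B, 16)
        return self.net(torch.cat([audio_feat, z], dim=-1)) # (B, 5023 * 3)

At inference time such a code can be held fixed or interpolated to steer expression style, since it is no longer tied to a training frame.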
Capture, Learning, and Synthesis of 3D Speaking Styles
TLDR: Introduces a unique 4D face dataset with about 29 minutes of 4D scans captured at 60 fps and synchronized audio from 12 speakers, and learns VOCA (Voice Operated Character Animation), the only realistic 3D facial animation model that is readily applicable to unseen subjects without retargeting.
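Applicability to unseen subjects without retargeting, as claimed above, follows from predicting offsets over whatever neutral template mesh is supplied. Below is a rough, assumption-laden sketch of that conditioning pattern; the feature and vertex counts are illustrative, not the released model.

import torch
import torch.nn as nn

class VOCALike(nn.Module):
    """Predict per-vertex offsets from audio features and a speaker-identity
    encoding; offsets are added to the subject's neutral template mesh."""
    def __init__(self, audio_dim=29, n_speakers=12, n_vertices=5023):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + n_speakers, 256), nn.ReLU(),
            nn.Linear(256, n_vertices * 3),
        )

    def forward(self, audio_feat, speaker_onehot, template):
        # template: (B, 5023, 3) neutral mesh of the subject to animate
        x = torch.cat([audio_feat, speaker_onehot], dim=-1)
        offsets = self.net(x).view(-1, template.shape[1], 3)
        return template + offsets           # animated vertices

Because the network only learns displacements, swapping in a new subject's template animates that subject directly, with no retargeting pass.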
Video-audio driven real-time facial animation
TLDR: Presents a real-time facial tracking and animation system, based on a Kinect sensor with video and audio input, that efficiently fuses visual and acoustic information for 3D facial performance capture and generates more accurate 3D mouth motions than approaches based on audio or video input alone.
Joint Learning of Facial Expression and Head Pose from Speech
TLDR: Defines a model architecture that encourages learning of rigid head motion via the latent space of the speaker’s facial activity; the resulting model predicts lip sync and other facial motion, along with rigid head motion, directly from audible speech.