Learning Speech-driven 3D Conversational Gestures from Video
@article{Habibie2021LearningS3,
  title   = {Learning Speech-driven 3D Conversational Gestures from Video},
  author  = {Ikhsanul Habibie and Weipeng Xu and Dushyant Mehta and Lingjie Liu and Hans-Peter Seidel and Gerard Pons-Moll and Mohamed A. Elgharib and Christian Theobalt},
  journal = {Proceedings of the 21st ACM International Conference on Intelligent Virtual Agents},
  year    = {2021}
}
We propose the first approach to synthesize the synchronous 3D conversational body and hand gestures, as well as 3D face and head animations, of a virtual character from speech input. Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures. Synthesis of conversational body gestures is a multi-modal problem since many similar gestures can plausibly accompany the same input speech. To synthesize plausible body gestures in this…
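As a rough illustration of the pipeline the abstract describes, here is a minimal sketch of a shared-trunk 1D CNN that regresses body/hand pose and face parameters from a window of audio features. All layer widths, feature sizes, and names are illustrative assumptions, not the paper's actual architecture, and the generative component for multi-modal body-gesture synthesis is omitted.

```python
# Minimal sketch (not the authors' implementation): a 1D CNN mapping a
# window of audio features to body/hand pose and face parameters.
import torch
import torch.nn as nn

class SpeechToGestureCNN(nn.Module):
    def __init__(self, n_audio_feats=64, n_pose=63, n_face=70):
        super().__init__()
        # Shared temporal-convolution trunk over the audio feature sequence,
        # so the pose and face heads can exploit their mutual correlation.
        self.trunk = nn.Sequential(
            nn.Conv1d(n_audio_feats, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.pose_head = nn.Conv1d(256, n_pose, kernel_size=1)  # body + hands
        self.face_head = nn.Conv1d(256, n_face, kernel_size=1)  # expression

    def forward(self, audio):                 # audio: (batch, feats, frames)
        h = self.trunk(audio)
        return self.pose_head(h), self.face_head(h)

model = SpeechToGestureCNN()
pose, face = model(torch.randn(2, 64, 30))    # (2, 63, 30), (2, 70, 30)
```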
23 Citations
A Motion Matching-based Framework for Controllable Gesture Synthesis from Speech
- Computer Science · SIGGRAPH
- 2022
This work proposes an approach for generating controllable 3D gestures that combines the advantages of database matching and deep generative modelling: a conditional Generative Adversarial Network provides a data-driven refinement of the k-NN matching result by comparing its plausibility against ground-truth audio-gesture pairs.
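For intuition, a minimal sketch of the database-matching half of such a pipeline follows, assuming precomputed per-frame audio features; the cGAN refinement described above is omitted, and all names and dimensions are hypothetical.

```python
# Hypothetical sketch of k-NN audio-to-gesture matching.
import numpy as np

def knn_match(query_audio, db_audio, db_gestures, k=5):
    """Return the k database gesture clips whose stored audio features
    are closest (Euclidean distance) to the query audio feature."""
    dists = np.linalg.norm(db_audio - query_audio, axis=1)
    nearest = np.argsort(dists)[:k]
    return db_gestures[nearest]

db_audio = np.random.randn(1000, 64)     # 1000 stored audio feature vectors
db_gestures = np.random.randn(1000, 63)  # the paired pose vectors
candidates = knn_match(np.random.randn(64), db_audio, db_gestures)
```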
FaceFormer: Speech-Driven 3D Facial Animation with Transformers
- Computer Science · 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2022
A Transformer-based autoregressive model, FaceFormer, is proposed; it encodes long-term audio context, autoregressively predicts a sequence of animated 3D face meshes, and employs two biased attention mechanisms well suited to this specific task.
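The core pattern here, autoregressive decoding conditioned on encoded audio, can be sketched as below. The use of PyTorch's generic TransformerDecoder and a plain causal mask is an assumption for illustration; FaceFormer's actual biased attention mechanisms differ.

```python
# Toy sketch of autoregressive frame prediction with a causal mask.
import torch
import torch.nn as nn

layer = nn.TransformerDecoderLayer(d_model=64, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=2)

audio_ctx = torch.randn(1, 100, 64)  # encoded audio sequence (memory)
frames = torch.zeros(1, 1, 64)       # start token for the face-motion sequence
for _ in range(30):                  # generate 30 face frames, one at a time
    T = frames.size(1)
    causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
    out = decoder(frames, audio_ctx, tgt_mask=causal)
    frames = torch.cat([frames, out[:, -1:]], dim=1)  # feed back last frame
```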
CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior
- Computer Science · ArXiv
- 2023
This paper proposes to cast speech-driven facial animation as a code-query task in the proxy space of a learned discrete codebook, which effectively promotes the vividness of the generated motions by reducing cross-modal mapping uncertainty.
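The "code query" idea is essentially a nearest-neighbour lookup into a learned discrete codebook, as in vector-quantized models; a toy sketch with assumed sizes (not CodeTalker's) follows.

```python
# Illustrative sketch of quantizing a feature against a learned codebook.
import torch

codebook = torch.randn(256, 64)        # 256 learned motion codes, dim 64

def quantize(z):
    """Replace each continuous feature with its nearest codebook entry."""
    dists = torch.cdist(z, codebook)   # (batch, 256) pairwise distances
    idx = dists.argmin(dim=1)          # index of the nearest code
    return codebook[idx], idx

z_q, codes = quantize(torch.randn(8, 64))
```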
Audio-Driven Co-Speech Gesture Video Generation
- Computer Science · ArXiv
- 2022
A novel framework is proposed to effectively capture reusable co-speech gesture patterns as well as subtle rhythmic movements; an unsupervised motion representation, rather than a structural human-body prior, is used to achieve high-fidelity image-sequence generation.
Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation
- Computer Science · 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2022
This work proposes a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation, and develops a contrastive learning strategy based on audio-text alignment for better audio representations.
Automatic facial expressions, gaze direction and head movements generation of a virtual agent
- Computer Science · ICMI Companion
- 2022
Two models that jointly and automatically generate the head, facial, and gaze movements of a virtual agent from acoustic speech features are presented; on 15-second sequences, the encoder-decoder architecture drastically improves the perception of generated behaviours on two criteria: coordination with speech and naturalness.
BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for Conversational Gestures Synthesis
- Computer Science · ECCV
- 2022
This work builds BEAT, the largest motion-capture dataset for investigating human gestures; proposes a baseline model, CaMN, which models six modalities in a cascaded architecture for gesture synthesis; and introduces a metric, Semantic Relevance Gesture Recall (SRGR).
Pose augmentation: mirror the right way
- Computer Science · IVA
- 2022
It is demonstrated that naive mirroring for augmentation has a detrimental effect on model performance, whereas providing a virtual speaker-identity embedding improves performance over no augmentation and is competitive with a model trained on an equal number of real samples.
Integrated Speech and Gesture Synthesis
- Computer Science · ICMI
- 2021
The results show that, in all three tests, participants rate one of the proposed integrated synthesis models as being as good as the state-of-the-art pipeline system the authors compare against.
Speech-Driven 3D Facial Animation with Mesh Convolution
- Computer Science · 2022 International Conference on Image Processing, Computer Vision and Machine Learning (ICICML)
- 2022
Facial animation has long been a high-profile problem in computer graphics and computer vision and has been extensively studied, but realistic speech-driven 3D facial animation remains a challenging…
References
SHOWING 1-10 OF 61 REFERENCES
Synthesising 3D Facial Motion from “In-the-Wild” Speech
- Computer Science · 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020)
- 2020
This paper introduces the first methodology for speaker-independent 3D facial motion synthesis from speech captured in arbitrary recording conditions ("in-the-wild"), and shows the ability of a deep learning model to synthesise 3D facial motion for different speakers and continuous speech signals in uncontrolled conditions.
End-to-end Learning for 3D Facial Animation from Speech
- Computer Science · ICMI
- 2018
A deep learning framework for real-time speech-driven 3D facial animation from speech audio is presented; it automatically estimates the emotional intensity of the speaker and reproduces the speaker's changing affective states by adjusting the strength of related facial unit activations.
Gesture generation with low-dimensional embeddings
- Computer Science, Psychology · AAMAS
- 2014
A novel machine learning approach is presented that decomposes the overall learning problem into two mappings: from speech to gestural annotation, and from gestural annotation to gesture motion.
Audio-driven facial animation by joint end-to-end learning of pose and emotion
- Computer Science · ACM Trans. Graph.
- 2017
This work presents a machine learning technique for driving 3D facial animation by audio input in real time and with low latency, and simultaneously discovers a compact, latent code that disambiguates the variations in facial expression that cannot be explained by the audio alone.
Multi-objective adversarial gesture generation
- Computer Science · MIG
- 2019
This work explores the use of a generative adversarial training paradigm to map speech to 3D gesture motion in a series of smaller sub-problems, including plausible gesture dynamics, realistic joint configurations, and diverse and smooth motion.
Evaluation of Speech-to-Gesture Generation Using Bi-Directional LSTM Network
- Computer Science · IVA
- 2018
A novel framework is presented that automatically generates natural gesture motions accompanying speech from audio utterances, based on a bi-directional LSTM network that regresses a full 3D skeletal pose of a human at each time step from perceptual features extracted from the input audio.
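A minimal sketch of this kind of per-frame regressor, with assumed feature and joint counts rather than the paper's, might look as follows.

```python
# Assumed-shape sketch: a bidirectional LSTM regressing 3D pose per frame.
import torch
import torch.nn as nn

class BiLSTMPoseRegressor(nn.Module):
    def __init__(self, n_feats=26, n_joints=21, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(n_feats, hidden, batch_first=True,
                            bidirectional=True)
        # 2 * hidden because forward and backward states are concatenated.
        self.out = nn.Linear(2 * hidden, n_joints * 3)   # xyz per joint

    def forward(self, feats):                 # feats: (batch, T, n_feats)
        h, _ = self.lstm(feats)
        return self.out(h)                    # (batch, T, n_joints * 3)

poses = BiLSTMPoseRegressor()(torch.randn(4, 100, 26))
```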
Predicting Co-verbal Gestures: A Deep and Temporal Modeling Approach
- Computer Science · IVA
- 2015
A gestural sign scheme that facilitates supervised learning is presented, together with the DCNF model, which jointly learns deep neural networks and a second-order linear-chain temporal contingency; the approach shows significant improvement over previous work on gesture prediction.
Talking With Hands 16.2M: A Large-Scale Dataset of Synchronized Body-Finger Motion and Audio for Conversational Motion Analysis and Synthesis
- Computer Science · 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2019
The qualitative user study shows that finger motion generated by the novel real-time finger-motion synthesis method is perceived as natural and conversation-enhancing, while the quantitative ablation study demonstrates the effectiveness of the IK loss.
Video-audio driven real-time facial animation
- Computer Science · ACM Trans. Graph.
- 2015
A real-time facial tracking and animation system based on a Kinect sensor with video and audio input that efficiently fuses visual and acoustic information for 3D facial performance capture and generates more accurate 3D mouth motions than other approaches that are based on audio or video input only.
Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows
- Computer Science · Comput. Graph. Forum
- 2020
This paper proposes a new generative model, called MoGlow, for generating state-of-the-art realistic speech-driven gesticulation, and demonstrates the ability to exert directorial control over the output style, such as gesture level, speed, symmetry, and spatial extent.