Learning Speech-driven 3D Conversational Gestures from Video

  @inproceedings{habibie2021gestures,
    title={Learning Speech-driven 3D Conversational Gestures from Video},
    author={Ikhsanul Habibie and Weipeng Xu and Dushyant Mehta and Lingjie Liu and Hans-Peter Seidel and Gerard Pons-Moll and Mohamed A. Elgharib and Christian Theobalt},
    booktitle={Proceedings of the 21st ACM International Conference on Intelligent Virtual Agents},
    year={2021},
  }
  • Published 13 February 2021
We propose the first approach to synthesize the synchronous 3D conversational body and hand gestures, as well as 3D face and head animations, of a virtual character from speech input. Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures. Synthesis of conversational body gestures is a multi-modal problem since many similar gestures can plausibly accompany the same input speech. To synthesize plausible body gestures in this… 


A Motion Matching-based Framework for Controllable Gesture Synthesis from Speech

This work proposes an approach for generating controllable 3D gestures that combines the advantages of database matching and deep generative modeling. A conditional generative adversarial network provides a data-driven refinement of the k-NN matching result by comparing its plausibility against ground-truth audio-gesture pairs.
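
The database-matching half of such a pipeline can be illustrated with a minimal k-NN lookup over audio features; the function name, the toy feature vectors, and the clip labels below are hypothetical stand-ins, not the paper's actual data or API:

```python
import math

def knn_match(query_feat, database, k=3):
    """Return the ids of the k database gesture clips whose audio
    features are closest (Euclidean distance) to the query features.
    Each database item is an (audio_feature_vector, gesture_clip_id)
    pair -- a toy structure for illustration only."""
    scored = []
    for feat, clip_id in database:
        dist = math.sqrt(sum((q - f) ** 2 for q, f in zip(query_feat, feat)))
        scored.append((dist, clip_id))
    scored.sort(key=lambda t: t[0])
    return [clip_id for _, clip_id in scored[:k]]

# Toy database mapping 2-D audio features to gesture clip ids.
db = [([0.0, 0.0], "rest"), ([1.0, 0.0], "beat"),
      ([0.0, 1.0], "point"), ([5.0, 5.0], "wave")]
nearest = knn_match([0.9, 0.1], db, k=2)
```

A real system would match over windows of prosodic features and then refine the retrieved motion, as the GAN stage in the summary above does.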

FaceFormer: Speech-Driven 3D Facial Animation with Transformers

A Transformer-based autoregressive model, FaceFormer, is proposed, which encodes long-term audio context and autoregressively predicts a sequence of animated 3D face meshes, using two biased attention mechanisms devised for this specific task.
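
As a rough illustration of biased causal attention, the sketch below builds an attention-bias matrix in which future frames are fully masked (for autoregressive prediction) and attention to past frames is penalized by bucketed temporal distance; the exact bias form and the `period` parameter are assumptions for illustration, not FaceFormer's implementation:

```python
def biased_causal_mask(n, period=2):
    """Build an n x n additive attention bias: entries above the
    diagonal are -inf (no attention to future frames), and past
    entries receive a penalty that grows with temporal distance,
    bucketed by `period` frames."""
    NEG_INF = float("-inf")
    mask = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j > i:                      # mask future frames
                mask[i][j] = NEG_INF
            else:                          # farther past -> larger penalty
                mask[i][j] = -((i - j) // period)
    return mask

m = biased_causal_mask(4, period=2)
```

Such a matrix would be added to the attention logits before the softmax, steering each frame toward its recent temporal neighbourhood.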

CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior

This paper proposes to cast speech-driven facial animation as a code query task in a proxy space of the learned codebook, which effectively promotes the vividness of the generated motions by reducing the cross-modal mapping uncertainty.
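
The code-query step central to such discrete-prior methods amounts to a nearest-neighbour lookup in a learned codebook; the codebook values below are toys and `quantize` is an illustrative name, not the paper's API:

```python
def quantize(feature, codebook):
    """Map a continuous motion feature to its nearest codebook entry
    by squared Euclidean distance -- the 'code query' at the heart of
    discrete-motion-prior methods. Returns the index and the code."""
    best_idx = min(
        range(len(codebook)),
        key=lambda i: sum((f - c) ** 2 for f, c in zip(feature, codebook[i])),
    )
    return best_idx, codebook[best_idx]

# Toy 2-D codebook; a real model learns these vectors (e.g. via VQ-VAE).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]
idx, code = quantize([0.9, 1.2], codebook)
```

Restricting outputs to a finite set of learned codes is what reduces the cross-modal mapping uncertainty the summary mentions.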

Audio-Driven Co-Speech Gesture Video Generation

A novel framework is proposed that effectively captures reusable co-speech gesture patterns as well as subtle rhythmic movements, using an unsupervised motion representation instead of a structural human body prior to achieve high-fidelity image sequence generation.

Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation

  • Xian Liu, Qianyi Wu, Bolei Zhou
  • Computer Science
    2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2022
This work proposes a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation, and develops a contrastive learning strategy based on audio-text alignment for better audio representations.
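
The contrastive-alignment idea can be sketched with a scalar InfoNCE-style loss over one anchor's similarities to a set of candidates; the similarity values and temperature below are toy inputs, not HA2G's actual training code:

```python
import math

def info_nce(sims, pos_index, temperature=0.1):
    """InfoNCE-style contrastive loss for one anchor: `sims` holds the
    anchor's similarity to every candidate and `pos_index` marks the
    aligned (positive) pair. Minimizing this pulls aligned audio-text
    pairs together and pushes misaligned ones apart."""
    logits = [s / temperature for s in sims]
    mx = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - mx) for l in logits]
    return -math.log(exps[pos_index] / sum(exps))

# Loss is low when the aligned pair is the most similar candidate...
loss_good = info_nce([0.9, 0.1, 0.0], pos_index=0)
# ...and high when a misaligned candidate scores higher.
loss_bad = info_nce([0.1, 0.9, 0.0], pos_index=0)
```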

Automatic facial expressions, gaze direction and head movements generation of a virtual agent

Two models that jointly and automatically generate the head, facial and gaze movements of a virtual agent from acoustic speech features are presented; on 15-second sequences, the encoder-decoder architecture drastically improves the perception of the generated behaviours on two criteria: coordination with speech and naturalness.

BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for Conversational Gestures Synthesis

This work builds BEAT, the largest motion capture dataset for investigating human gestures; proposes a baseline model, CaMN, in which the six modalities are modeled in a cascaded architecture for gesture synthesis; and introduces a metric, Semantic Relevance Gesture Recall (SRGR).

Pose augmentation: mirror the right way

It is demonstrated that naive mirroring for augmentation has a detrimental effect on model performance, while the proposed method of providing a virtual speaker identity embedding improves performance over no augmentation and is competitive with a model trained on an equal number of real samples.
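
For context, naive pose mirroring means reflecting joints about the sagittal plane and swapping left/right joint indices; the toy skeleton, joint ordering, and swap list below are assumptions for illustration, not the paper's data format:

```python
def mirror_pose(pose, swap_pairs):
    """Mirror a 3D pose about the sagittal (y-z) plane: negate the x
    coordinate of every joint, then swap each left/right joint pair so
    the mirrored left hand becomes the right hand, and so on."""
    mirrored = [(-x, y, z) for (x, y, z) in pose]
    for l, r in swap_pairs:
        mirrored[l], mirrored[r] = mirrored[r], mirrored[l]
    return mirrored

# Toy skeleton: index 0 = root, 1 = left hand, 2 = right hand.
pose = [(0.0, 1.0, 0.0), (-0.5, 1.2, 0.1), (0.6, 1.1, 0.1)]
flipped = mirror_pose(pose, swap_pairs=[(1, 2)])
```

The paper's point is that applying this transform blindly is harmful because gesturing is handed; mirrored samples no longer match the original speaker's identity.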

Integrated Speech and Gesture Synthesis

The results show that participants rate one of the proposed integrated synthesis models as being as good as the state-of-the-art pipeline system the authors compare against, in all three tests.

Speech-Driven 3D Facial Animation with Mesh Convolution

Facial animation has always been a high-profile problem in computer graphics and computer vision and has been extensively studied, but speech-driven, realistic 3D face animation remains a challenging problem.

Synthesising 3D Facial Motion from “In-the-Wild” Speech

This paper introduces the first methodology for 3D facial motion synthesis from speech captured in arbitrary recording conditions (“in-the-wild”) and independent of the speaker, and demonstrates the ability of a deep learning model to synthesise 3D faces for different speakers and continuous speech signals in uncontrolled conditions.

End-to-end Learning for 3D Facial Animation from Speech

A deep learning framework for real-time speech-driven 3D facial animation that automatically estimates the emotional intensity of the speaker and reproduces their ever-changing affective states by adjusting the strength of related facial unit activations.

Gesture generation with low-dimensional embeddings

A novel machine learning approach is presented that decomposes the overall learning problem into learning two mappings: from speech to a gestural annotation, and from gestural annotation to gesture motion.

Audio-driven facial animation by joint end-to-end learning of pose and emotion

This work presents a machine learning technique for driving 3D facial animation by audio input in real time and with low latency, and simultaneously discovers a compact, latent code that disambiguates the variations in facial expression that cannot be explained by the audio alone.

Multi-objective adversarial gesture generation

This work explores the use of a generative adversarial training paradigm to map speech to 3D gesture motion in a series of smaller sub-problems, including plausible gesture dynamics, realistic joint configurations, and diverse and smooth motion.

Evaluation of Speech-to-Gesture Generation Using Bi-Directional LSTM Network

A novel framework to automatically generate natural gesture motions accompanying speech from audio utterances, based on a bi-directional LSTM network that regresses a full 3D skeletal pose of a human from perceptual features extracted from the input audio at each time step.

Predicting Co-verbal Gestures: A Deep and Temporal Modeling Approach

A gestural sign scheme to facilitate supervised learning and the DCNF model, which jointly learns deep neural networks and a second-order linear-chain temporal contingency, are presented, showing significant improvement over previous work on gesture prediction.

Talking With Hands 16.2M: A Large-Scale Dataset of Synchronized Body-Finger Motion and Audio for Conversational Motion Analysis and Synthesis

The qualitative user study shows that the finger motion generated by the novel real-time finger motion synthesis method is perceived as natural and conversation enhancing, while the quantitative ablation study demonstrates the effectiveness of IK loss.

Video-audio driven real-time facial animation

A real-time facial tracking and animation system based on a Kinect sensor with video and audio input that efficiently fuses visual and acoustic information for 3D facial performance capture and generates more accurate 3D mouth motions than other approaches that are based on audio or video input only.

Style‐Controllable Speech‐Driven Gesture Synthesis Using Normalising Flows

This paper proposes a new generative model, called MoGlow, for generating state‐of‐the‐art realistic speech‐driven gesticulation, and demonstrates the ability to exert directorial control over the output style, such as gesture level, speed, symmetry and spatial extent.
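
Normalising flows such as MoGlow are built from invertible steps; a minimal affine coupling layer, with a toy stand-in for the learned parameter network (all values and function names here are illustrative, not MoGlow's architecture), can be sketched as:

```python
import math

def toy_net(x1):
    """Stand-in for the learned network that maps the untouched half of
    the input to affine scale/shift parameters (here: simple sums)."""
    return 0.1 * sum(x1), 0.05 * sum(x1)

def coupling_forward(x):
    """Affine coupling: the first half of x passes through unchanged and
    parameterizes an affine transform of the second half."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    s, t = toy_net(x1)
    return x1 + [v * math.exp(s) + t for v in x2]

def coupling_inverse(y):
    """Exact inverse: since y1 == x1, the scale/shift can be recomputed
    and the transform undone. This invertibility is what lets flows
    evaluate exact likelihoods while still sampling diverse motion."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    s, t = toy_net(y1)
    return y1 + [(v - t) * math.exp(-s) for v in y2]

x = [0.5, -1.0, 2.0, 0.25]
x_back = coupling_inverse(coupling_forward(x))
```

Stacking many such layers (with permutations between them) yields an expressive yet exactly invertible mapping from noise to gesture trajectories.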