Speech2AffectiveGestures: Synthesizing Co-Speech Gestures with Generative Adversarial Affective Expression Learning

Authors: Uttaran Bhattacharya, Elizabeth Childs, Nicholas Rewkowski, Dinesh Manocha
Published in: Proceedings of the 29th ACM International Conference on Multimedia
We present a generative adversarial network to synthesize 3D pose sequences of co-speech upper-body gestures with appropriate affective expressions. Our network consists of two components: a generator to synthesize gestures from a joint embedding space of features encoded from the input speech and the seed poses, and a discriminator to distinguish between the synthesized pose sequences and real 3D pose sequences. We leverage the Mel-frequency cepstral coefficients and the text transcript… 
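The generator/discriminator split described in the abstract can be sketched in a few lines of NumPy. The layer sizes, the random projection weights, and the simple concatenation used for the joint embedding are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not the paper's values).
T, N_MFCC, N_JOINTS = 64, 13, 10   # frames, MFCC bins, upper-body joints
D_EMB = 32                         # joint embedding size

# Toy "encoders": random linear maps from speech features and seed poses
# into a shared embedding space.
W_speech = rng.standard_normal((N_MFCC, D_EMB)) * 0.1
W_pose = rng.standard_normal((N_JOINTS * 3, D_EMB)) * 0.1
W_dec = rng.standard_normal((2 * D_EMB, N_JOINTS * 3)) * 0.1
W_disc = rng.standard_normal((N_JOINTS * 3, 1)) * 0.1

def generator(mfcc, seed_pose):
    """Map per-frame MFCCs plus a seed pose to a 3D pose sequence."""
    speech_emb = np.tanh(mfcc @ W_speech)                   # (T, D_EMB)
    pose_emb = np.tanh(seed_pose.reshape(1, -1) @ W_pose)   # (1, D_EMB)
    joint = np.concatenate(
        [speech_emb, np.repeat(pose_emb, len(mfcc), axis=0)], axis=1)
    return (joint @ W_dec).reshape(len(mfcc), N_JOINTS, 3)

def discriminator(pose_seq):
    """Score a pose sequence: mean sigmoid over per-frame scores."""
    logits = pose_seq.reshape(len(pose_seq), -1) @ W_disc
    return float(np.mean(1.0 / (1.0 + np.exp(-logits))))

mfcc = rng.standard_normal((T, N_MFCC))
seed = rng.standard_normal((N_JOINTS, 3))
fake = generator(mfcc, seed)
score = discriminator(fake)
```

In adversarial training, the discriminator would be shown both `fake` and motion-captured pose sequences, and the generator would be updated to raise the discriminator's score on its output.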


A Motion Matching-based Framework for Controllable Gesture Synthesis from Speech

This work proposes an approach for generating controllable 3D gestures that combines the advantages of database matching and deep generative modeling, and proposes a conditional Generative Adversarial Network model that provides a data-driven refinement to the k-NN result by comparing its plausibility against ground-truth audio–gesture pairs.
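The database-matching half of this pipeline amounts to a nearest-neighbour lookup over paired audio and gesture clips; a sketch, with toy random data standing in for a real mocap database (the learned cGAN refinement is not shown):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy database of paired (audio feature, gesture clip) examples.
N, D_AUDIO, CLIP_LEN, D_POSE = 100, 8, 30, 15
db_audio = rng.standard_normal((N, D_AUDIO))
db_gesture = rng.standard_normal((N, CLIP_LEN, D_POSE))

def knn_match(query, k=3):
    """Return indices of the k database clips whose audio features
    are closest (Euclidean) to the query."""
    dists = np.linalg.norm(db_audio - query, axis=1)
    return np.argsort(dists)[:k]

# A query very close to database entry 42 should retrieve it first.
query = db_audio[42] + 0.01 * rng.standard_normal(D_AUDIO)
idx = knn_match(query, k=3)
best_clip = db_gesture[idx[0]]   # candidate to be refined by the cGAN
```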

BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for Conversational Gestures Synthesis

This work builds the largest motion-capture dataset for investigating human gestures, BEAT; proposes a baseline model, CaMN, which models six modalities in a cascaded architecture for gesture synthesis; and introduces a metric, Semantic Relevance Gesture Recall (SRGR).

Agree or Disagree? Generating Body Gestures from Affective Contextual Cues during Dyadic Interactions

This paper proposes a method based on conditional Generative Adversarial Networks, intending to generate behaviours for a robot in affective dyadic interactions under agreement and disagreement scenarios, and shows that Context Encoder can better contribute to the prediction of co-speech gestures in agreement situations.

A Comprehensive Review of Data-Driven Co-Speech Gesture Generation

Key research challenges in gesture generation are identified, including data availability and quality; producing human-like motion; grounding the gesture in the co-occurring speech in interaction with other speakers, and in the environment; performing gesture evaluation; and integration of gesture synthesis into applications.

Rhythmic Gesticulator

A novel co-speech gesture synthesis method that achieves convincing results both on the rhythm and semantics, and builds correspondence between the hierarchical embeddings of the speech and the motion, resulting in rhythm- and semantics-aware gesture synthesis.

DisCo: Disentangled Implicit Content and Rhythm Learning for Diverse Co-Speech Gestures Synthesis

This work presents DisCo, which disentangles motion into implicit content and rhythm features by contrastive loss for adopting different data balance strategies, and designs a diversity-and-inclusion network (DIN), which firstly generates content features candidates and then selects one candidate by learned voting.

Audio-Driven Stylized Gesture Generation with Flow-Based Model

A new end-to-end flow-based model is proposed, which can generate audio-driven gestures of arbitrary styles with neither preprocessing nor style labels, and outperforms state-of-the-art models.

TEMOS: Generating diverse human motions from textual descriptions

This work proposes TEMOS, a text-conditioned generative model leveraging variational autoencoder (VAE) training with human motion data, in combination with a text encoder that produces distribution parameters compatible with the VAE latent space.

The DeepMotion entry to the GENEA Challenge 2022

This paper proposes a two-stage model to address the uncertainty issue in gesture synthesis; user evaluation results show the proposed method produces gesture motions with reasonable human-likeness and gesture appropriateness.

The IVI Lab entry to the GENEA Challenge 2022 – A Tacotron2 Based Method for Co-Speech Gesture Generation With Locality-Constraint Attention Mechanism

The gesture generation problem is formulated as a sequence-to-sequence conversion task with text, audio, and speaker identity as inputs and the body motion as the output and the result indicates that the motion distribution of the generated gestures is much closer to the distribution of natural gestures.

Learning Unseen Emotions from Gestures via Semantically-Conditioned Zero-Shot Perception with Adversarial Autoencoders

This work presents adversarial autoencoder-based representation learning that correlates 3D motion-captured gesture sequences with vectorized representations of natural-language perceived-emotion terms using word2vec embeddings.
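The zero-shot step of correlating a gesture embedding with word vectors for emotion terms can be sketched with cosine similarity. The 4-dimensional "word vectors" below are toy stand-ins for real word2vec embeddings, and the gesture embedding is assumed to come from a trained encoder:

```python
import numpy as np

# Toy stand-ins for word2vec vectors of emotion terms (assumed, 4-d).
emotion_vecs = {
    "happy": np.array([0.9, 0.1, 0.0, 0.2]),
    "sad":   np.array([-0.8, 0.2, 0.1, 0.0]),
    "angry": np.array([0.1, -0.9, 0.3, 0.1]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot_label(gesture_emb):
    """Pick the emotion term whose word vector best matches the gesture
    embedding -- usable even for terms unseen during training."""
    return max(emotion_vecs,
               key=lambda w: cosine(gesture_emb, emotion_vecs[w]))

# A gesture embedding that the encoder mapped near the "sad" direction.
emb = np.array([-0.7, 0.25, 0.05, 0.05])
label = zero_shot_label(emb)
```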

Evaluation of Speech-to-Gesture Generation Using Bi-Directional LSTM Network

A novel framework to automatically generate natural gesture motions accompanying speech from audio utterances based on a Bi-Directional LSTM Network that regresses a full 3D skeletal pose of a human from perceptual features extracted from the input audio in each time step.
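The per-frame regression from audio features to a skeletal pose using bidirectional context can be sketched with a vanilla tanh RNN cell standing in for the LSTM; all weights and dimensions here are random and illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(2)
D_IN, D_H, D_POSE = 13, 16, 30   # audio features, hidden size, pose dims (assumed)

Wx = rng.standard_normal((D_IN, D_H)) * 0.1
Wh = rng.standard_normal((D_H, D_H)) * 0.1
Wo = rng.standard_normal((2 * D_H, D_POSE)) * 0.1

def rnn_pass(x):
    """One directional pass of a vanilla tanh RNN (LSTM stand-in)."""
    h = np.zeros(D_H)
    out = []
    for t in range(len(x)):
        h = np.tanh(x[t] @ Wx + h @ Wh)
        out.append(h)
    return np.stack(out)

def bidirectional_regress(audio_feats):
    """Regress one pose per frame from forward + backward context."""
    fwd = rnn_pass(audio_feats)
    bwd = rnn_pass(audio_feats[::-1])[::-1]   # run backwards, re-align
    return np.concatenate([fwd, bwd], axis=1) @ Wo

audio = rng.standard_normal((50, D_IN))
poses = bidirectional_regress(audio)
```

Concatenating the forward and backward hidden states gives each frame's prediction access to both past and future audio context, which is the point of the bidirectional design.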

Take an Emotion Walk: Perceiving Emotions from Gaits Using Hierarchical Attention Pooling and Affective Mapping

An autoencoder-based semi-supervised approach classifies perceived human emotions from walking styles, obtained from videos or motion-captured data and represented as sequences of 3D poses, outperforming current state-of-the-art algorithms for both emotion recognition and action recognition from 3D gaits by 7%–23% absolute.

Novel Realizations of Speech-Driven Head Movements with Generative Adversarial Networks

  • Najmeh Sadoughi, C. Busso
  • Computer Science
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
A conditional GAN with bidirectional long short-term memory (BLSTM), which is suitable for capturing the long- and short-term dependencies of time-continuous signals; this model is compared with a dynamic Bayesian network (DBN) and with BLSTM models optimized to reduce mean squared error (MSE) or to increase concordance correlation.

Speech gesture generation from the trimodal context of text, audio, and speaker identity

This paper presents an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to reliably generate gestures that are human-like and that match with speech content and rhythm.

Audio-driven facial animation by joint end-to-end learning of pose and emotion

This work presents a machine learning technique for driving 3D facial animation by audio input in real time and with low latency, and simultaneously discovers a compact, latent code that disambiguates the variations in facial expression that cannot be explained by the audio alone.

Gesticulator: A framework for semantically-aware speech-driven gesture generation

This work presents a model designed to produce arbitrary beat and semantic gestures together, which takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output.

Predicting Co-verbal Gestures: A Deep and Temporal Modeling Approach

A gestural sign scheme to facilitate supervised learning and the DCNF model, which jointly learns deep neural networks and a second-order linear-chain temporal contingency, are presented, showing significant improvement over previous work on gesture prediction.

Predicting Head Pose from Speech with a Conditional Variational Autoencoder

This work employs deep bi-directional LSTMs capable of learning long-term structure in language, and introduces a generative head-motion model conditioned on audio features using a Conditional Variational Autoencoder (CVAE), which mitigates the problem of the one-to-many mapping that a speech-to-head-pose model must accommodate.
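The CVAE's handling of the one-to-many mapping rests on the reparameterization trick: condition the latent distribution on the audio, then draw different latent samples to get different plausible poses for the same speech. A minimal sketch with random illustrative weights (the real model's encoders and decoder are learned networks):

```python
import numpy as np

rng = np.random.default_rng(3)
D_AUDIO, D_Z, D_POSE = 13, 8, 3   # audio features, latent size, head-pose dims (assumed)

W_mu = rng.standard_normal((D_AUDIO, D_Z)) * 0.1
W_logvar = rng.standard_normal((D_AUDIO, D_Z)) * 0.1
W_dec = rng.standard_normal((D_AUDIO + D_Z, D_POSE)) * 0.1

def sample_head_pose(audio_feat):
    """Condition the latent Gaussian on audio, then decode a pose.
    Different draws of eps yield different plausible head poses for
    the same speech -- the one-to-many mapping."""
    mu = audio_feat @ W_mu
    logvar = audio_feat @ W_logvar
    eps = rng.standard_normal(D_Z)
    z = mu + np.exp(0.5 * logvar) * eps   # reparameterization trick
    return np.concatenate([audio_feat, z]) @ W_dec

feat = rng.standard_normal(D_AUDIO)
p1, p2 = sample_head_pose(feat), sample_head_pose(feat)
```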

Multi-objective adversarial gesture generation

This work explores the use of a generative adversarial training paradigm to map speech to 3D gesture motion in a series of smaller sub-problems, including plausible gesture dynamics, realistic joint configurations, and diverse and smooth motion.