• Corpus ID: 237091850

Audio2Gestures: Generating Diverse Gestures from Speech Audio with Conditional Variational Autoencoders

@article{li2021audio2gestures,
  title={Audio2Gestures: Generating Diverse Gestures from Speech Audio with Conditional Variational Autoencoders},
  author={Jing Li and Di Kang and Wenjie Pei and Xuefei Zhe and Ying Zhang and Zhenyu He and Linchao Bao},
  journal={arXiv preprint},
  year={2021}
}
  • Published 15 August 2021
  • Computer Science
  • ArXiv
Generating conversational gestures from speech audio is challenging due to the inherent one-to-many mapping between audio and body motions. Conventional CNNs/RNNs assume a one-to-one mapping, and thus tend to predict the average of all possible target motions, resulting in plain/boring motions during inference. In order to overcome this problem, we propose a novel conditional variational autoencoder (VAE) that explicitly models the one-to-many audio-to-motion mapping by splitting the cross-modal…
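The averaging failure mode described in the abstract can be illustrated with a toy numpy sketch (not the paper's model; the decoder and mode-selection rule below are purely illustrative): a deterministic L2 regressor converges to the mean of conflicting targets, while a latent-variable model can commit to one mode per sampled code.

```python
import numpy as np

# Toy one-to-many setup: the same audio clip has two equally likely
# target motions (two distinct gesture "modes").
targets = np.array([[1.0, -1.0],
                    [-1.0, 1.0]])

# A deterministic L2-trained regressor converges to the mean of the
# targets -- an averaged, "plain" motion that matches neither mode.
mean_prediction = targets.mean(axis=0)  # -> [0., 0.]

def decode(z):
    """Toy decoder: the sign of the latent code z selects a mode."""
    return targets[0] if z >= 0.0 else targets[1]

# A latent-variable model conditions on a sampled code z, so different
# samples produce different (but individually sharp) motions.
rng = np.random.default_rng(0)
sampled_motions = [decode(z) for z in rng.standard_normal(8)]
```

Sampling several `z` values yields both modes, whereas the L2-optimal prediction is the uninformative average — this is the one-to-many ambiguity the conditional VAE is designed to resolve.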


Human Motion Prediction via Spatio-Temporal Inpainting
This work argues that the L2 metric, considered so far by most approaches, fails to capture the actual distribution of long-term human motion, and proposes two alternative metrics, based on the distribution of frequencies, that are able to capture more realistic motion patterns.
S3VAE: Self-Supervised Sequential VAE for Representation Disentanglement and Data Generation
A sequential variational autoencoder that learns disentangled representations of sequential data under self-supervision; it performs comparably to, if not better than, the fully-supervised model with ground-truth labels, and outperforms state-of-the-art unsupervised models by a large margin.
MT-VAE: Learning Motion Transformations to Generate Multimodal Human Dynamics
This work presents a novel Motion Transformation Variational Auto-Encoder (MT-VAE) for motion sequence generation that jointly learns a feature embedding for motion modes and a feature transformation that represents the transition from one motion mode to the next.
Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows
This paper proposes a new generative model, called MoGlow, for generating state-of-the-art realistic speech-driven gesticulation, and demonstrates the ability to exert directorial control over the output style, such as gesture level, speed, symmetry and spatial extent.
Real-time prosody-driven synthesis of body language
This work presents a method for automatically synthesizing body language animations directly from the participants' speech signals, without the need for additional input, suitable for animating characters from live human speech.
MoCoGAN: Decomposing Motion and Content for Video Generation
This work introduces a novel adversarial learning scheme utilizing both image and video discriminators and shows that MoCoGAN can generate videos with the same content but different motion, as well as videos with different content and the same motion.
Investigating the use of recurrent motion modelling for speech gesture generation
This work explores the use of transfer learning using previous motion modelling research to improve learning outcomes for gesture generation from speech, using a recurrent network with an encoder-decoder structure that takes in prosodic speech features and generates a short sequence of gesture motion.
Mode Seeking Generative Adversarial Networks for Diverse Image Synthesis
This work proposes a simple yet effective regularization term to address the mode collapse issue for cGANs and explicitly maximizes the ratio of the distance between generated images with respect to the corresponding latent codes, thus encouraging the generators to explore more minor modes during training.
Character controllers using motion VAEs
This work uses deep reinforcement learning to learn controllers that achieve goal-directed movements in data-driven generative models of human movement using autoregressive conditional variational autoencoders, or Motion VAEs.
Learning Individual Styles of Conversational Gesture
A method for cross-modal translation from "in-the-wild" monologue speech of a single speaker to their conversational gesture motion is presented and significantly outperforms baseline methods in a quantitative comparison.