Transflower: probabilistic autoregressive dance generation with multimodal attention

Authors: Guillermo Valle Pérez, Gustav Eje Henter, Jonas Beskow, Andre Holzapfel, Pierre-Yves Oudeyer, Simon Alexanderson
Journal: ACM Trans. Graph.
composition of movements follow rhythmic, tonal and timbral features of music. Formally, generating dance conditioned on a piece of music can be expressed as a problem of modelling a high-dimensional continuous motion signal, conditioned on an audio signal. In this work we make two contributions to tackle this problem. First, we present a novel probabilistic autoregressive architecture that models the distribution over future poses with a normalizing flow conditioned on previous poses as well… 
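The idea of modelling the distribution over the next pose with a flow conditioned on previous poses and music can be illustrated with a toy example. The following is only a minimal sketch, not the paper's architecture: it uses a single conditional affine transformation, fixed random weights in place of a trained conditioning network, and hypothetical names and sizes (POSE_DIM, MUSIC_DIM, condition, sample_pose, log_prob, generate) chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

POSE_DIM = 4    # hypothetical pose size (real skeletons are far larger)
MUSIC_DIM = 3   # hypothetical per-frame audio feature size
CTX_DIM = POSE_DIM + MUSIC_DIM

# Assumption: fixed random weights stand in for a trained conditioning network.
W_mu = rng.normal(scale=0.1, size=(CTX_DIM, POSE_DIM))
W_s = rng.normal(scale=0.1, size=(CTX_DIM, POSE_DIM))


def condition(prev_pose, music):
    """Map the context (previous pose, current music features) to shift and log-scale."""
    ctx = np.concatenate([prev_pose, music])
    return ctx @ W_mu, np.tanh(ctx @ W_s)


def sample_pose(prev_pose, music):
    """Draw z ~ N(0, I) and push it through the conditional affine map."""
    mu, log_scale = condition(prev_pose, music)
    z = rng.standard_normal(POSE_DIM)
    return mu + np.exp(log_scale) * z


def log_prob(pose, prev_pose, music):
    """Exact log-density of a pose via the change-of-variables formula."""
    mu, log_scale = condition(prev_pose, music)
    z = (pose - mu) * np.exp(-log_scale)
    log_base = -0.5 * np.sum(z**2) - 0.5 * POSE_DIM * np.log(2 * np.pi)
    return log_base - np.sum(log_scale)


def generate(music_seq, init_pose):
    """Autoregressive rollout: each sampled pose becomes context for the next step."""
    poses = [init_pose]
    for music in music_seq:
        poses.append(sample_pose(poses[-1], music))
    return np.stack(poses[1:])


music_seq = rng.normal(size=(5, MUSIC_DIM))
dance = generate(music_seq, np.zeros(POSE_DIM))
print(dance.shape)
```

Because every step samples from a distribution rather than regressing a single pose, repeated rollouts over the same music give different motion sequences, while log_prob still gives an exact likelihood that could serve as a training objective.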

Dance Style Transfer with Cross-modal Transformer

The method extends an existing CycleGAN architecture for modeling audio sequences, integrates multimodal transformer encoders to account for music context, and adopts sequence length-based curriculum learning to stabilize training.

ChoreoGraph: Music-conditioned Automatic Dance Choreography over a Style and Tempo Consistent Dynamic Graph

ChoreoGraph is proposed, which choreographs high-quality dance motion for a given piece of music over a Dynamic Graph. It is demonstrated that this repertoire-based framework can generate motions with aesthetic consistency and is robustly extensible in diversity.

DeepPhase: periodic autoencoders for learning motion phase manifolds

It is demonstrated that the learned periodic embedding can significantly help to improve neural motion synthesis in a number of tasks, including diverse locomotion skills, style-based movements, dance motion synthesis from music, synthesis of dribbling motions in football, and motion query for matching poses within large animation databases.

Rhythmic Gesticulator: Rhythm-Aware Co-Speech Gesture Synthesis with Hierarchical Neural Embeddings

We devise a mechanism to effectively disentangle both low- and high-level neural embeddings of speech and motion based on linguistic theory. The high-level embedding corresponds to semantics, while

Multi-Scale Cascaded Generator for Music-driven Dance Synthesis

A multi-scale cascaded music-driven dance synthesis network (MC-MDSN) is proposed that first generates large-scale body motions conditioned on music and then refines local small-scale joint motions. A multi-scale feature loss is designed to capture the dynamic characteristics of motion at each scale and the relations between joints at different scales.

ZeroEGGS: Zero-shot Example-based Gesture Generation from Speech

We present ZeroEGGS, a neural network framework for speech-driven gesture generation with zero-shot style control by example. This means style can be controlled via only a short example motion clip,

TEACH: Temporal Action Composition for 3D Humans

An approach to enable the synthesis of a series of actions, called TEACH for “TEmporal Action Compositions for Human motions”, produces realistic human motions for a wide variety of actions and temporal compositions from language descriptions.


Music-to-Dance Generation with Multiple Conformer

Quantitative and qualitative experimental results on the publicly available music-to-dance dataset demonstrate the proposed novel autoregressive generative framework improves greatly upon the baselines and can generate long-term coherent dance motions well-coordinated with the music.

Adversarial Attention for Human Motion Synthesis

This work presents a novel method for controllable human motion synthesis by applying attention-based probabilistic deep adversarial models with end-to-end training and shows that it can generate synthetic human motion over both short- and long-time horizons through the use of adversarial attention.



Learn to Dance with AIST++: Music Conditioned 3D Dance Generation

A transformer-based learning framework for 3D dance generation conditioned on music that combines a deep cross-modal transformer, which learns the correlation between music and dance motion well, with a full-attention, future-N supervision mechanism that is essential for producing long-range non-freezing motion.

Attention is All you Need

The Transformer, a new simple network architecture based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed. It generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.

MoGlow: Probabilistic and controllable motion synthesis using normalising flows

Data-driven modelling and synthesis of motion is an active research area with applications that include animation, games, and social robotics. This paper introduces a new class of probabilistic,

DanceNet3D: Music Based Dance Generation with Parametric Motion Transformer

Extensive experiments demonstrate that the proposed approach can generate fluent, elegant, performative and beat-synchronized 3D dances, which significantly surpasses previous works quantitatively and qualitatively.

Predicting Video with VQVAE

This paper proposes a novel approach to video prediction with Vector Quantized Variational AutoEncoders (VQ-VAE), which compress high-resolution videos into a hierarchical set of multi-scale discrete latent variables, allowing it to apply scalable autoregressive generative models to predict video.

GrooveNet: Real-Time Music-Driven Dance Movement Generation using Artificial Neural Networks

The preliminary results of GrooveNet, a generative system that learns to synthesize dance movements for a given audio track in real time, are presented, and the plans to further develop GrooveNet are outlined.

Auto-Conditioned Recurrent Networks for Extended Complex Human Motion Synthesis

To the authors' knowledge, this work is the first to demonstrate the ability to generate over 18,000 continuous frames (300 seconds) of new complex human motion with respect to different styles.

The Future is Now: Live Breakdance Battles in VR Are Connecting People Across the Globe

  • 2021

The Deep Learning Framework

Multilevel rhythms in multimodal communication

An integrative overview is given of converging findings that show how multimodal processes occurring at neural, bodily, and social interactional levels each contribute uniquely to the complex rhythms that characterize communication in human and non-human animals.