Transflower: probabilistic autoregressive dance generation with multimodal attention

@article{valleperez2021transflower,
  title={Transflower: probabilistic autoregressive dance generation with multimodal attention},
  author={Guillermo Valle P{\'e}rez and Gustav Eje Henter and Jonas Beskow and Andr{\'e} Holzapfel and Pierre-Yves Oudeyer and Simon Alexanderson},
  journal={ACM Trans. Graph.},
  year={2021}
}
The composition of dance movements follows rhythmic, tonal and timbral features of music. Formally, generating dance conditioned on a piece of music can be expressed as a problem of modelling a high-dimensional continuous motion signal, conditioned on an audio signal. In this work we make two contributions to tackle this problem. First, we present a novel probabilistic autoregressive architecture that models the distribution over future poses with a normalizing flow conditioned on previous poses as well…
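The core modelling idea in the abstract, a normalizing flow over the next pose conditioned on a context vector summarizing previous poses and music, can be sketched as follows. This is a toy single-step affine flow in plain NumPy; the dimensions, names, and the tiny linear "encoder" are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

POSE_DIM, CTX_DIM = 6, 8

# Toy "encoder": maps a context vector (previous poses + music
# features) to the parameters of a single affine flow step.
W_mu = rng.normal(scale=0.1, size=(CTX_DIM, POSE_DIM))
W_logs = rng.normal(scale=0.1, size=(CTX_DIM, POSE_DIM))

def sample_next_pose(ctx):
    """Draw a pose from the flow: x = mu(ctx) + exp(log_s(ctx)) * z."""
    mu = ctx @ W_mu
    log_s = ctx @ W_logs
    z = rng.normal(size=POSE_DIM)        # sample from the Gaussian base
    return mu + np.exp(log_s) * z

def log_prob(x, ctx):
    """Exact log-density via the change-of-variables formula."""
    mu = ctx @ W_mu
    log_s = ctx @ W_logs
    z = (x - mu) * np.exp(-log_s)        # invert the flow
    log_base = -0.5 * np.sum(z**2 + np.log(2 * np.pi))
    return log_base - np.sum(log_s)      # minus log|det Jacobian|

ctx = rng.normal(size=CTX_DIM)
x = sample_next_pose(ctx)
print(log_prob(x, ctx))
```

Because the flow is invertible, the same model supports both sampling (for generation) and exact likelihood evaluation (for training); deep flows stack many such conditioned steps.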


Dance Style Transfer with Cross-modal Transformer

The method extends an existing CycleGAN architecture for modeling audio sequences, integrates multimodal transformer encoders to account for music context, and adopts sequence-length-based curriculum learning to stabilize training.

ChoreoGraph: Music-conditioned Automatic Dance Choreography over a Style and Tempo Consistent Dynamic Graph

ChoreoGraph is proposed, which choreographs high-quality dance motion for a given piece of music over a dynamic graph; it outperforms other baseline models and can generate motions that are aesthetically consistent and robustly extensible in diversity.

DeepPhase: periodic autoencoders for learning motion phase manifolds

It is demonstrated that the learned periodic embedding can significantly help to improve neural motion synthesis in a number of tasks, including diverse locomotion skills, style-based movements, dance motion synthesis from music, synthesis of dribbling motions in football, and motion query for matching poses within large animation databases.

UDE: A Unified Driving Engine for Human Motion Generation

This paper proposes UDE, the first unified driving engine that enables generating human motion sequences from natural language or audio sequences. The method is evaluated on the HumanML3D and AIST++ benchmarks, where it achieves state-of-the-art performance.

EDGE: Editable Dance Generation From Music

This work introduces Editable Dance GEneration (EDGE), a state-of-the-art method for editable dance generation that is capable of creating realistic, physically-plausible dances while remaining faithful to the input music.

Listen, denoise, action! Audio-driven motion synthesis with diffusion models

Diffusion models are shown to be excellent for synthesising human motion that co-occurs with audio, for example co-speech gesticulation, since motion is complex and highly ambiguous given audio, calling for a probabilistic description.

Exemplar-based Stylized Gesture Generation from Speech: An Entry to the GENEA Challenge 2022

The model is a neural network that generates gesture animation from an input audio file; motion is embedded in a latent space using a variational framework, addressing the stochastic nature of gesture motion.

Rhythmic Gesticulator

A novel co-speech gesture synthesis method that achieves convincing results on both rhythm and semantics; it builds correspondence between the hierarchical embeddings of the speech and the motion, resulting in rhythm- and semantics-aware gesture synthesis.

Multi-Scale Cascaded Generator for Music-driven Dance Synthesis

A multi-scale cascaded music-driven dance synthesis network (MC-MDSN) first generates large-scale body motions conditioned on music and then refines local small-scale joint motions; a multi-scale feature loss is designed to capture the dynamic characteristics of motion at each scale and the relations between motion joints across scales.

ZeroEGGS: Zero-shot Example-based Gesture Generation from Speech

We present ZeroEGGS, a neural network framework for speech-driven gesture generation with zero-shot style control by example. This means style can be controlled via only a short example motion clip.

Learn to Dance with AIST++: Music Conditioned 3D Dance Generation

A transformer-based learning framework for 3D dance generation conditioned on music that combines a deep cross-modal transformer, which learns the correlation between music and dance motion, with a full-attention, future-N supervision mechanism that is essential for producing long-range non-freezing motion.

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by its successful application to English constituency parsing with both large and limited training data.
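Since several of the works above build on this architecture, the scaled dot-product attention at its core is worth recalling. A minimal NumPy sketch (shapes and variable names are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                            # weighted sum of values

rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

In a multimodal transformer such as Transflower's encoder, queries can come from one stream (motion) while keys and values come from another (music), which is how cross-modal context enters the model.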

MoGlow: Probabilistic and controllable motion synthesis using normalising flows

Data-driven modelling and synthesis of motion is an active research area with applications that include animation, games, and social robotics. This paper introduces a new class of probabilistic, controllable motion-synthesis models based on normalising flows.

DanceNet3D: Music Based Dance Generation with Parametric Motion Transformer

Extensive experiments demonstrate that the proposed approach can generate fluent, elegant, performative and beat-synchronized 3D dances, significantly surpassing previous works both quantitatively and qualitatively.

Predicting Video with VQVAE

This paper proposes a novel approach to video prediction with Vector Quantized Variational AutoEncoders (VQ-VAE), which compress high-resolution videos into a hierarchical set of multi-scale discrete latent variables, making it possible to apply scalable autoregressive generative models to predict video.
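The discretization step central to VQ-VAE, snapping each continuous encoder output to its nearest codebook vector, can be sketched as follows (a toy NumPy version; the codebook size and latent dimension are illustrative):

```python
import numpy as np

def vector_quantize(latents, codebook):
    """Map each continuous latent to its nearest codebook entry."""
    # Pairwise squared distances: (n_latents, n_codes)
    d = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)        # discrete code index per latent
    return idx, codebook[idx]     # indices + quantized latents

rng = np.random.default_rng(2)
codebook = rng.normal(size=(16, 4))   # 16 learned code vectors
latents = rng.normal(size=(5, 4))     # continuous encoder outputs
codes, quantized = vector_quantize(latents, codebook)
```

The resulting integer codes are what an autoregressive prior models; the decoder only ever sees the quantized vectors `codebook[codes]`.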

GrooveNet : Real-Time Music-Driven Dance Movement Generation using Artificial Neural Networks

The preliminary results of GrooveNet, a generative system that learns to synthesize dance movements for a given audio track in real time, are presented, and plans to further develop GrooveNet are outlined.

Auto-Conditioned Recurrent Networks for Extended Complex Human Motion Synthesis

This work is, to the authors' knowledge, the first to demonstrate the ability to generate over 18,000 continuous frames (300 seconds) of new complex human motion with respect to different styles.

The Future is Now: Live Breakdance Battles in VR Are Connecting People Across the Globe

  • 2021

The Deep Learning Framework

Multilevel rhythms in multimodal communication

An integrative overview is provided of converging findings that show how multimodal processes occurring at neural, bodily, and social interactional levels each contribute uniquely to the complex rhythms that characterize communication in human and non-human animals.