Programmatic Concept Learning for Human Motion Description and Synthesis

@article{Kulal2022ProgrammaticCL,
  title={Programmatic Concept Learning for Human Motion Description and Synthesis},
  author={Sumith Kulal and Jiayuan Mao and Alex Aiken and Jiajun Wu},
  journal={ArXiv},
  year={2022},
  volume={abs/2206.13502}
}
We introduce Programmatic Motion Concepts, a hierarchical motion representation for human actions that captures both low-level motion and high-level description as motion concepts. This representation enables human motion description, interactive editing, and controlled synthesis of novel video sequences within a single framework. We present an architecture that learns this concept representation from paired video and action sequences in a semi-supervised manner. The compactness of our… 

References

A Recurrent Variational Autoencoder for Human Motion Synthesis

TLDR
Proposes a novel generative model of human motion that can be trained on a large motion capture dataset and lets users produce animations from high-level control signals; it predicts the movements of the human body over long horizons more accurately than state-of-the-art methods.

Hierarchical Motion Understanding via Motion Programs

TLDR
Introduces Motion Programs, a neuro-symbolic, program-like representation that expresses motions as a composition of high-level primitives; the representation benefits downstream tasks such as video interpolation and video prediction and outperforms off-the-shelf models.

Action-Conditioned 3D Human Motion Synthesis with Transformer VAE

TLDR
This work designs a Transformer-based architecture, ACTOR, for encoding and decoding a sequence of parametric SMPL human body models estimated from action recognition datasets, and learns an action-aware latent representation for human motions by training a generative variational autoencoder (VAE).

Rethinking the Faster R-CNN Architecture for Temporal Action Localization

TLDR
TAL-Net is proposed, an improved approach to temporal action localization in video that is inspired by the Faster R-CNN object detection framework and achieves state-of-the-art performance for both action proposal and localization on the THUMOS'14 detection benchmark and competitive performance on the ActivityNet challenge.

On Human Motion Prediction Using Recurrent Neural Networks

TLDR
It is shown that, surprisingly, state-of-the-art performance can be achieved by a simple baseline that does not attempt to model motion at all, and a simple and scalable RNN architecture is proposed that obtains state-of-the-art performance on human motion prediction.

Generating Animated Videos of Human Activities from Natural Language Descriptions

TLDR
This paper introduces a system that maps a natural language description to an animation of a humanoid skeleton; the system is a sequence-to-sequence model that is pretrained with an autoencoder objective and then trained end-to-end.

Action2Motion: Conditioned Generation of 3D Human Motions

TLDR
This paper aims to generate plausible human motion sequences in 3D given a prescribed action type, and proposes a temporal Variational Auto-Encoder (VAE) that encourages a diverse sampling of the motion space.

Connectionist Temporal Modeling for Weakly Supervised Action Labeling

TLDR
The Extended Connectionist Temporal Classification (ECTC) framework is introduced to efficiently evaluate all possible alignments via dynamic programming and explicitly enforce their consistency with frame-to-frame visual similarities.

First Order Motion Model for Image Animation

TLDR
This framework decouples appearance and motion information using a self-supervised formulation and uses a representation consisting of a set of learned keypoints along with their local affine transformations to support complex motions.

Language2Pose: Natural Language Grounded Pose Forecasting

TLDR
This paper introduces a neural architecture called Joint Language-to-Pose (or JL2P), which learns a joint embedding of language and pose and evaluates the proposed model on a publicly available corpus of 3D pose data and human-annotated sentences.