Action2video: Generating Videos of Human 3D Actions

  title={Action2video: Generating Videos of Human 3D Actions},
  author={Chuan Guo and Xinxin Zuo and Sen Wang and Xinshuang Liu and Shihao Zou and Minglun Gong and Li Cheng},
  journal={International Journal of Computer Vision},
  • Published 12 November 2021
We aim to tackle the interesting yet challenging problem of generating videos of diverse and natural human motions from prescribed action categories. The key issue lies in the ability to synthesize multiple distinct motion sequences that are realistic in their visual appearance. This is achieved in this paper by a two-step process that maintains internal 3D pose and shape representations: action2motion and motion2video. Action2motion stochastically generates plausible 3D pose sequences of a… 
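The two-step design described above first maps an action label to a stochastic 3D pose sequence, then renders that sequence into video frames. A minimal sketch of such a staged pipeline is below; all function names, the random-walk "generator," and the point-splatting "renderer" are placeholders for illustration, not the authors' actual models.

```python
import numpy as np

rng = np.random.default_rng(0)

def action2motion(action_id: int, n_frames: int = 60, n_joints: int = 22) -> np.ndarray:
    """Hypothetical stage 1: stochastically sample a 3D pose sequence
    (frames x joints x 3) conditioned on an action category.
    A random walk stands in for the learned generative model."""
    start = rng.normal(size=(n_joints, 3))                    # initial pose
    steps = rng.normal(scale=0.05, size=(n_frames, n_joints, 3))
    return start + np.cumsum(steps, axis=0)

def motion2video(poses: np.ndarray, h: int = 64, w: int = 64) -> np.ndarray:
    """Hypothetical stage 2: render each pose into an RGB frame.
    A real system would add body shape and appearance; here joints
    are simply splatted onto a blank canvas as a placeholder."""
    frames = np.zeros((len(poses), h, w, 3), dtype=np.uint8)
    for t, pose in enumerate(poses):
        xy = ((pose[:, :2] - pose[:, :2].min(0)) /
              (np.ptp(pose[:, :2], axis=0) + 1e-8) * [w - 1, h - 1]).astype(int)
        frames[t, xy[:, 1], xy[:, 0]] = 255                   # white joint dots
    return frames

video = motion2video(action2motion(action_id=3))
print(video.shape)  # (60, 64, 64, 3)
```

The point of the staging is that stage 1 can be sampled repeatedly to obtain multiple distinct motions for the same action label, while stage 2 handles visual appearance independently.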

Generating Diverse and Natural 3D Human Motions from Text

  • Chuan Guo, Shihao Zou, Li Cheng
  • 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
This work proposes motion snippet code as an internal motion representation, which captures local semantic motion contexts and is empirically shown to facilitate the generation of plausible motions faithful to the input text.

TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts

This paper explores the generation of 3D human full-body motions from texts, as well as its reciprocal task, shorthanded as text2motion and motion2text respectively, with the use of motion tokens, a discrete and compact motion representation.

Modiff: Action-Conditioned 3D Motion Generation with Denoising Diffusion Probabilistic Models

Modiff is a pioneering attempt that uses DDPM to synthesize a variable number of motion sequences conditioned on a categorical action and evaluates the approach on the large-scale NTU RGB+D dataset and shows improvements over state-of-the-art motion generation methods.
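Conditioning a DDPM on an action category follows the standard denoising recipe: noise a clean motion clip in the forward process, then iteratively denoise with a (conditioned) noise predictor. The sketch below shows the textbook forward-noising and one reverse step; it is a generic DDPM illustration, not Modiff's code, and the "predicted noise" is supplied directly rather than by a trained network.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 100
betas = np.linspace(1e-4, 0.02, T)     # linear noise schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def q_sample(x0, t):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps, eps

def p_sample(x_t, t, eps_hat):
    """One reverse step given predicted noise eps_hat (in Modiff-style
    training, an action-conditioned network would predict this)."""
    coef = betas[t] / np.sqrt(1 - alpha_bar[t])
    mean = (x_t - coef * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean
    return mean + np.sqrt(betas[t]) * rng.normal(size=x_t.shape)

x0 = rng.normal(size=(60, 22, 3))      # a motion clip: frames x joints x 3
x_t, eps = q_sample(x0, T - 1)         # fully noised clip
x_prev = p_sample(x_t, T - 1, eps)     # one denoising step with oracle noise
print(x_prev.shape)  # (60, 22, 3)
```

Because sampling starts from fresh Gaussian noise each time, the same action label naturally yields a variable set of distinct motion sequences.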

CLIP-Actor: Text-Driven Recommendation and Stylization for Animating Human Meshes

It is demonstrated that CLIP-Actor produces plausible, human-recognizable stylized 3D human meshes in motion, with detailed geometry and texture, solely from a natural language prompt.

FLAG3D: A 3D Fitness Activity Dataset with Language Instruction

FLAG3D is presented, a large-scale 3D fitness activity dataset with language instruction containing 180K sequences of 60 categories; it offers substantial research value for various challenges, such as cross-domain human action recognition, dynamic human mesh recovery, and language-guided human action generation.

VIBE: Video Inference for Human Body Pose and Shape Estimation

This work defines a novel temporal network architecture with a self-attention mechanism and shows that adversarial training, at the sequence level, produces kinematically plausible motion sequences without in-the-wild ground-truth 3D labels.

Pose Guided Human Video Generation

A pose-guided method to synthesize human videos in a disentangled way, separating plausible motion prediction from coherent appearance generation, while enforcing semantic consistency between the generated and ground-truth poses at a high feature level.

Liquid Warping GAN: A Unified Framework for Human Motion Imitation, Appearance Transfer and Novel View Synthesis

A 3D body mesh recovery module is proposed to disentangle pose and shape, which not only models joint locations and rotations but also characterizes the personalized body shape, supporting more flexible warping from multiple sources.

Hierarchical Style-based Networks for Motion Synthesis

This paper proposes a self-supervised method for generating long-range, diverse and plausible behaviors to achieve a specific goal location through bi-linear transformation modelling, and demonstrates the generated sequences are useful as subgoals for actual physical execution in the animated world.

Deep Video Generation, Prediction and Completion of Human Action Sequences

This paper proposes a general, two-stage deep framework to generate human action videos with no constraints or an arbitrary number of constraints, which uniformly addresses three problems: video generation given no input frames, video prediction given the first few frames, and video completion given the first and last frames.

Animating Arbitrary Objects via Deep Motion Transfer

This paper introduces a novel deep learning framework for image animation that generates a video in which the target object is animated according to the driving sequence through a deep architecture that decouples appearance and motion information.

Convolutional Sequence Generation for Skeleton-Based Action Synthesis

The results show that the proposed framework, named Convolutional Sequence Generation Network (CSGN), can produce long action sequences that are coherent across time steps and among body parts.

View invariant human action recognition using histograms of 3D joints

This paper presents a novel approach for human action recognition with histograms of 3D joint locations (HOJ3D) as a compact representation of postures, and achieves superior results on a challenging 3D action dataset.
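The core idea of a joint-location histogram can be sketched generically: express each joint relative to a root joint, convert to spherical coordinates, and histogram over angular bins. This is a simplified illustration of the HOJ3D idea under stated assumptions (the paper additionally aligns the frame to the body orientation and uses fuzzy voting); the function name and bin counts are ours.

```python
import numpy as np

def joint_histogram_3d(joints: np.ndarray, n_az: int = 12, n_el: int = 6) -> np.ndarray:
    """Histogram joint positions (relative to the root joint) over
    spherical angular bins; returns a flattened, L1-normalized
    histogram of size n_az * n_el.

    joints: (n_joints, 3) array; joints[0] is treated as the root."""
    rel = joints - joints[0]                       # root-centred coordinates
    x, y, z = rel[1:, 0], rel[1:, 1], rel[1:, 2]   # skip the root itself
    az = np.arctan2(y, x)                          # azimuth in [-pi, pi)
    el = np.arcsin(np.clip(z / (np.linalg.norm(rel[1:], axis=1) + 1e-8), -1, 1))
    hist, _, _ = np.histogram2d(
        az, el, bins=[n_az, n_el],
        range=[[-np.pi, np.pi], [-np.pi / 2, np.pi / 2]])
    return (hist / max(hist.sum(), 1)).ravel()

rng = np.random.default_rng(2)
h = joint_histogram_3d(rng.normal(size=(20, 3)))   # 20 joints, e.g. Kinect-style
print(h.shape, round(float(h.sum()), 6))  # (72,) 1.0
```

Because the histogram discards the absolute viewpoint and keeps only root-relative angular statistics, descriptors of this kind are comparatively robust to viewpoint changes, which is the property the paper exploits.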

Character animation from 2D pictures and 3D motion data

A new method to animate photos of 2D characters using 3D motion capture data that correctly handles projective shape distortion and works for images from arbitrary views and requires only a small amount of user interaction.

Everybody Dance Now

This paper presents a simple method for “do as I do” motion transfer: given a source video of a person dancing, that performance can be transferred to a novel (amateur) target after only a few minutes of the target subject performing standard moves.