BABEL: Bodies, Action and Behavior with English Labels

Abhinanda R. Punnakkal, Arjun Chandrasekaran, Nikos Athanasiou, Alejandra Quiros-Ramirez, Michael J. Black (Max Planck Institute for Intelligent Systems and Universität Konstanz)
2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Understanding the semantics of human movement – the what, how and why of the movement – is an important problem that requires datasets of human actions with semantic labels. Existing datasets take one of two approaches. Large-scale video datasets contain many action labels but do not contain ground-truth 3D human motion. Alternatively, motion-capture (mocap) datasets have precise body motions but are limited to a small number of actions. To address this, we present BABEL, a large dataset with… 

Weakly-supervised Action Transition Learning for Stochastic Human Motion Prediction

The task of action-driven stochastic human motion prediction is introduced, which aims to predict multiple plausible future motions given a sequence of action labels and a short motion history, and a VAE-based model conditioned on both the observed motion and the action label sequence is designed.

TEACH: Temporal Action Composition for 3D Humans

An approach to enable the synthesis of a series of actions, called TEACH for “TEmporal Action Compositions for Human motions”, produces realistic human motions for a wide variety of actions and temporal compositions from language descriptions.

OhMG: Zero-shot Open-vocabulary Human Motion Generation

Extensive experiments show that the proposed controllable and flexible motion generation framework can generate better text-consistent poses and motions across various baselines and metrics.

HUMANISE: Language-conditioned Human Motion Generation in 3D Scenes

This work presents a novel scene-and-language conditioned generative model that can produce 3D human motions of the desirable action interacting with the specified objects and demonstrates that the model generates diverse and semantically consistent human motions in 3D scenes.

PoseScript: 3D Human Poses from Natural Language

This work introduces the PoseScript dataset, which pairs a few thousand 3D human poses from AMASS with rich human-annotated descriptions of the body parts and their spatial relationships, and proposes an elaborate captioning process that generates automatic synthetic descriptions in natural language from given 3D keypoints.

NEURAL MARIONETTE: A Transformer-based Multi-action Human Motion Synthesis System

This work presents a neural network-based system for long-term, multi-action human motion synthesis that can produce high-quality and meaningful motions with smooth transitions from simple user input: a sequence of action tags with expected action durations and, optionally, a hand-drawn moving trajectory.

Generating Diverse and Natural 3D Human Motions from Text

  • Chuan Guo, Shihao Zou, Li Cheng
  • 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
This work proposes motion snippet code as an internal motion representation, which captures local semantic motion contexts and is empirically shown to facilitate the generation of plausible motions faithful to the input text.

Learning Joint Representation of Human Motion and Language

This work proposes a motion-language model trained with contrastive learning, which enables it to learn more generalizable representations of the human motion domain; empirical results show that the model learns strong representations of human motion data by navigating the language modality.

Learning Uncoupled-Modulation CVAE for 3D Action-Conditioned Human Motion Synthesis

The Uncoupled-Modulation Conditional Variational AutoEncoder (UM-CVAE) is proposed to generate action-conditioned motions from scratch in an uncoupled manner and achieves state-of-the-art performance both qualitatively and quantitatively with potential applications.

MotionCLIP: Exposing Human Motion Generation to CLIP Space

Although CLIP has never seen the motion domain, MotionCLIP offers unprecedented text-to-motion abilities, allowing out-of-domain actions, disentangled editing, and abstract language specification.



Action2Motion: Conditioned Generation of 3D Human Motions

This paper aims to generate plausible human motion sequences in 3D given a prescribed action type, and proposes a temporal Variational Auto-Encoder (VAE) that encourages a diverse sampling of the motion space.

Watch-n-patch: Unsupervised understanding of actions and relations

The model learns the high-level action co-occurrence and temporal relations between the actions in the activity video and is applied to unsupervised action segmentation and recognition, and also to a novel application that detects forgotten actions, which is called action patching.

Going deeper into action recognition: A survey

HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization

On HACS Segments, the state-of-the-art methods of action proposal generation and action localization are evaluated, and the new challenges posed by the dense temporal annotations are highlighted.

AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions

The AVA dataset densely annotates 80 atomic visual actions in 437 15-minute video clips, where actions are localized in space and time, resulting in 1.59M action labels with multiple labels per person occurring frequently.

The KIT whole-body human motion database

We present a large-scale whole-body human motion database consisting of captured raw motion data as well as the corresponding post-processed motions. This database serves as a key element for a wide

Language2Pose: Natural Language Grounded Pose Forecasting

This paper introduces a neural architecture called Joint Language-to-Pose (or JL2P), which learns a joint embedding of language and pose and evaluates the proposed model on a publicly available corpus of 3D pose data and human-annotated sentences.

Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos

A novel variant of long short-term memory deep networks is defined for modeling these temporal relations via multiple input and output connections and it is shown that this model improves action labeling accuracy and further enables deeper understanding tasks ranging from structured retrieval to action prediction.

The THUMOS challenge on action recognition for videos "in the wild"

Human Motion Anticipation with Symbolic Label

This work approximates a person's intention via a symbolic representation, for example fine-grained action labels such as walking or sitting down, by first anticipating symbolic labels and then generating human motion conditioned on the human motion input sequence as well as on the forecast labels.