Corpus ID: 244114346

Multi-Person 3D Motion Prediction with Multi-Range Transformers

Jiashun Wang, Huazhe Xu, Medhini G. Narasimhan, Xiaolong Wang
We propose a novel framework for multi-person 3D motion trajectory prediction. Our key observation is that a person's actions and behaviors may depend heavily on the other people around them. Thus, instead of predicting each human pose trajectory in isolation, we introduce a Multi-Range Transformers model, which consists of a local-range encoder for individual motion and a global-range encoder for social interactions. The Transformer decoder then performs prediction for each person by taking a…
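The local-range/global-range design in the abstract can be illustrated with a minimal toy sketch. This is not the authors' implementation: the function names (`attention`, `multi_range_predict`), the mean-pooling of local features into one token per person, and the single-step decoder query are all assumptions made purely for illustration.

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention over plain Python vectors (illustrative toy)."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        weights = [w / z for w in weights]
        # weighted sum of value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def multi_range_predict(motions):
    """motions: one list per person, each a list of per-frame feature vectors.
    Sketch of the two-range idea: a local-range pass over each person's own
    history, a global-range pass across persons, then a per-person decoder
    query that attends over both."""
    # local-range encoding: self-attention within each person's trajectory
    local = [attention(seq, seq, seq) for seq in motions]
    # pool each person's local encoding into a single token (assumed mean-pool)
    pooled = [[sum(f[j] for f in enc) / len(enc) for j in range(len(enc[0]))]
              for enc in local]
    # global-range encoding: attention across all persons' pooled tokens
    global_enc = attention(pooled, pooled, pooled)
    # decoder sketch: each person's last local token queries local + global context
    preds = []
    for i, enc in enumerate(local):
        context = enc + [global_enc[i]]
        preds.append(attention([enc[-1]], context, context)[0])
    return preds
```

For example, with two persons, three observed frames, and 2-dimensional features, `multi_range_predict` returns one predicted feature vector per person; a real model would stack such layers with learned projections and decode a full future sequence.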


SoMoFormer: Social-Aware Motion Transformer for Multi-Person Motion Prediction

A novel social-aware motion attention mechanism in SoMoFormer is devised to further optimize dynamics representations and capture interaction dependencies simultaneously via motion-similarity calculation across time and social dimensions.

SoMoFormer: Multi-Person Pose Forecasting with Transformers

This paper presents a new method, called Social Motion Transformer (SoMoFormer), which uniquely models human motion input as a joint sequence rather than a time sequence, allowing it to perform attention over joints while predicting an entire future motion sequence for each joint in parallel.

Motion Transformer with Global Intention Localization and Local Movement Refinement

The Motion TRansformer (MTR) framework is proposed, which models motion prediction as the joint optimization of global intention localization and local movement refinement, and incorporates spatial intention priors by adopting a small set of learnable motion query pairs.

Comparison of Spatio-Temporal Models for Human Motion and Pose Forecasting in Face-to-Face Interaction Scenarios

This work presents the first systematic comparison of state-of-the-art approaches for behavior forecasting, autoregressively predicting the future with methods trained for the short-term future, and shows that its findings hold even when highly noisy annotations are used, which opens new horizons towards the use of weakly-supervised learning.

MotionCLIP: Exposing Human Motion Generation to CLIP Space

Although CLIP has never seen the motion domain, MotionCLIP offers unprecedented text-to-motion abilities, allowing out-of-domain actions, disentangled editing, and abstract language specification.

TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts

This paper aims to explore the generation of 3D human full-body motions from texts, as well as its reciprocal task (shorthanded as text2motion and motion2text, respectively), with the use of motion tokens, a discrete and compact motion representation.

I^2R-Net: Intra- and Inter-Human Relation Network for Multi-Person Pose Estimation

The proposed Intra- and Inter-Human Relation Network (I²R-Net) for multi-person pose estimation surpasses all state-of-the-art methods.

Didn't see that coming: a survey on non-verbal social human behavior forecasting

This survey defines the behavior forecasting problem for multiple interactive agents in a generic way that aims at unifying the fields of social signals prediction and human motion forecasting, traditionally separated.

Human-Robot Commensality: Bite Timing Prediction for Robot-Assisted Feeding in Groups

Data-driven models are developed to predict when a robot should feed during social dining scenarios, showing that bite-timing strategies that take into account the delicate balance of social cues can lead to seamless interactions during robot-assisted feeding in a social dining scenario.

ChaLearn LAP Challenges on Self-Reported Personality Recognition and Non-Verbal Behavior Forecasting During Social Dyadic Interactions: Dataset, Design, and Results

This paper summarizes the 2021 ChaLearn Looking at People Challenge on Understanding Social Behavior in Dyadic and Small Group Interactions (DYAD), which featured two tracks: self-reported personality recognition and non-verbal behavior forecasting.

Socially and Contextually Aware Human Motion and Pose Forecasting

A novel framework is proposed that tackles both human motion (or trajectory) forecasting and body-skeleton pose forecasting in a unified end-to-end pipeline, achieving superior performance compared to several baselines on two social datasets.

TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild

A novel TRajectory and POse Dynamics method (nicknamed TRiPOD), based on graph attention networks, models human-human and human-object interactions in both the input space and the output space (the decoded future output).

Human Motion Prediction via Spatio-Temporal Inpainting

This work argues that the L2 metric, considered so far by most approaches, fails to capture the actual distribution of long-term human motion, and proposes two alternative metrics, based on the distribution of frequencies, that are able to capture more realistic motion patterns.

Learning Trajectory Dependencies for Human Motion Prediction

A simple feed-forward deep network for motion prediction is presented, which takes into account both temporal smoothness and spatial dependencies among human body joints, together with a new graph convolutional network designed to learn graph connectivity automatically.

Long-term Human Motion Prediction with Scene Context

This work proposes a novel three-stage framework that exploits scene context to tackle the task of predicting human motion and shows consistent quantitative and qualitative improvements over existing methods.

Predicting 3D Human Dynamics From Video

This work presents perhaps the first approach for predicting a future 3D mesh model sequence of a person from past video input, and inspired by the success of autoregressive models in language modeling tasks, learns an intermediate latent space on which to predict the future.

A Neural Temporal Model for Human Motion Prediction

A novel metric, called Normalized Power Spectrum Similarity (NPSS), is proposed, to evaluate the long-term predictive ability of motion synthesis models, complementing the popular mean-squared error (MSE) measure of Euler joint angles over time.

History Repeats Itself: Human Motion Prediction via Motion Attention

An attention-based feed-forward network is introduced that explicitly leverages the observation that human motion tends to repeat itself, capturing the similarity between the current motion context and historical motion sub-sequences.

Adversarial Geometry-Aware Human Motion Prediction

This work proposes a novel frame-wise geodesic loss as a geometrically meaningful, more precise distance measurement and presents a new learning procedure to simultaneously validate the sequence-level plausibility of the prediction and its coherence with the input sequence by introducing two global recurrent discriminators.

We are More than Our Joints: Predicting how 3D Bodies Move

MOJO (More than Our JOints), a novel variational autoencoder with a latent DCT space that generates motions from latent frequencies, is trained; it preserves the full temporal resolution of the input motion, and sampling from the latent frequencies explicitly introduces high-frequency components into the generated motion.