Multi-Person 3D Motion Prediction with Multi-Range Transformers
@article{Wang2021MultiPerson3M, title={Multi-Person 3D Motion Prediction with Multi-Range Transformers}, author={Jiashun Wang and Huazhe Xu and Medhini G. Narasimhan and Xiaolong Wang}, journal={ArXiv}, year={2021}, volume={abs/2111.12073} }
We propose a novel framework for multi-person 3D motion trajectory prediction. Our key observation is that a person's actions and behaviors may depend strongly on the other people around them. Thus, instead of predicting each human pose trajectory in isolation, we introduce a Multi-Range Transformers model which consists of a local-range encoder for individual motion and a global-range encoder for social interactions. The Transformer decoder then performs prediction for each person by taking a…
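The description above maps naturally onto standard Transformer components. Below is a minimal PyTorch sketch of that three-part layout: a local-range encoder over one person's own history, a global-range encoder over everyone's pooled motion, and a decoder that predicts each person's future from both. Module names, dimensions, and the query construction are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the multi-range idea described above (illustrative only;
# module names, dimensions, and the query construction are assumptions, not
# the authors' released code).
import torch
import torch.nn as nn

class MultiRangeSketch(nn.Module):
    def __init__(self, d_model=128, n_heads=8, n_layers=2, n_joints=15):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        # Local-range encoder: attends over one person's own pose history.
        self.local_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Global-range encoder: attends over the pooled motion of all people in the scene.
        self.global_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Decoder: predicts each person's future from local + global memories.
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.pose_in = nn.Linear(3 * n_joints, d_model)   # e.g. 15 joints x 3D coords (assumed)
        self.pose_out = nn.Linear(d_model, 3 * n_joints)

    def forward(self, poses, future_len=15):
        # poses: (batch, n_people, t_past, n_joints * 3)
        b, p, t, d = poses.shape
        x = self.pose_in(poses)                               # (b, p, t, d_model)
        local = self.local_encoder(x.reshape(b * p, t, -1))   # per-person motion features
        glob = self.global_encoder(x.reshape(b, p * t, -1))   # scene-level interaction features
        memory = torch.cat([local.reshape(b, p * t, -1), glob], dim=1)
        # Simple queries for every person's future frames, taken from each person's
        # last encoded state (an assumption made here for brevity).
        queries = local[:, -1:, :].repeat(1, future_len, 1).reshape(b, p * future_len, -1)
        out = self.decoder(queries, memory)
        return self.pose_out(out).reshape(b, p, future_len, d)
```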
10 Citations
SoMoFormer: Social-Aware Motion Transformer for Multi-Person Motion Prediction
- Computer Science, ArXiv
- 2022
A novel social-aware motion attention mechanism in SoMoFormer is devised to further optimize dynamics representations and capture interaction dependencies simultaneously via motion similarity calculation across time and social dimensions.
SoMoFormer: Multi-Person Pose Forecasting with Transformers
- Computer Science, ArXiv
- 2022
This paper presents a new method, called Social Motion Transformer (SoMoFormer), which uniquely models human motion input as a joint sequence rather than a time sequence, allowing it to perform attention over joints while predicting an entire future motion sequence for each joint in parallel.
Motion Transformer with Global Intention Localization and Local Movement Refinement
- Computer Science, ArXiv
- 2022
A Motion TRansformer (MTR) framework is proposed that models motion prediction as the joint optimization of global intention localization and local movement refinement, and incorporates spatial intention priors by adopting a small set of learnable motion query pairs.
Comparison of Spatio-Temporal Models for Human Motion and Pose Forecasting in Face-to-Face Interaction Scenarios
- Computer Science, DYAD@ICCV
- 2021
This work presents the first systematic comparison of state-of-the-art approaches for behavior forecasting, autoregressively predicting the long-term future with methods trained for the short-term future, and shows that its findings hold even when highly noisy annotations are used, which opens new horizons towards the use of weakly-supervised learning.
MotionCLIP: Exposing Human Motion Generation to CLIP Space
- Computer Science, ECCV
- 2022
Although CLIP has never seen the motion domain, MotionCLIP offers unprecedented text-to-motion abilities, allowing out-of-domain actions, disentangled editing, and abstract language specification.
TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts
- Computer Science, ECCV
- 2022
This paper aims to explore the generation of 3D human full-body motions from texts, as well as its reciprocal task (shorthanded as text2motion and motion2text, respectively), with the use of motion tokens, a discrete and compact motion representation.
I^2R-Net: Intra- and Inter-Human Relation Network for Multi-Person Pose Estimation
- Computer Science, IJCAI
- 2022
The proposed Intra- and Inter-Human Relation Network (I²R-Net) for multi-person pose estimation surpasses all the state-of-the-art methods.
Didn't see that coming: a survey on non-verbal social human behavior forecasting
- Computer Science, DYAD@ICCV
- 2021
This survey defines the behavior forecasting problem for multiple interactive agents in a generic way that aims at unifying the fields of social signals prediction and human motion forecasting, traditionally separated.
Human-Robot Commensality: Bite Timing Prediction for Robot-Assisted Feeding in Groups
- Computer Science, ArXiv
- 2022
Data-driven models are presented to predict when a robot should feed during social dining scenarios, and it is shown that bite timing strategies that take into account the delicate balance of social cues can lead to seamless interactions during robot-assisted feeding in a social dining scenario.
ChaLearn LAP Challenges on Self-Reported Personality Recognition and Non-Verbal Behavior Forecasting During Social Dyadic Interactions: Dataset, Design, and Results
- Computer Science, Psychology, DYAD@ICCV
- 2021
This paper summarizes the 2021 ChaLearn Looking at People Challenge on Understanding Social Behavior in Dyadic and Small Group Interactions (DYAD), which featured two tracks, self-reported…
References
Showing 1-10 of 77 references
Socially and Contextually Aware Human Motion and Pose Forecasting
- Computer Science, IEEE Robotics and Automation Letters
- 2020
A novel framework is proposed to tackle both human motion (or trajectory) and body skeleton pose forecasting in a unified end-to-end pipeline, achieving superior performance compared to several baselines on two social datasets.
TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild
- Computer Science, 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2021
A novel TRajectory and POse Dynamics (TRiPOD) method based on graph attentional networks is proposed to model human-human and human-object interactions both in the input space and in the output space (the decoded future output).
Human Motion Prediction via Spatio-Temporal Inpainting
- Computer Science, 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2019
This work argues that the L2 metric, considered so far by most approaches, fails to capture the actual distribution of long-term human motion, and proposes two alternative metrics, based on the distribution of frequencies, that are able to capture more realistic motion patterns.
Learning Trajectory Dependencies for Human Motion Prediction
- Computer Science, 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2019
A simple feed-forward deep network for motion prediction is proposed that takes into account both temporal smoothness and spatial dependencies among human body joints, and a new graph convolutional network is designed to learn the graph connectivity automatically (a rough sketch of this idea follows below).
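As a rough illustration of the "learned graph connectivity" idea in that summary, the sketch below makes the joint adjacency matrix a trainable parameter of a graph convolution layer. It is an assumption-laden simplification, not the paper's implementation (which additionally operates on DCT coefficients of joint trajectories).

```python
# Sketch of a graph convolution whose joint connectivity is a learned parameter
# (illustrative only; not the paper's released code).
import torch
import torch.nn as nn

class LearnableGraphConv(nn.Module):
    def __init__(self, n_joints, in_features, out_features):
        super().__init__()
        # The adjacency matrix itself is trainable, so connectivity is learned from data.
        self.adj = nn.Parameter(torch.eye(n_joints) + 0.01 * torch.randn(n_joints, n_joints))
        self.weight = nn.Linear(in_features, out_features)

    def forward(self, x):
        # x: (batch, n_joints, in_features), e.g. per-joint coefficients of a trajectory
        x = torch.matmul(self.adj, x)   # mix information across joints via learned edges
        return torch.tanh(self.weight(x))

# Usage: 22 joints, 20 coefficients per joint (values chosen arbitrarily here).
layer = LearnableGraphConv(n_joints=22, in_features=20, out_features=64)
h = layer(torch.randn(8, 22, 20))   # -> (8, 22, 64)
```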
Long-term Human Motion Prediction with Scene Context
- Computer Science, ECCV
- 2020
This work proposes a novel three-stage framework that exploits scene context to tackle the task of predicting human motion and shows consistent quantitative and qualitative improvements over existing methods.
Predicting 3D Human Dynamics From Video
- Computer Science, 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2019
This work presents perhaps the first approach for predicting a future 3D mesh model sequence of a person from past video input; inspired by the success of autoregressive models in language modeling tasks, it learns an intermediate latent space on which to predict the future.
A Neural Temporal Model for Human Motion Prediction
- Computer Science, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2019
A novel metric, called Normalized Power Spectrum Similarity (NPSS), is proposed to evaluate the long-term predictive ability of motion synthesis models, complementing the popular mean-squared error (MSE) measure of Euler joint angles over time (a simplified sketch of the idea follows below).
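The sketch below illustrates the general idea of comparing normalized power spectra of joint-angle sequences with an earth mover's distance, which is the spirit of NPSS; the exact normalization, weighting, and aggregation in the paper may differ, so treat this only as an illustration.

```python
# Rough, simplified sketch of a power-spectrum-based similarity in the spirit of NPSS
# (not a reference implementation; details may differ from the paper).
import numpy as np

def spectrum_emd(gt, pred):
    """gt, pred: (frames, features) joint-angle sequences."""
    # Power spectra per feature, via the real FFT over time.
    p_gt = np.abs(np.fft.rfft(gt, axis=0)) ** 2
    p_pr = np.abs(np.fft.rfft(pred, axis=0)) ** 2
    # Normalize each feature's spectrum into a distribution over frequencies.
    q_gt = p_gt / (p_gt.sum(axis=0, keepdims=True) + 1e-8)
    q_pr = p_pr / (p_pr.sum(axis=0, keepdims=True) + 1e-8)
    # 1D earth mover's distance per feature = L1 gap between cumulative distributions.
    emd = np.abs(np.cumsum(q_gt, axis=0) - np.cumsum(q_pr, axis=0)).sum(axis=0)
    # Weight features by their share of total ground-truth power, then aggregate.
    w = p_gt.sum(axis=0) / (p_gt.sum() + 1e-8)
    return float((w * emd).sum())
```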
History Repeats Itself: Human Motion Prediction via Motion Attention
- Computer Science, ECCV
- 2020
An attention-based feed-forward network is introduced that explicitly leverages the observation that human motion tends to repeat itself, using motion attention to capture the similarity between the current motion context and historical motion sub-sequences (a minimal sketch follows below).
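The attention step described there can be sketched as ordinary dot-product attention in which the query is the current motion context and the keys/values come from historical sub-sequences; the version below is a simplification (the paper additionally works on frequency-domain motion representations and feeds the attended result to a predictor network).

```python
# Minimal sketch of attention between the current motion context and historical
# sub-sequences (simplified; tensor layouts are assumptions for illustration).
import torch
import torch.nn.functional as F

def motion_attention(context, history_keys, history_values):
    """
    context:        (batch, d)     embedding of the last observed frames
    history_keys:   (batch, n, d)  embeddings of earlier motion sub-sequences
    history_values: (batch, n, v)  representations of what followed each sub-sequence
    """
    scores = torch.einsum('bd,bnd->bn', context, history_keys) / context.shape[-1] ** 0.5
    weights = F.softmax(scores, dim=-1)                          # similarity to past motion
    return torch.einsum('bn,bnv->bv', weights, history_values)   # weighted sum of past futures
```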
Adversarial Geometry-Aware Human Motion Prediction
- Computer Science, ECCV
- 2018
This work proposes a novel frame-wise geodesic loss as a geometrically meaningful, more precise distance measurement and presents a new learning procedure to simultaneously validate the sequence-level plausibility of the prediction and its coherence with the input sequence by introducing two global recurrent discriminators.
We are More than Our Joints: Predicting how 3D Bodies Move
- Computer Science, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2021
MOJO (More than Our JOints), a novel variational autoencoder with a latent DCT space that generates motions from latent frequencies, is trained; it preserves the full temporal resolution of the input motion, and sampling from the latent frequencies explicitly introduces high-frequency components into the generated motion.