The IKEA ASM Dataset: Understanding People Assembling Furniture through Actions, Objects and Pose

Yizhak Ben-Shabat, Xin Yu, F. Saleh, Dylan Campbell, Cristian Rodriguez-Opazo, Hongdong Li, and Stephen Gould. "The IKEA ASM Dataset: Understanding People Assembling Furniture through Actions, Objects and Pose." 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).
The availability of a large labeled dataset is a key requirement for applying deep learning methods to many computer vision tasks. In the context of understanding human activities, existing public datasets, while large in size, are often limited to a single RGB camera and provide only per-frame or per-clip action annotations. To enable richer analysis and understanding of human activities, we introduce IKEA ASM, a three-million-frame, multi-view furniture assembly video dataset.
Video Pose Distillation for Few-Shot, Fine-Grained Sports Action Recognition
Human pose is a useful feature for fine-grained sports action understanding. However, pose estimators are often unreliable when run on sports video due to domain shift and factors such as motion blur.
Motion Guided Attention Fusion to Recognize Interactions from Videos
This work introduces separate motion and object detection pathways for recognizing fine-grained interactions from videos, and fuses the bottom-up features in the motion pathway with features captured from object detections to learn the temporal aspects of an action.
Learning by Aligning Videos in Time
We present a self-supervised approach for learning video representations using temporal video alignment as a pretext task, while exploiting both frame-level and video-level information.
Human Action Recognition from Various Data Modalities: A Review
This paper reviews both the hand-crafted feature-based and deep learning-based methods for single data modalities and also the methods based on multiple modalities, including the fusion-based frameworks and the co-learning-based approaches for HAR.


DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation
An approach is proposed that jointly solves the tasks of detection and pose estimation: it infers the number of persons in a scene, identifies occluded body parts, and disambiguates body parts between people in close proximity to each other.
Monocular 3D Human Pose Estimation in the Wild Using Improved CNN Supervision
We propose a CNN-based approach for 3D human body pose estimation from single RGB images that addresses the limited generalizability of models trained solely on existing public datasets.
NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding
This work introduces a large-scale dataset for RGB+D human action recognition, which is collected from 106 distinct subjects and contains more than 114 thousand video samples and 8 million frames, and investigates a novel one-shot 3D activity recognition problem on this dataset.
Ordinal Depth Supervision for 3D Human Pose Estimation
This work proposes to use a weaker supervision signal provided by the ordinal depths of human joints, achieves new state-of-the-art performance on the relevant benchmarks, and validates the effectiveness of ordinal depth supervision for 3D human pose.
2D Human Pose Estimation: New Benchmark and State of the Art Analysis
A novel benchmark "MPII Human Pose" is introduced that makes a significant advance in terms of diversity and difficulty, a contribution that is required for future developments in human body models.
Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields
We present an approach to efficiently detect the 2D pose of multiple people in an image. The approach uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs).
OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields
OpenPose is released, the first open-source realtime system for multi-person 2D pose detection, including body, foot, hand, and facial keypoints, and the first combined body and foot keypoint detector, based on an internal annotated foot dataset.
NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis
A large-scale dataset for RGB+D human action recognition, collected from 40 distinct subjects with more than 56 thousand video samples and 4 million frames, is introduced. A new recurrent neural network structure is proposed to model the long-term temporal correlation of the features for each body part and utilize them for better action classification.
Cascaded Pyramid Network for Multi-person Pose Estimation
A novel network structure called Cascaded Pyramid Network (CPN) is presented, which aims to relieve the difficulty of localizing "hard" keypoints, achieving state-of-the-art results on the COCO keypoint benchmark with an average precision of 73.0.
Human Pose Forecasting via Deep Markov Models
This work proposes a generative framework for poses using variational autoencoders based on Deep Markov Models (DMMs), and evaluates the pose forecasts using a pose-based action classifier, which, it is argued, better reflects the subjective quality of pose forecasts than distance in coordinate space.