Differentiable Multi-Granularity Human Representation Learning for Instance-Aware Human Semantic Parsing

  • Tianfei Zhou, Wenguan Wang, Si Liu, Yi Yang, Luc Van Gool
  • Published 8 March 2021
  • Computer Science
  • 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
To address the challenging task of instance-aware human part parsing, a new bottom-up regime is proposed to learn category-level human semantic segmentation as well as multi-person pose estimation in a joint and end-to-end manner. It is a compact, efficient and powerful framework that exploits structural information over different human granularities and eases the difficulty of person partitioning. Specifically, a dense-to-sparse projection field, which allows explicitly associating dense human… 

Heterogeneous Interactive Attention Network for Human Parsing

A Heterogeneous Interactive Attention Network (HIANet) is proposed, in which attention between heterogeneous data is exploited to capture long-distance context dependence, and supplementary cues with plentiful interaction mutually guide multi-source features to correct their respective prediction errors, further refining the human parsing result.

Learning Equivariant Segmentation with Instance-Unique Querying

A new training framework that boosts query-based models through discriminative query embedding learning and encourages both image (instance) representations and queries to be equivariant to geometric transformations, leading to more robust instance-query matching.

ReSTiNet: On Improving the Performance of Tiny-YOLO-Based CNN Architecture for Applications in Human Detection

The proposed ReSTiNet is a novel compressed convolutional neural network that addresses the issues of size, detection speed, and accuracy, offering faster detection, a compact model size, reduced overfitting, and higher mAP than other lightweight models such as MobileNet and SqueezeNet.

PointScatter: Point Set Representation for Tubular Structure Extraction

PointScatter is proposed as an alternative to segmentation models for the tubular structure extraction task; it splits the image into scatter regions and predicts points for each scatter region in parallel, and a greedy region-wise bipartite matching algorithm is proposed to train the network end-to-end efficiently.

A Semi-Supervised Learning Approach for Automatic Detection and Fashion Product Category Prediction with Small Training Dataset Using FC-YOLOv4

Experimental findings from the FC-YOLOv4 model demonstrate that it can effectively provide accurate fashion category detection for properly captured and cluttered images compared to the YOLOv4 and YOLOv3 models.

Sensor-Based Hand Gesture Detection and Recognition by Key Intervals

Experimental results reveal that the proposed algorithm provides an effective alternative for applications where accurate detection and classification of hand gestures by simple networks are desired.

AGS-SSD: Attention-Guided Sampling for 3D Single-Stage Detector

An attention-guided downsampling method for point-cloud-based 3D object detection, named AGS-SSD, which achieves significant improvements with novel architectures against the baseline and runs at 24 frames per second for inference.

AIParsing: Anchor-Free Instance-Level Human Parsing

An anchor-free instance-level human parsing network, solvable at the pixel level, that achieves the best global-level and instance-level performance over state-of-the-art one-stage top-down alternatives.

Sign and Human Action Detection Using Deep Learning

This study aims to develop an efficient deep learning model that can be used to predict British sign language in an attempt to narrow the communication gap between speech-impaired and non-speech-impaired people in the community.

Multi-Granularity Regularized Re-Balancing for Class Incremental Learning

An assumption-agnostic method, Multi-Granularity Regularized re-Balancing (MGRB), is proposed to address the problem of catastrophic forgetting in deep learning models; it introduces a novel multi-granularity regularization term that enables the model to consider the correlations of classes in addition to re-balancing the data.



Understanding Humans in Crowded Scenes: Deep Nested Adversarial Learning and A New Benchmark for Multi-Human Parsing

A new large-scale database, "Multi-Human Parsing (MHP)", is presented for algorithm development and evaluation; the proposed Nested Adversarial Network (NAN) consistently outperforms existing state-of-the-art solutions on MHP and several other datasets, and serves as a strong baseline to drive future research on multi-human parsing.

Renovating Parsing R-CNN for Accurate Multiple Human Parsing

Renovating Parsing R-CNN is presented, which introduces a global semantic enhanced feature pyramid network and a parsing re-scoring network into the existing high-performance pipeline; the re-scoring network regresses a confidence score to represent the quality of the parsing result.

Parsing R-CNN for Instance-Level Human Analysis

This paper presents an end-to-end pipeline for instance-level human analysis, named Parsing R-CNN, which processes a set of human instances simultaneously by comprehensively considering the characteristics of the region-based approach and the appearance of a human, allowing it to represent the details of instances.

Joint Multi-person Pose Estimation and Semantic Part Segmentation

This paper proposes to solve the two tasks jointly for natural multi-person images: the estimated pose provides an object-level shape prior to regularize part segments, while the part-level segments constrain the variation of pose locations.

Learning Semantic Neural Tree for Human Parsing

A novel semantic neural tree for human parsing is designed, which uses a tree architecture to encode the physiological structure of the human body and adopts a coarse-to-fine process in a cascade manner to generate accurate results.

DMM-Net: Differentiable Mask-Matching Network for Video Object Segmentation

A differentiable matching layer that unrolls a projected gradient descent algorithm whose projection step uses Dykstra's algorithm; it is proved that, under mild conditions, the matching is guaranteed to converge to the optimal one.

PifPaf: Composite Fields for Human Pose Estimation

The new PifPaf method, which uses a Part Intensity Field to localize body parts and a Part Association Field to associate body parts with each other to form full human poses, outperforms previous methods at low resolution and in crowded, cluttered and occluded scenes.

Devil in the Details: Towards Accurate Single and Multiple Human Parsing

This paper identifies several useful properties, including feature resolution, global context information and edge details, and performs rigorous analyses to reveal how to leverage them to benefit the human parsing task, resulting in a simple yet effective Context Embedding with Edge Perceiving (CE2P) framework for single human parsing.

Instance-level Human Parsing via Part Grouping Network

This work makes the first attempt to explore a detection-free Part Grouping Network (PGN) for efficiently parsing multiple people in an image in a single pass, and outperforms all state-of-the-art methods on the PASCAL-Person-Part dataset.

DensePose: Dense Human Pose Estimation in the Wild

This work establishes dense correspondences between an RGB image and a surface-based representation of the human body, a task referred to as dense human pose estimation, and improves accuracy through cascading, obtaining a system that delivers highly accurate results at multiple frames per second on a single GPU.