Deep High-Resolution Representation Learning for Visual Recognition

@article{Wang2019DeepHR,
  title={Deep High-Resolution Representation Learning for Visual Recognition},
  author={Jingdong Wang and Ke Sun and Tianheng Cheng and Borui Jiang and Chaorui Deng and Yang Zhao and Dong Liu and Yadong Mu and Mingkui Tan and Xinggang Wang and Wenyu Liu and Bin Xiao},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2019},
  volume={43},
  pages={3349-3364}
}
  • Jingdong Wang, Ke Sun, …, Bin Xiao
  • Published 20 August 2019
  • Computer Science
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. […] The proposed High-Resolution Network (HRNet) has two key characteristics: (i) it connects the high-to-low resolution convolution streams in parallel; (ii) it repeatedly exchanges information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range…
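The two characteristics named in the abstract (parallel streams, repeated exchange) can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names `downsample2x`, `upsample2x`, `add_maps`, and `exchange` are hypothetical, feature maps are plain nested lists with even height and width, and the learned layers are replaced by fixed stand-ins (2x2 average pooling for downsampling, nearest-neighbour repetition for upsampling, element-wise sum for fusion).

```python
# Illustrative sketch only: fixed operations stand in for learned convolutions.

def downsample2x(fmap):
    """Halve the resolution with 2x2 average pooling (assumed stand-in)."""
    h, w = len(fmap), len(fmap[0])
    return [
        [
            (fmap[i][j] + fmap[i][j + 1] + fmap[i + 1][j] + fmap[i + 1][j + 1]) / 4.0
            for j in range(0, w, 2)
        ]
        for i in range(0, h, 2)
    ]


def upsample2x(fmap):
    """Double the resolution by repeating each value in a 2x2 block."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add_maps(a, b):
    """Element-wise sum of two equally sized feature maps."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def exchange(high, low):
    """One exchange step: each stream receives the other, resampled to its own size."""
    new_high = add_maps(high, upsample2x(low))
    new_low = add_maps(low, downsample2x(high))
    return new_high, new_low


# Two parallel streams: a 4x4 high-resolution map and a 2x2 low-resolution map.
high = [[1.0, 2.0, 3.0, 4.0],
        [5.0, 6.0, 7.0, 8.0],
        [9.0, 10.0, 11.0, 12.0],
        [13.0, 14.0, 15.0, 16.0]]
low = [[1.0, 0.0],
       [0.0, 1.0]]

for _ in range(2):  # the exchange is applied repeatedly
    high, low = exchange(high, low)
```

Both streams keep their own resolution throughout: the high-resolution map is never discarded and later recovered, which is the point the abstract makes.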

HROM: Learning High-Resolution Representation and Object-Aware Masks for Visual Object Tracking

A novel high-resolution Siamese network is proposed, which connects the high-to-low resolution convolution streams in parallel and repeatedly exchanges information across resolutions to maintain high-resolution representations.

HRFormer: High-Resolution Transformer for Dense Prediction

We present a High-Resolution Transformer (HRFormer) that learns high-resolution representations for dense prediction tasks, in contrast to the original Vision Transformer that produces low-resolution

U-HRNet: Delving into Improving Semantic Representation of High Resolution Network for Dense Prediction

A U-shaped High-Resolution Network (U-HRNet), which adds more stages after the feature map with the strongest semantic representation and relaxes the constraint in HRNet that all resolutions need to be calculated in parallel for a newly added stage.

Generic Perceptual Loss for Modeling Structured Output Dependencies

It is demonstrated that a randomly-weighted deep CNN can be used to model the structured dependencies of outputs, which removes the requirements of pre-training and of a particular network structure (commonly VGG) that were previously assumed for the perceptual loss.

Estimating Human Pose Efficiently by Parallel Pyramid Networks

This paper designs a novel network architecture for human pose estimation, which aims to strike a fine balance between speed and accuracy, and refers to the architecture as “parallel pyramid” network (PPNet), as features of different resolutions are processed at different levels of the hierarchical model.

MTPose: Human Pose Estimation with High-Resolution Multi-scale Transformers

This paper proposes a human pose estimation framework built upon High-Resolution Multi-scale Transformers, termed MTPose, which combines the advantages of high-resolution representations and Transformers to improve performance.

Deep High-Resolution Representation Learning for Cross-Resolution Person Re-Identification

A Deep High-Resolution Pseudo-Siamese Framework (PS-HRNet) is proposed to solve the problem of matching person images with the same identity from different cameras, and a pseudo-siamese framework is developed to reduce the difference of feature distributions between low-resolution images and high-resolution images.

Deep Dual-resolution Networks for Real-time and Accurate Semantic Segmentation of Road Scenes

Novel deep dual-resolution networks (DDRNets) are proposed for real-time semantic segmentation of road scenes and a new contextual information extractor named Deep Aggregation Pyramid Pooling Module (DAPPM) is designed to enlarge effective receptive fields and fuse multi-scale context.

MPViT: Multi-Path Vision Transformer for Dense Prediction

This work explores multi-scale patch embedding and multi-path structure, constructing the Multi-Path Vision Transformer (MPViT), which consistently achieves superior performance over state-of-the-art Vision Transformers on ImageNet classification, object detection, instance segmentation, and semantic segmentation.

Representation Separation for Semantic Segmentation with Vision Transformers

An efficient framework of representation separation at the local-patch level and global-region level for semantic segmentation with ViTs is presented, targeting the peculiar over-smoothness of ViTs in semantic segmentation.
...

References

SHOWING 1-10 OF 195 REFERENCES

High-Resolution Representations for Labeling Pixels and Regions

A simple modification is introduced to augment the high-resolution representation by aggregating the (upsampled) representations from all the parallel convolutions rather than only the representation from the high-resolution convolution, which leads to stronger representations, evidenced by superior results.

Deep High-Resolution Representation Learning for Human Pose Estimation

This paper proposes a network that maintains high-resolution representations through the whole process of human pose estimation and empirically demonstrates the effectiveness of the network through the superior pose estimation results over two benchmark datasets: the COCO keypoint detection dataset and the MPII Human Pose dataset.

RefineNet: Multi-path Refinement Networks for High-Resolution Semantic Segmentation

RefineNet is presented, a generic multi-path refinement network that explicitly exploits all the information available along the down-sampling process to enable high-resolution prediction using long-range residual connections and introduces chained residual pooling, which captures rich background context in an efficient manner.

Full-Resolution Residual Networks for Semantic Segmentation in Street Scenes

This work proposes a novel ResNet-like architecture that exhibits strong localization and recognition performance, and combines multi-scale context with pixel-level accuracy by using two processing streams within the network.

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

This work addresses the task of semantic image segmentation with Deep Learning, proposes atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales, and improves the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models.
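Atrous (dilated) convolution, the building block behind ASPP, can be illustrated in one dimension. The sketch below is a simplified illustration, not code from the DeepLab paper: the function name `atrous_conv1d` is hypothetical, only valid (unpadded) positions are computed, and the kernel is applied in cross-correlation form as is usual in deep learning.

```python
def atrous_conv1d(signal, kernel, rate):
    """1-D atrous convolution: kernel taps are spaced `rate` samples apart."""
    span = (len(kernel) - 1) * rate  # distance between the first and last tap
    return [
        sum(k * signal[i + t * rate] for t, k in enumerate(kernel))
        for i in range(len(signal) - span)
    ]


signal = [0, 1, 2, 3, 4, 5, 6, 7, 8]
kernel = [1, 0, -1]

# rate=1 is an ordinary convolution covering 3 consecutive samples.
dense = atrous_conv1d(signal, kernel, rate=1)

# rate=3 uses the same 3 weights but covers a window of 7 samples.
dilated = atrous_conv1d(signal, kernel, rate=3)
```

Running the same kernel at several rates and combining the responses gives the multi-scale view that the summary attributes to ASPP; the combination step is omitted here.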

Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

This work extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results especially along object boundaries and applies the depthwise separable convolution to both Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network.

Recurrent Scene Parsing with Perspective Understanding in the Loop

This work proposes a depth-aware gating module that adaptively selects the pooling field size in a convolutional network architecture according to the object scale so that small details are preserved for distant objects while larger receptive fields are used for those nearby.

Laplacian Pyramid Reconstruction and Refinement for Semantic Segmentation

A multi-resolution reconstruction architecture based on a Laplacian pyramid that uses skip connections from higher resolution feature maps and multiplicative gating to successively refine segment boundaries reconstructed from lower-resolution maps is described.

Deep Feature Pyramid Reconfiguration for Object Detection

A novel reconfiguration architecture is proposed to combine low-level representations with high-level semantic features in a highly-nonlinear yet efficient way to gather task-oriented features across different spatial locations and scales, globally and locally.

Attention to Scale: Scale-Aware Semantic Image Segmentation

An attention mechanism that learns to softly weight the multi-scale features at each pixel location is proposed, which not only outperforms average- and max-pooling, but allows us to diagnostically visualize the importance of features at different positions and scales.
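The soft per-pixel weighting described above can be sketched as a softmax over scales followed by a weighted sum. This is an assumption-laden illustration: the names `softmax` and `fuse_scales` are hypothetical, the scores are given as inputs here whereas in the paper they are learned, and all scales are assumed to be already resized to the same number of pixels.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scales(features, scores):
    """Fuse features[s][p] over scales s, using softmax(scores[:, p]) at each pixel p."""
    n_scales = len(features)
    n_pixels = len(features[0])
    fused = []
    for p in range(n_pixels):
        weights = softmax([scores[s][p] for s in range(n_scales)])
        fused.append(sum(w * features[s][p] for s, w in enumerate(weights)))
    return fused


# Two scales, two pixels. Equal scores reduce to average pooling over scales.
features = [[2.0, 4.0],
            [4.0, 8.0]]
uniform = fuse_scales(features, [[0.0, 0.0], [0.0, 0.0]])

# A dominant score at one scale approaches selecting that scale alone.
peaked = fuse_scales(features, [[50.0, 0.0], [0.0, 50.0]])
```

With equal scores the fusion is plain averaging, and with one dominant score it approaches picking a single scale, so the learned weights interpolate between these two behaviours per pixel.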
...