# Deep High-Resolution Representation Learning for Visual Recognition

```bibtex
@article{Wang2019DeepHR,
  title={Deep High-Resolution Representation Learning for Visual Recognition},
  author={Jingdong Wang and Ke Sun and Tianheng Cheng and Borui Jiang and Chaorui Deng and Yang Zhao and Dong Liu and Yadong Mu and Mingkui Tan and Xinggang Wang and Wenyu Liu and Bin Xiao},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2019},
  volume={43},
  pages={3349-3364}
}
```
• Published 20 August 2019
• Computer Science
• IEEE Transactions on Pattern Analysis and Machine Intelligence
High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Key method: the proposed HRNet has two key characteristics: (i) it connects the high-to-low resolution convolution streams *in parallel*; (ii) it repeatedly exchanges the information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range…
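The two characteristics above can be illustrated with a minimal NumPy sketch: two feature streams at different resolutions are kept *in parallel*, and a fusion step repeatedly resamples each stream to the other's resolution and sums them. The helper names (`downsample2x`, `upsample2x`, `exchange`) and the pooling/nearest-neighbour resampling are illustrative stand-ins; HRNet itself uses strided 3×3 convolutions and 1×1 convolutions followed by bilinear upsampling.

```python
import numpy as np

def downsample2x(x):
    # 2x2 average pooling (a stand-in for HRNet's strided 3x3 convolution)
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample2x(x):
    # nearest-neighbour upsampling (HRNet uses bilinear after a 1x1 convolution)
    return x.repeat(2, axis=0).repeat(2, axis=1)

def exchange(high, low):
    # Multi-resolution fusion: each stream receives the other stream's
    # features, resampled to its own resolution, and fuses them by summation.
    new_high = high + upsample2x(low)
    new_low = low + downsample2x(high)
    return new_high, new_low

high = np.ones((8, 8))   # high-resolution stream
low = np.ones((4, 4))    # low-resolution stream
for _ in range(3):       # exchanges are repeated across the network's stages
    high, low = exchange(high, low)

print(high.shape, low.shape)  # resolutions are preserved: (8, 8) (4, 4)
```

The point of the sketch is that, unlike encoder-decoder designs, the high-resolution stream is never discarded: both streams keep their spatial size across every fusion step while mixing in each other's information.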
1,337 Citations

## Citations

• Computer Science
Sensors
• 2020
A novel high-resolution Siamese network is proposed, which connects the high-to-low resolution convolution streams in parallel as well as repeatedly exchanges the information across resolutions to maintain high-resolution representations.
• Computer Science
ArXiv
• 2021
We present a High-Resolution Transformer (HRFormer) that learns high-resolution representations for dense prediction tasks, in contrast to the original Vision Transformer that produces low-resolution representations…
• Computer Science
ArXiv
• 2022
A U-shaped High-Resolution Network (U-HRNet), which adds more stages after the feature map with the strongest semantic representation and relaxes the constraint in HRNet that all resolutions need to be calculated in parallel for a newly added stage.
• Computer Science
2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
• 2021
It is demonstrated that a randomly-weighted deep CNN can be used to model the structured dependencies of outputs, removing the requirements of pre-training and a particular network structure (commonly, VGG) that were previously assumed for the perceptual loss.
• Computer Science
IEEE Transactions on Image Processing
• 2021
This paper designs a novel network architecture for human pose estimation, which aims to strike a fine balance between speed and accuracy, and refers to the architecture as “parallel pyramid” network (PPNet), as features of different resolutions are processed at different levels of the hierarchical model.
• Computer Science
Neural Processing Letters
• 2022
This paper proposes a human pose estimation framework built upon High-Resolution Multi-scale Transformers, termed MTPose, and combines the two advantages of high-resolution and Transformers together to improve the performance.
• Computer Science
IEEE Transactions on Image Processing
• 2021
A Deep High-Resolution Pseudo-Siamese Framework (PS-HRNet) is proposed to solve the problem of matching person images with the same identity from different cameras, and a pseudo-siamese framework is developed to reduce the difference of feature distributions between low-resolution images and high-resolution images.
• Computer Science
ArXiv
• 2021
Novel deep dual-resolution networks (DDRNets) are proposed for real-time semantic segmentation of road scenes and a new contextual information extractor named Deep Aggregation Pyramid Pooling Module (DAPPM) is designed to enlarge effective receptive fields and fuse multi-scale context.
• Computer Science
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
• 2022
This work explores multi-scale patch embedding and multi-path structure, constructing the Multi-Path Vision Transformer (MPViT), which consistently achieves superior performance over state-of-the-art Vision Transformers on ImageNet classification, object detection, instance segmentation, and semantic segmentation.
• Computer Science
ArXiv
• 2022
A framework of representation separation at the local-patch level and the global-region level for semantic segmentation with ViTs is presented, targeting the peculiar over-smoothness of ViTs in semantic segmentation.

## References

Showing 1-10 of 195 references

• Computer Science
ArXiv
• 2019
A simple modification is introduced to augment the high-resolution representation by aggregating the (upsampled) representations from all the parallel convolutions rather than only the representation from the high-resolution convolution, which leads to stronger representations, evidenced by superior results.
• Computer Science
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
• 2019
This paper proposes a network that maintains high-resolution representations through the whole process of human pose estimation and empirically demonstrates the effectiveness of the network through the superior pose estimation results over two benchmark datasets: the COCO keypoint detection dataset and the MPII Human Pose dataset.
• Computer Science
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
• 2017
RefineNet is presented, a generic multi-path refinement network that explicitly exploits all the information available along the down-sampling process to enable high-resolution prediction using long-range residual connections and introduces chained residual pooling, which captures rich background context in an efficient manner.
• Computer Science
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
• 2017
This work proposes a novel ResNet-like architecture that exhibits strong localization and recognition performance, and combines multi-scale context with pixel-level accuracy by using two processing streams within the network.
• Computer Science
IEEE Transactions on Pattern Analysis and Machine Intelligence
• 2018
This work addresses the task of semantic image segmentation with deep learning and proposes atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales, improving the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models.
• Computer Science
ECCV
• 2018
This work extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results especially along object boundaries and applies the depthwise separable convolution to both Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network.
• Computer Science
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
• 2018
This work proposes a depth-aware gating module that adaptively selects the pooling field size in a convolutional network architecture according to the object scale so that small details are preserved for distant objects while larger receptive fields are used for those nearby.
• Computer Science
ECCV
• 2016
A multi-resolution reconstruction architecture based on a Laplacian pyramid that uses skip connections from higher resolution feature maps and multiplicative gating to successively refine segment boundaries reconstructed from lower-resolution maps is described.
• Computer Science
ECCV
• 2018
A novel reconfiguration architecture is proposed to combine low-level representations with high-level semantic features in a highly-nonlinear yet efficient way to gather task-oriented features across different spatial locations and scales, globally and locally.
• Computer Science
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
• 2016
An attention mechanism that learns to softly weight the multi-scale features at each pixel location is proposed, which not only outperforms average- and max-pooling, but allows us to diagnostically visualize the importance of features at different positions and scales.
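The per-pixel soft weighting described in this last reference can be sketched in a few lines of NumPy: a softmax over the scale axis yields, at every pixel, a set of weights that sum to one, and the fused map is the weighted sum of the (resized) multi-scale features. The shapes, random features, and logits below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax(x, axis=0):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
S, H, W = 3, 4, 4                           # S scales, H x W spatial map
features = rng.standard_normal((S, H, W))   # multi-scale features, resized to a common size
scores = rng.standard_normal((S, H, W))     # learned attention logits, one per scale per pixel

weights = softmax(scores, axis=0)           # at each pixel, weights over scales sum to 1
fused = (weights * features).sum(axis=0)    # attention-weighted fusion across scales

print(fused.shape)  # (4, 4)
```

Unlike average-pooling (uniform weights) or max-pooling (a hard one-hot choice), the softmax weights vary per pixel, which is what makes the learned scale importance visualizable.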