Deep High-Resolution Representation Learning for Visual Recognition

@article{Wang2021DeepHR,
  title={Deep High-Resolution Representation Learning for Visual Recognition},
  author={Jingdong Wang and Ke Sun and Tianheng Cheng and Borui Jiang and Chaorui Deng and Yang Zhao and D. Liu and Yadong Mu and Mingkui Tan and Xinggang Wang and Wenyu Liu and Bin Xiao},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2021},
  volume={43},
  pages={3349--3364}
}
• Published 20 August 2019
High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. [...] There are two key characteristics: (i) connect the high-to-low resolution convolution streams \emph{in parallel}; (ii) repeatedly exchange the information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range…
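As a toy illustration of the two key characteristics above (parallel high-to-low resolution streams, plus repeated cross-resolution exchange), here is a minimal NumPy sketch. The function names, the summation fusion, and the nearest-neighbor resampling are illustrative stand-ins for HRNet's strided and upsampling convolutions, not the paper's actual implementation:

```python
import numpy as np

def downsample(x):
    """Stride-2 subsampling (stands in for a strided convolution)."""
    return x[::2, ::2]

def upsample(x):
    """Nearest-neighbor 2x upsampling (stands in for learned upsampling)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def exchange(high, low):
    """One cross-resolution exchange unit: each stream fuses the other
    stream's representation at its own resolution (by summation)."""
    new_high = high + upsample(low)
    new_low = low + downsample(high)
    return new_high, new_low

# Two parallel streams maintained through the whole network:
high = np.ones((8, 8))   # high-resolution feature map
low = np.ones((4, 4))    # low-resolution feature map

for _ in range(3):       # repeatedly exchange information across resolutions
    high, low = exchange(high, low)

print(high.shape, low.shape)  # resolutions are maintained: (8, 8) (4, 4)
```

The point of the sketch is that the high-resolution stream is never discarded and recovered later (as in encoder-decoder designs); both resolutions persist end to end while information flows between them.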
548 Citations
HROM: Learning High-Resolution Representation and Object-Aware Masks for Visual Object Tracking
• Sensors
• 2020
A novel high-resolution Siamese network is proposed, which connects the high-to-low resolution convolution streams in parallel and repeatedly exchanges information across resolutions to maintain high-resolution representations.
HRFormer: High-Resolution Transformer for Dense Prediction
We present a High-Resolution Transformer (HRFormer) that learns high-resolution representations for dense prediction tasks, in contrast to the original Vision Transformer that produces low-resolution representations.
Generic Perceptual Loss for Modeling Structured Output Dependencies
• Yifan Liu, Hao Chen, Yu Chen
• 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
• 2021
It is demonstrated that a randomly weighted deep CNN can be used to model the structured dependencies of outputs, removing the requirements of pre-training and of a particular network structure (commonly VGG) previously assumed for the perceptual loss.
Estimating Human Pose Efficiently by Parallel Pyramid Networks
• Lin Zhao, Nannan Wang, Jian Yang, Xinbo Gao
• IEEE Transactions on Image Processing
• 2021
This paper designs a novel network architecture for human pose estimation that strikes a fine balance between speed and accuracy; the architecture is referred to as a “parallel pyramid” network (PPNet), as features of different resolutions are processed at different levels of the hierarchical model.
Deep High-Resolution Representation Learning for Cross-Resolution Person Re-Identification
• Guoqing Zhang, Hao Wang, Yuhui Zheng
• IEEE Transactions on Image Processing
• 2021
A Deep High-Resolution Pseudo-Siamese Framework (PS-HRNet) is proposed to solve the problem of matching person images with the same identity across different cameras; a pseudo-Siamese framework is developed to reduce the difference in feature distributions between low-resolution and high-resolution images.
Deep Dual-resolution Networks for Real-time and Accurate Semantic Segmentation of Road Scenes
• ArXiv
• 2021
Novel deep dual-resolution networks (DDRNets) are proposed for real-time semantic segmentation of road scenes, and a new contextual information extractor named Deep Aggregation Pyramid Pooling Module (DAPPM) is designed to enlarge effective receptive fields and fuse multi-scale context.
Parsing very high resolution urban scene images by learning deep ConvNets with edge-aware loss
• 2020
A standalone end-to-end edge-aware neural network (EaNet) is proposed for urban scene semantic segmentation that incorporates a large kernel pyramid pooling (LKPP) module to capture rich multi-scale context with strong continuous feature relations.
Conformer: Local Features Coupling Global Representations for Visual Recognition
Experiments show that Conformer, under comparable parameter complexity, outperforms the visual transformer (DeiT-B) by 2.3% on ImageNet, and on MS COCO it outperforms ResNet-101 by 3.7% and 3.6% mAP for object detection and instance segmentation, respectively, demonstrating its great potential as a general backbone network.
Generating Superpixels for High-resolution Images with Decoupled Patch Calibration
• Yaxiong Wang, Yuchao Wei, Xueming Qian, Li Zhu, Yi Yang
• ArXiv
• 2021
This paper devises Patch Calibration Networks (PCNet), aiming to efficiently and accurately implement high-resolution superpixel segmentation, and makes the first attempt to consider deep-learning-based superpixel generation for high-resolution cases.
Variational Structured Attention Networks for Deep Visual Representation Learning
• ArXiv
• 2021
VISTA-Net outperforms the state-of-the-art in multiple continuous and discrete prediction tasks, thus confirming the benefit of the proposed approach in joint structured spatial-channel attention estimation for deep representation learning.

References

Showing 1–10 of 207 references
High-Resolution Representations for Labeling Pixels and Regions
A simple modification is introduced to augment the high-resolution representation by aggregating the (upsampled) representations from all the parallel convolutions rather than only the representation from the high-resolution convolution, which leads to stronger representations, evidenced by superior results.
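The aggregation described above can be sketched in a few lines of NumPy; nearest-neighbor upsampling and the chosen shapes are illustrative stand-ins, not the paper's actual head:

```python
import numpy as np

def upsample_to(x, size):
    """Nearest-neighbor upsampling of a (C, H, W) map to (C, size, size)."""
    c, h, w = x.shape
    return x.repeat(size // h, axis=1).repeat(size // w, axis=2)

# Feature maps from four parallel streams at resolutions 32, 16, 8, 4:
streams = [np.ones((16, s, s)) for s in (32, 16, 8, 4)]

# Aggregate: upsample every stream to the highest resolution and concatenate
# along the channel axis, instead of keeping only the high-resolution stream.
aggregated = np.concatenate([upsample_to(x, 32) for x in streams], axis=0)
print(aggregated.shape)  # (64, 32, 32): channels from all four streams
```

The aggregated map carries both the fine spatial detail of the high-resolution stream and the stronger semantics of the lower-resolution streams.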
Deep High-Resolution Representation Learning for Human Pose Estimation
• 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
• 2019
This paper proposes a network that maintains high-resolution representations through the whole process of human pose estimation and empirically demonstrates its effectiveness through superior pose estimation results on two benchmark datasets: the COCO keypoint detection dataset and the MPII Human Pose dataset.
Full-Resolution Residual Networks for Semantic Segmentation in Street Scenes
• 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
• 2017
This work proposes a novel ResNet-like architecture that exhibits strong localization and recognition performance, and combines multi-scale context with pixel-level accuracy by using two processing streams within the network.
RefineNet: Multi-path Refinement Networks for High-Resolution Semantic Segmentation
• 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
• 2017
RefineNet is presented, a generic multi-path refinement network that explicitly exploits all the information available along the down-sampling process to enable high-resolution prediction using long-range residual connections. It also introduces chained residual pooling, which captures rich background context in an efficient manner.
Multi-scale Location-Aware Kernel Representation for Object Detection
• Hao Wang
• 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
• 2018
This paper proposes a novel Multi-scale Location-aware Kernel Representation (MLKP) to capture high-order statistics of deep features in proposals, which achieves very competitive performance with state-of-the-art methods, and improves Faster R-CNN by 4.9%, 4.7% and 5.0% respectively.
Recurrent Scene Parsing with Perspective Understanding in the Loop
• 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
• 2018
This work proposes a depth-aware gating module that adaptively selects the pooling field size in a convolutional network architecture according to the object scale, so that small details are preserved for distant objects while larger receptive fields are used for those nearby.
DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs
• IEEE Transactions on Pattern Analysis and Machine Intelligence
• 2018
This work addresses the task of semantic image segmentation with deep learning and proposes atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales; it improves the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models.
Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation
• ECCV
• 2018
This work extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results, especially along object boundaries, and applies the depthwise separable convolution to both Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network.
Deep Feature Pyramid Reconfiguration for Object Detection
• ECCV
• 2018
A novel reconfiguration architecture is proposed to combine low-level representations with high-level semantic features in a highly-nonlinear yet efficient way to gather task-oriented features across different spatial locations and scales, globally and locally.
Laplacian Pyramid Reconstruction and Refinement for Semantic Segmentation
• ECCV
• 2016
A multi-resolution reconstruction architecture based on a Laplacian pyramid is described, which uses skip connections from higher-resolution feature maps and multiplicative gating to successively refine segment boundaries reconstructed from lower-resolution maps.