Deep High-Resolution Representation Learning for Visual Recognition

@article{Wang2021DeepHR,
  title={Deep High-Resolution Representation Learning for Visual Recognition},
  author={Jingdong Wang and Ke Sun and Tianheng Cheng and Borui Jiang and Chaorui Deng and Yang Zhao and D. Liu and Yadong Mu and Mingkui Tan and Xinggang Wang and Wenyu Liu and Bin Xiao},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2021},
  volume={43},
  pages={3349--3364}
}
  • Jingdong Wang, Ke Sun, +9 authors Bin Xiao
  • Published 2021
  • Computer Science, Medicine
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. [...] There are two key characteristics: (i) connect the high-to-low resolution convolution streams in parallel; (ii) repeatedly exchange the information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range…
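The two characteristics in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the learned convolutions are replaced with fixed 2x2 average pooling and nearest-neighbour upsampling, only two streams are used, and the names `downsample`, `upsample`, `exchange`, and `run_parallel_streams` are hypothetical helpers introduced here for illustration.

```python
from typing import List, Tuple

Grid = List[List[float]]


def downsample(x: Grid) -> Grid:
    """Halve resolution with 2x2 average pooling (assumes even height and width)."""
    return [
        [(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4.0
         for j in range(0, len(x[0]), 2)]
        for i in range(0, len(x), 2)
    ]


def upsample(x: Grid) -> Grid:
    """Double resolution by repeating each value (nearest neighbour)."""
    return [[x[i // 2][j // 2] for j in range(2 * len(x[0]))]
            for i in range(2 * len(x))]


def add(a: Grid, b: Grid) -> Grid:
    """Element-wise sum of two grids of the same shape."""
    return [[p + q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def exchange(high: Grid, low: Grid) -> Tuple[Grid, Grid]:
    """One fusion step: each stream receives the other, resampled to its own size."""
    new_high = add(high, upsample(low))    # low-to-high information flow
    new_low = add(low, downsample(high))   # high-to-low information flow
    return new_high, new_low


def run_parallel_streams(high: Grid, low: Grid, steps: int = 2) -> Tuple[Grid, Grid]:
    """Keep both resolutions alive in parallel and fuse them repeatedly."""
    for _ in range(steps):
        high, low = exchange(high, low)
    return high, low


if __name__ == "__main__":
    high = [[0.0, 4.0], [8.0, 4.0]]  # 2x2 high-resolution stream
    low = [[1.0]]                    # 1x1 low-resolution stream
    print(exchange(high, low))
```

The point of the sketch is structural: the high-resolution stream is never discarded and later recovered, it is carried through every step and updated with information from the coarser stream.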
HROM: Learning High-Resolution Representation and Object-Aware Masks for Visual Object Tracking
TLDR
A novel high-resolution Siamese network is proposed, which connects the high-to-low resolution convolution streams in parallel and repeatedly exchanges information across resolutions to maintain high-resolution representations.
Estimating Human Pose Efficiently by Parallel Pyramid Networks
  • Lin Zhao, Nannan Wang, Chen Gong, Jian Yang, Xinbo Gao
  • Medicine, Computer Science
  • IEEE Transactions on Image Processing
  • 2021
TLDR
This paper designs a novel network architecture for human pose estimation that aims to strike a fine balance between speed and accuracy. The architecture is called a “parallel pyramid” network (PPNet) because features of different resolutions are processed at different levels of the hierarchical model.
Deep Dual-resolution Networks for Real-time and Accurate Semantic Segmentation of Road Scenes
TLDR
Novel deep dual-resolution networks (DDRNets) are proposed for real-time semantic segmentation of road scenes, and a new contextual information extractor, the Deep Aggregation Pyramid Pooling Module (DAPPM), is designed to enlarge effective receptive fields and fuse multi-scale context.
Generic Perceptual Loss for Modeling Structured Output Dependencies
TLDR
It is demonstrated that a randomly weighted deep CNN can be used to model the structured dependencies of outputs, removing the requirements of pre-training and of a particular network structure (commonly VGG) that were previously assumed for the perceptual loss.
Deep High-Resolution Representation Learning for Cross-Resolution Person Re-identification
TLDR
A Deep High-Resolution Pseudo-Siamese Framework (PS-HRNet) is proposed to solve the problem of matching person images with the same identity from different cameras; the pseudo-Siamese framework is constructed to reduce the difference in feature distributions between low-resolution and high-resolution images.
Parsing very high resolution urban scene images by learning deep ConvNets with edge-aware loss
TLDR
A standalone end-to-end edge-aware neural network (EaNet) is proposed for urban scene semantic segmentation that incorporates a large kernel pyramid pooling (LKPP) module to capture rich multi-scale context with strong continuous feature relations.
Conformer: Local Features Coupling Global Representations for Visual Recognition
TLDR
Experiments show that Conformer, under comparable parameter complexity, outperforms the visual transformer (DeiT-B) by 2.3% on ImageNet. On MSCOCO, it outperforms ResNet-101 by 3.7% and 3.6% mAP for object detection and instance segmentation, respectively, demonstrating its potential as a general backbone network.
Generating Superpixels for High-resolution Images with Decoupled Patch Calibration
  • Yaxiong Wang, Yuchao Wei, Xueming Qian, Li Zhu, Yi Yang
  • Computer Science
  • ArXiv
  • 2021
TLDR
This paper devises Patch Calibration Networks (PCNet), which aim to implement high-resolution superpixel segmentation efficiently and accurately, and makes the first attempt to consider deep-learning-based superpixel generation for high-resolution cases.
Variational Structured Attention Networks for Deep Visual Representation Learning
TLDR
VISTA-Net outperforms the state of the art in multiple continuous and discrete prediction tasks, thus confirming the benefit of the proposed approach in joint structured spatial-channel attention estimation for deep representation learning.
Attention-Based Context Aware Network for Semantic Comprehension of Aerial Scenery
TLDR
An end-to-end semantic segmentation model for aerial images is developed, and the experimental results show that the model improves the baseline accuracy and outperforms some commonly used CNN architectures.

References

SHOWING 1-10 OF 207 REFERENCES
High-Resolution Representations for Labeling Pixels and Regions
TLDR
A simple modification is introduced to augment the high-resolution representation by aggregating the (upsampled) representations from all the parallel convolutions rather than only the representation from the high-resolution convolution, which leads to stronger representations, as evidenced by superior results.
Deep High-Resolution Representation Learning for Human Pose Estimation
TLDR
This paper proposes a network that maintains high-resolution representations through the whole process of human pose estimation and empirically demonstrates the effectiveness of the network through the superior pose estimation results over two benchmark datasets: the COCO keypoint detection dataset and the MPII Human Pose dataset.
Full-Resolution Residual Networks for Semantic Segmentation in Street Scenes
TLDR
This work proposes a novel ResNet-like architecture that exhibits strong localization and recognition performance, and combines multi-scale context with pixel-level accuracy by using two processing streams within the network.
RefineNet: Multi-path Refinement Networks for High-Resolution Semantic Segmentation
TLDR
RefineNet is presented, a generic multi-path refinement network that explicitly exploits all the information available along the down-sampling process to enable high-resolution prediction using long-range residual connections. It also introduces chained residual pooling, which captures rich background context in an efficient manner.
Multi-scale Location-Aware Kernel Representation for Object Detection
TLDR
This paper proposes a novel Multi-scale Location-aware Kernel Representation (MLKP) to capture high-order statistics of deep features in proposals, which achieves very competitive performance with state-of-the-art methods, and improves Faster R-CNN by 4.9%, 4.7% and 5.0% respectively.
Recurrent Scene Parsing with Perspective Understanding in the Loop
TLDR
This work proposes a depth-aware gating module that adaptively selects the pooling field size in a convolutional network architecture according to the object scale so that small details are preserved for distant objects while larger receptive fields are used for those nearby.
DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs
TLDR
This work addresses semantic image segmentation with deep learning. It proposes atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales, and improves the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models.
Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation
TLDR
This work extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results, especially along object boundaries, and applies depthwise separable convolution to both the Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network.
Deep Feature Pyramid Reconfiguration for Object Detection
TLDR
A novel reconfiguration architecture is proposed to combine low-level representations with high-level semantic features in a highly nonlinear yet efficient way to gather task-oriented features across different spatial locations and scales, globally and locally.
Laplacian Pyramid Reconstruction and Refinement for Semantic Segmentation
TLDR
A multi-resolution reconstruction architecture based on a Laplacian pyramid that uses skip connections from higher resolution feature maps and multiplicative gating to successively refine segment boundaries reconstructed from lower-resolution maps is described.