Corpus ID: 235313322

Container: Context Aggregation Network

@article{Gao2021ContainerCA,
  title={Container: Context Aggregation Network},
  author={Peng Gao and Jiasen Lu and Hongsheng Li and Roozbeh Mottaghi and Aniruddha Kembhavi},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.01401}
}
Convolutional neural networks (CNNs) are ubiquitous in computer vision, with a myriad of effective and efficient variations. Recently, Transformers – originally introduced in natural language processing – have been increasingly adopted in computer vision. While early adopters continue to employ CNN backbones, the latest networks are end-to-end CNN-free Transformer solutions. A recent surprising finding shows that a simple MLP-based solution without any traditional convolutional or Transformer…
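The unifying idea behind the CONTAINER block, as stated in the paper, is that convolution, self-attention, and MLP token mixing all amount to multiplying token features by an affinity matrix: static and input-independent for convolutions and MLP-Mixers, dynamically computed from the input for self-attention, combined with learnable mixing weights. The PyTorch sketch below illustrates that view; class and variable names are illustrative, not taken from the official implementation.

```python
import torch
import torch.nn as nn

class ContextAggregation(nn.Module):
    """Toy unified context-aggregation block (a sketch, not the official CONTAINER code).

    Output tokens are Y = (alpha * A_dyn + beta * A_static) @ V, where
      - A_static: learned, input-independent affinity (MLP-Mixer/conv-like)
      - A_dyn:    input-dependent softmax attention affinity (Transformer-like)
    """

    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        self.qk = nn.Linear(dim, 2 * dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.static_affinity = nn.Parameter(torch.zeros(num_tokens, num_tokens))
        # Learnable scalars weighting the dynamic and static affinity matrices.
        self.alpha = nn.Parameter(torch.tensor(1.0))
        self.beta = nn.Parameter(torch.tensor(1.0))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim)
        q, k = self.qk(x).chunk(2, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        affinity = self.alpha * attn + self.beta * self.static_affinity
        return self.proj(affinity @ self.v(x))


x = torch.randn(2, 49, 64)                      # 7x7 tokens, 64 channels
y = ContextAggregation(num_tokens=49, dim=64)(x)
print(y.shape)                                  # torch.Size([2, 49, 64])
```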

Citations of this paper

ConvMAE: Masked Convolution Meets Masked Autoencoders
TLDR
This paper demonstrates that a multi-scale hybrid convolution-transformer can learn more discriminative representations via the masked auto-encoding scheme, and adopts masked convolution to prevent information leakage in the convolution blocks.
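As a rough illustration of the masked-convolution idea in this TLDR, the sketch below re-applies the binary visibility mask around a convolution so that masked positions stay zero and cannot leak information through overlapping receptive fields. This is one reading of the stated mechanism, not ConvMAE's reference code.

```python
import torch
import torch.nn as nn

class MaskedConvBlock(nn.Module):
    """Illustrative masked convolution (a sketch of the idea, not ConvMAE's code).

    The binary mask (1 = visible, 0 = masked) is applied before and after the
    convolution, so features at masked positions are zeroed and cannot leak
    information through overlapping receptive fields.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); mask: (B, 1, H, W) with 1 for visible patches
        x = x * mask                  # zero masked positions before the conv
        x = self.act(self.conv(x))
        return x * mask               # re-zero: the conv has smeared values outward
```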
Distillation with Contrast is All You Need for Self-Supervised Point Cloud Representation Learning
TLDR
A simple and general framework for self-supervised point cloud representation learning that achieves state-of-the-art performance on linear classification and multiple other downstream tasks, and combines contrastive learning with knowledge distillation so that the teacher network is updated more effectively.
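The combination this TLDR describes, contrastive learning plus a distillation teacher that is updated more effectively, is commonly realized with a momentum (EMA) teacher and an InfoNCE-style loss. The generic sketch below assumes exactly that; the loss form and hyperparameters are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum: float = 0.996):
    """Momentum (EMA) teacher update, one common way to keep a distillation
    teacher smoothly updated from the student."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)

def contrastive_distill_loss(student_emb, teacher_emb, temperature: float = 0.07):
    """InfoNCE-style loss: each student embedding should match the teacher
    embedding of the same sample against the other samples in the batch.
    teacher_emb is assumed to be computed under torch.no_grad()."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    logits = s @ t.T / temperature                  # (B, B) similarity matrix
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, targets)
```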
Focal Modulation Networks
TLDR
The proposed focal modulation network (FocalNet for short), in which self-attention is completely replaced by a focal modulation module that is more effective and efficient for modeling token interactions, renders focal modulation a favorable alternative to self-attention for effective and efficient visual modeling in real-world applications.
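A simplified sketch of focal modulation as summarized above: context is gathered hierarchically with stacked depthwise convolutions, gated and summed into a modulator, which then multiplies a query projection element-wise in place of self-attention. Layer choices and sizes here are illustrative simplifications, not the official FocalNet code.

```python
import torch
import torch.nn as nn

class FocalModulation(nn.Module):
    """Simplified focal modulation (a sketch, not the official FocalNet code)."""

    def __init__(self, dim: int, focal_levels: int = 3):
        super().__init__()
        self.q = nn.Conv2d(dim, dim, 1)
        self.ctx_in = nn.Conv2d(dim, dim, 1)
        self.gates = nn.Conv2d(dim, focal_levels + 1, 1)
        # Stacked depthwise convs: each level enlarges the context window.
        self.levels = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size=3 + 2 * l, padding=1 + l, groups=dim),
                nn.GELU(),
            )
            for l in range(focal_levels)
        )
        self.h = nn.Conv2d(dim, dim, 1)   # modulator projection
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        q, ctx, gates = self.q(x), self.ctx_in(x), self.gates(x)
        m = 0
        for l, level in enumerate(self.levels):
            ctx = level(ctx)                       # hierarchical contextualization
            m = m + ctx * gates[:, l:l + 1]        # gated aggregation
        m = m + ctx.mean(dim=(2, 3), keepdim=True) * gates[:, -1:]  # global level
        return self.proj(q * self.h(m))            # element-wise modulation
```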
Linear Array Network for Low-light Image Enhancement
TLDR
A Linear Array Self-attention (LASA) mechanism is proposed, which uses only two 2-D feature encodings to construct 3-D global weights and then refines feature maps generated by convolution layers; it outperforms existing state-of-the-art (SOTA) methods on both RGB- and RAW-based low-light enhancement tasks with fewer parameters.
RestoreDet: Degradation Equivariant Representation for Object Detection in Low Resolution Images
TLDR
Extensive experiments show that the novel framework, RestoreDet, built on CenterNet, achieves superior performance compared with existing methods when facing varied degradation situations.
S2-MLP: Spatial-Shift MLP Architecture for Vision
TLDR
This paper proposes a novel pure-MLP architecture, spatial-shift MLP (S2-MLP), which achieves performance comparable to ViT on the ImageNet-1K dataset with a considerably simpler architecture and fewer FLOPs and parameters.
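The spatial-shift operation at the heart of S2-MLP can be sketched in a few lines: channels are split into four groups, each shifted one pixel in a different direction, so that a subsequent per-pixel MLP mixes information from the four neighbours. The border handling below (zero fill) is a simplification of the paper's shift.

```python
import torch

def spatial_shift(x: torch.Tensor) -> torch.Tensor:
    """Spatial shift (a sketch of the S2-MLP idea, not the reference code).

    Four channel groups are shifted one pixel right, left, down, and up,
    respectively; border positions are zero-filled here for simplicity.
    """
    B, C, H, W = x.shape
    c = C // 4
    out = torch.zeros_like(x)
    out[:, 0 * c:1 * c, :, 1:] = x[:, 0 * c:1 * c, :, :-1]   # shift right
    out[:, 1 * c:2 * c, :, :-1] = x[:, 1 * c:2 * c, :, 1:]   # shift left
    out[:, 2 * c:3 * c, 1:, :] = x[:, 2 * c:3 * c, :-1, :]   # shift down
    out[:, 3 * c:, :-1, :] = x[:, 3 * c:, 1:, :]             # shift up
    return out


x = torch.randn(2, 64, 14, 14)
print(spatial_shift(x).shape)    # torch.Size([2, 64, 14, 14])
```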
UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning
TLDR
A novel Unified transFormer (UniFormer) is proposed, which seamlessly integrates the merits of 3D convolution and spatiotemporal self-attention in a concise transformer format and achieves a preferable balance between computation and accuracy.
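One way to picture the UniFormer design: a transformer-style block whose token mixer is a local 3D depthwise convolution in shallow stages and global spatiotemporal self-attention in deep stages. The sketch below follows that description at a high level; normalization placement and other details are simplified assumptions, not the official code.

```python
import torch
import torch.nn as nn

class UniFormerStyleBlock(nn.Module):
    """Illustrative block in the spirit of UniFormer (not the official code):
    local 3D depthwise convolution or global spatiotemporal self-attention
    as the token mixer, followed by a shared MLP."""

    def __init__(self, dim: int, use_attention: bool):
        super().__init__()
        self.use_attention = use_attention
        if use_attention:
            # dim must be divisible by num_heads.
            self.mixer = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        else:
            self.mixer = nn.Conv3d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) video features
        B, C, T, H, W = x.shape
        if self.use_attention:
            tokens = x.flatten(2).transpose(1, 2)        # (B, T*H*W, C)
            mixed, _ = self.mixer(tokens, tokens, tokens)
            x = x + mixed.transpose(1, 2).reshape(B, C, T, H, W)
        else:
            x = x + self.mixer(x)                         # local relation aggregation
        tokens = x.flatten(2).transpose(1, 2)
        tokens = tokens + self.mlp(self.norm(tokens))
        return tokens.transpose(1, 2).reshape(B, C, T, H, W)
```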
AS-MLP: An Axial Shifted MLP Architecture for Vision
TLDR
An Axial Shifted MLP architecture (AS-MLP) is proposed, the first MLP-based architecture to be applied to downstream tasks; it achieves competitive performance compared to transformer-based architectures even with slightly lower FLOPs.
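The axial shift operation can be sketched as follows: channels are split into groups, and each group is shifted along one spatial axis by a different offset (zero padding at the borders), letting a following channel MLP mix axially neighbouring features. A full AS-MLP block applies this along the vertical and horizontal axes in parallel; the helper below is an illustrative reconstruction, not the reference code.

```python
import torch
import torch.nn.functional as F

def axial_shift(x: torch.Tensor, shift_size: int = 3, axis: int = 2) -> torch.Tensor:
    """Axial shift (a sketch of the AS-MLP idea, not the reference code).

    Splits channels into `shift_size` groups and shifts each group along the
    chosen spatial axis (2 = height, 3 = width) by offsets
    -shift_size//2 .. +shift_size//2, with zero padding at the borders.
    """
    B, C, H, W = x.shape
    pad = shift_size // 2
    padding = (0, 0, pad, pad) if axis == 2 else (pad, pad, 0, 0)
    xp = F.pad(x, padding)                       # zero-pad the shifted axis
    length = H if axis == 2 else W
    shifted = [
        torch.narrow(chunk, axis, start, length)  # start encodes the offset
        for start, chunk in enumerate(xp.chunk(shift_size, dim=1))
    ]
    return torch.cat(shifted, dim=1)


x = torch.randn(2, 96, 56, 56)
v = axial_shift(x, shift_size=3, axis=2)   # vertical-shift branch
h = axial_shift(x, shift_size=3, axis=3)   # horizontal-shift branch
```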
Are we ready for a new paradigm shift? A Survey on Visual Deep MLP
TLDR
This review paper provides detailed discussions on whether MLPs can be a new paradigm for computer vision, and compares in detail the intrinsic connections and differences between convolution, the self-attention mechanism, and token-mixing MLPs.
Dual-stream Network for Visual Recognition
TLDR
This paper presents a generic Dual-stream Network (DS-Net) that fully explores the representation capacity of local and global pattern features for image classification; it computes fine-grained and integrated features simultaneously and fuses them efficiently.
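The dual-stream layout this TLDR describes can be sketched as a convolutional stream operating at full resolution for fine-grained features, plus a self-attention stream operating on downsampled tokens for global context, fused back together. Every module choice below (depthwise conv, average-pool downsampling, additive fusion) is an illustrative assumption rather than DS-Net's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualStreamBlock(nn.Module):
    """Illustrative dual-stream block (a sketch of the idea, not DS-Net's code)."""

    def __init__(self, dim: int, down: int = 4, heads: int = 4):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # local stream
        self.down = nn.AvgPool2d(down)                              # token reduction
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        local = self.local(x)                       # fine-grained features
        g = self.down(x)                            # (B, C, H/d, W/d)
        h, w = g.shape[-2:]
        tokens = g.flatten(2).transpose(1, 2)       # (B, h*w, C)
        g, _ = self.attn(tokens, tokens, tokens)    # global context
        g = g.transpose(1, 2).reshape(B, C, h, w)
        g = F.interpolate(g, size=(H, W), mode="bilinear", align_corners=False)
        return self.fuse(local + g)                 # fuse the two streams
```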
