Corpus ID: 235313322

Container: Context Aggregation Network

Authors: Peng Gao, Jiasen Lu, Hongsheng Li, Roozbeh Mottaghi, Aniruddha Kembhavi
Convolutional neural networks (CNNs) are ubiquitous in computer vision, with a myriad of effective and efficient variations. Recently, Transformers – originally introduced in natural language processing – have been increasingly adopted in computer vision. While early adopters continue to employ CNN backbones, the latest networks are end-to-end, CNN-free Transformer solutions. A surprising recent finding shows that a simple MLP-based solution without any traditional convolutional or Transformer…

Figures and Tables from this paper

Citations

ConvMAE: Masked Convolution Meets Masked Autoencoders

This paper demonstrates that a multi-scale hybrid convolution-transformer can learn more discriminative representations via the masked auto-encoding scheme, and adopts masked convolution to prevent information leakage in the convolution blocks.
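To illustrate the leakage-prevention idea, here is a minimal 1-D NumPy sketch (not ConvMAE's actual implementation): masked inputs are zeroed before the sliding window, and outputs at masked positions are dropped, so features at visible positions never incorporate masked content. The function name and zero-padding scheme are assumptions made for illustration.

```python
import numpy as np

def masked_conv1d(x, mask, kernel):
    """1-D convolution that zeroes masked inputs before the sliding window,
    so visible-position outputs never mix in masked content."""
    x = x * mask                     # hide masked positions from the kernel
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, pad)              # zero padding keeps output length = input length
    out = np.array([xp[i:i + k] @ kernel for i in range(len(x))])
    return out * mask                # also drop outputs at masked positions
```

With an all-ones kernel, a visible position next to a masked one sums only the visible (and zeroed) neighbors, so no masked value can leak through.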

Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking

Mimic before Reconstruct for Masked Autoencoders, named MR-MAE, is proposed, which jointly learns high-level and low-level representations without interference during pre-training and achieves superior visual representations for various downstream tasks.

Bi-Directional Self-Attention for Vision Transformers

Inverse Self-Attention (ISA) is proposed, an attention layer complementary to standard self-attention (SA); the two are coupled by convexly combining their outputs, and the scheme can be easily adapted into any existing transformer architecture to improve the expressibility of attention layers.
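A minimal sketch of the convex-combination idea, assuming a single scalar mixing weight `alpha` (in the paper it would be learnable); the sign-flipped scores here are only an illustrative stand-in for the inverse attention variant, not the paper's exact definition of ISA:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v, sign=1.0):
    """Scaled dot-product attention; sign=-1 emphasizes the least-similar
    keys, an illustrative stand-in for an 'inverse' attention variant."""
    scores = sign * (q @ k.T) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def bidirectional_attention(q, k, v, alpha=0.5):
    """Convex combination of the standard and inverse attention outputs."""
    return alpha * attention(q, k, v, 1.0) + (1.0 - alpha) * attention(q, k, v, -1.0)
```

With `alpha=1.0` the layer reduces exactly to standard attention, so the combination strictly generalizes SA.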

SPViT: Enabling Faster Vision Transformers via Latency-Aware Soft Token Pruning

A dynamic attention-based multi-head token selector is designed, which is a lightweight module for adaptive instance-wise token selection, and a soft pruning technique is introduced, which integrates the less informative tokens chosen by the selector module into a package token rather than discarding them completely.
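The soft-pruning step can be sketched as follows; the function name, the scoring interface, and the score-weighted fusion rule are illustrative assumptions, not SPViT's exact formulation:

```python
import numpy as np

def soft_prune(tokens, scores, keep):
    """Keep the top-`keep` tokens by informativeness score; fuse the rest
    into one score-weighted 'package' token instead of discarding them."""
    order = np.argsort(scores)[::-1]           # indices sorted by score, descending
    kept, pruned = order[:keep], order[keep:]
    w = scores[pruned] / scores[pruned].sum()  # normalize pruned-token scores
    package = (w[:, None] * tokens[pruned]).sum(axis=0)
    return np.vstack([tokens[kept], package])  # shape: (keep + 1, dim)
```

The sequence shrinks from N tokens to keep + 1, cutting attention cost while the package token preserves a summary of the pruned content.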

Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer

This work identifies that the severe performance drop stems from information distortion in the low-bit quantized self-attention map, and develops an information rectification module (IRM) and a distribution-guided distillation (DGD) scheme for fully quantized vision transformers (Q-ViT) to effectively eliminate this distortion.

Underwater Target Detection Based on Improved YOLOv4

Experimental results show that the proposed underwater target detection algorithm, based on an improved YOLOv4, outperforms several other methods.

PS-ARM: An End-to-End Attention-aware Relation Mixer Network for Person Search

A novel attention-aware relation mixer (ARM) module for person search is proposed, which exploits the global relations between different local regions within the RoI of a person, making it robust against appearance deformations and occlusion.

Adaptive Local Context Embedding for Small Vehicle Detection from Aerial Optical Remote Sensing Images

The results indicate that the proposed ALC-Net achieves competitive small-vehicle detection performance compared with other detectors.

End-to-end View Synthesis via NeRF Attention

The NeRF attention (NeRFA) is proposed, which considers the volumetric rendering equation as a soft feature modulation procedure and adopts the ray and pixel transformers to learn the interactions between rays and pixels.

Static and Dynamic Concepts for Self-supervised Video Representation Learning

A novel learning scheme is proposed that first learns general visual concepts and then attends to discriminative local areas for video understanding; it utilizes static frames and frame differences to help decouple static and dynamic concepts, and aligns the two concept distributions in latent space.



References

Emerging Properties in Self-Supervised Vision Transformers

This paper questions whether self-supervised learning provides Vision Transformers (ViT) with new properties that stand out compared to convolutional networks (convnets), and implements DINO, a form of self-distillation with no labels, highlighting the synergy between DINO and ViTs.

Deformable DETR: Deformable Transformers for End-to-End Object Detection

Deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference, can achieve better performance than DETR (especially on small objects) with 10× fewer training epochs.

Focal Loss for Dense Object Detection

This paper proposes to address the extreme foreground-background class imbalance encountered during training of dense detectors by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples, and develops a novel Focal Loss, which focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training.
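The reshaped loss is FL(p_t) = -α_t (1 - p_t)^γ log(p_t), where p_t is the predicted probability of the true class; a minimal NumPy sketch for the binary case:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss: the (1 - p_t)^gamma factor down-weights
    well-classified examples so easy negatives cannot dominate training.

    p: predicted probability of the positive class; y: 0/1 label.
    """
    p_t = p if y == 1 else 1.0 - p           # probability of the true class
    a_t = alpha if y == 1 else 1.0 - alpha   # class-balancing weight
    return -a_t * (1.0 - p_t) ** gamma * np.log(p_t)

# An easy, well-classified negative (p=0.1, y=0) contributes far less
# loss than a hard positive at the same score (p=0.1, y=1).
easy = focal_loss(0.1, 0)
hard = focal_loss(0.1, 1)
```

With gamma=0 the modulating factor vanishes and the loss reduces to alpha-weighted cross entropy.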

MLP-Mixer: An all-MLP Architecture for Vision

It is shown that while convolutions and attention are both sufficient for good performance, neither is necessary, and that MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs), attains competitive scores on image classification benchmarks.
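A minimal sketch of one Mixer layer (layer norm is omitted and ReLU stands in for GELU for brevity; the weight shapes are illustrative): the token-mixing MLP acts across the patch dimension via a transpose, and the channel-mixing MLP acts across the channel dimension per patch, each with a skip connection.

```python
import numpy as np

def mlp(x, w1, w2):
    # Two-layer MLP; ReLU stands in for the GELU used in the paper.
    return np.maximum(x @ w1, 0.0) @ w2

def mixer_block(x, tok_w1, tok_w2, ch_w1, ch_w2):
    """One Mixer layer on x of shape (num_patches, channels)."""
    # Token mixing: transpose so the MLP mixes across patches.
    x = x + mlp(x.T, tok_w1, tok_w2).T
    # Channel mixing: the MLP mixes across channels, independently per patch.
    x = x + mlp(x, ch_w1, ch_w2)
    return x
```

Neither MLP sees a convolution or an attention map; patch-wise and channel-wise mixing alone propagate information across the image.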

Fast Convergence of DETR with Spatially Modulated Co-Attention

This work proposes a simple yet effective scheme for improving the DETR framework, namely Spatially Modulated Co-Attention (SMCA) mechanism, which increases DETR’s convergence speed by replacing the original co-attention mechanism in the decoder while keeping other operations in DETR unchanged.

Training data-efficient image transformers & distillation through attention

This work produces a competitive convolution-free transformer by training on Imagenet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.

End-to-End Object Detection with Transformers

This work presents a new method that views object detection as a direct set prediction problem, and demonstrates accuracy and run-time performance on par with the well-established and highly-optimized Faster RCNN baseline on the challenging COCO object detection dataset.

Deep Residual Learning for Image Recognition

This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
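The core idea, H(x) = F(x) + x, in a minimal NumPy sketch (two linear layers stand in for the convolutional layers of the actual block): the network only has to learn the residual F(x) = H(x) - x, and with weights near zero the block defaults to the identity mapping, which is what eases optimization at depth.

```python
import numpy as np

def residual_block(x, w1, w2):
    """Basic residual unit: F(x) + x, with F a two-layer ReLU network."""
    return x + np.maximum(x @ w1, 0.0) @ w2

# With zero weights the residual branch vanishes and the block
# is exactly the identity mapping.
x = np.array([1.0, -2.0, 3.0])
w = np.zeros((3, 3))
out = residual_block(x, w, w)
```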

ImageNet classification with deep convolutional neural networks

A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.

ImageNet: A large-scale hierarchical image database

A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.