• Corpus ID: 246411225

DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR

Shilong Liu, Feng Li, Hao Zhang, Xiao Bin Yang, Xianbiao Qi, Hang Su, Jun Zhu, Lei Zhang
We present in this paper a novel query formulation using dynamic anchor boxes for DETR (DEtection TRansformer) and offer a deeper understanding of the role of queries in DETR. This new formulation directly uses box coordinates as queries in Transformer decoders and dynamically updates them layer-by-layer. Using box coordinates not only helps use explicit positional priors to improve the query-to-feature similarity and eliminate the slow training convergence issue in DETR, but also allows us to…
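The layer-by-layer update of box-coordinate queries described above can be illustrated with a minimal sketch. It is not the authors' implementation. The sketch assumes anchors are normalized `(x, y, w, h)` tuples and that each decoder layer adds a predicted offset in logit space before squashing back to `[0, 1]`; that update rule is an assumption, since the abstract does not spell it out. `toy_offset_head` is a hypothetical stand-in for a learned decoder layer, and here it simply moves the anchor partway toward a known target box.

```python
import math


def inverse_sigmoid(p, eps=1e-5):
    """Map a probability-like value in (0, 1) to logit space, clamped for stability."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def refine_anchor(anchor, delta):
    """Assumed update rule: add the offset in logit space, then squash back to [0, 1]."""
    return tuple(sigmoid(inverse_sigmoid(a) + d) for a, d in zip(anchor, delta))


def toy_offset_head(anchor, target, step=0.5):
    """Hypothetical stand-in for a learned layer: move partway toward the target box."""
    return tuple(
        step * (inverse_sigmoid(t) - inverse_sigmoid(a))
        for a, t in zip(anchor, target)
    )


def run_decoder(anchor, target, num_layers=6):
    """Refine one anchor box through num_layers layers, recording every intermediate box."""
    history = [anchor]
    for _ in range(num_layers):
        delta = toy_offset_head(anchor, target)
        anchor = refine_anchor(anchor, delta)
        history.append(anchor)
    return history


if __name__ == "__main__":
    boxes = run_decoder((0.5, 0.5, 0.2, 0.2), (0.3, 0.7, 0.4, 0.1))
    for layer, box in enumerate(boxes):
        print(layer, tuple(round(v, 4) for v in box))
```

In this toy setting the anchor moves closer to the target box after every layer, which is the sense in which the query is a "dynamic" anchor rather than a fixed learned embedding.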
DN-DETR: Accelerate DETR Training by Introducing Query DeNoising
A novel denoising training method to speed up DETR (DEtection TRansformer) training and offer a deepened understanding of the slow convergence issue of DETR-like methods with a ResNet-50 backbone.
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection
DINO improves over previous DETR-like models in performance and efficiency by using a contrastive way for denoising training, a mixed query selection method for anchor initialization, and a look forward twice scheme for box prediction.
Dynamic Focus-aware Positional Queries for Semantic Segmentation
The overall framework, termed FASeg (Focus-Aware semantic Segmentation), provides a simple yet effective solution for semantic segmentation and proposes an efficient way to deal with high-resolution cross-attention by dynamically determining the contextual tokens based on the low-resolution cross-attention maps to perform local relation aggregation.
Improving Transferability for Domain Adaptive Detection Transformers
This paper proposes the Object-Aware Alignment (OAA) module and the Optimal Transport based Alignment module to achieve comprehensive domain alignment on the outputs of the backbone and the detector to build a simple but effective baseline with a DETR-style detector on domain shift settings.
Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation
The experiments show that Mask DINO significantly outperforms all existing specialized segmentation methods, both on a ResNet-50 backbone and a pre-trained model with SwinL backbone.
What Are Expected Queries in End-to-End Object Detection?
This paper shows that the expected queries in end-to-end object detection should be Dense Distinct Queries (DDQ), and introduces dense priors back to the framework to generate dense queries.
AO2-DETR: Arbitrary-Oriented Object Detection Transformer
This paper proposes an Arbitrary-Oriented Object DEtection TRansformer framework, termed AO2-DETR, which comprises three dedicated components and considerably simplifies the overall pipeline and presents a new AOOD paradigm.
A Survey of Visual Transformers
This survey has comprehensively reviewed over one hundred different visual Transformers according to three fundamental CV tasks and different data stream types, and proposed the deformable attention module which combines the best of the sparse spatial sampling of deformable convolution, and the relation modeling capability of Transformers.
PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images
PETRv2, a unified framework for 3D perception from multi-view images based on PETR, is proposed; it explores the effectiveness of temporal modeling, using the temporal information of previous frames to boost 3D object detection and BEV segmentation.
Visual Attention Network
A novel large kernel attention (LKA) module is proposed to enable self-adaptive and long-range correlations in self-attention while avoiding the above issues and a novel neural network based on LKA is introduced, namely Visual Attention Network (VAN).


Anchor DETR: Query Design for Transformer-Based Detector
The proposed detector, called Anchor DETR, can achieve better performance and run faster than DETR with 10× fewer training epochs; the paper also proposes an attention variant that can reduce the memory cost while achieving similar or better performance than the standard attention in DETR.
Efficient DETR: Improving End-to-End Object Detector with Dense Prior
This paper investigates that the random initialization of object containers, which include object queries and reference points, is mainly responsible for the requirement of multiple iterations of object detection, and proposes Efficient DETR, a simple and efficient pipeline for end-to-end object detection.
Fast Convergence of DETR with Spatially Modulated Co-Attention
This work proposes a simple yet effective scheme for improving the DETR framework, namely Spatially Modulated Co-Attention (SMCA) mechanism, which increases DETR’s convergence speed by replacing the original co-attention mechanism in the decoder while keeping other operations in DETR unchanged.
Dynamic DETR: End-to-End Object Detection with Dynamic Attention
This paper introduces dynamic attentions into both the encoder and decoder stages of DETR to break its two limitations on small feature resolution and slow training convergence and introduces a dynamic decoder by replacing the cross-attention module with a ROI-based dynamic attention in the Transformer decoder.
Conditional DETR for Fast Training Convergence
The approach, named conditional DETR, learns a conditional spatial query from the decoder embedding for decoder multi-head cross-attention, which narrows down the spatial range for localizing the distinct regions for object classification and box regression, thus relaxing the dependence on the content embeddings and easing the training.
End-to-End Object Detection with Transformers
This work presents a new method that views object detection as a direct set prediction problem, and demonstrates accuracy and run-time performance on par with the well-established and highly-optimized Faster RCNN baseline on the challenging COCO object detection dataset.
Rethinking Transformer-based Set Prediction for Object Detection
Experimental results show that the proposed methods not only converge much faster than the original DETR, but also significantly outperform DETR and other baselines in terms of detection accuracy.
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
This work proposes a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit and derives a robust initialization method that particularly considers the rectifier nonlinearities.
Sparse R-CNN: End-to-End Object Detection with Learnable Proposals
Pei Sun, Rufeng Zhang, Ping Luo · Computer Science · 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) · 2021
Sparse R-CNN demonstrates accuracy, run-time and training convergence performance on par with the well-established detector baselines on the challenging COCO dataset, e.g., achieving 45.0 AP in the standard 3× training schedule and running at 22 fps using a ResNet-50 FPN model.
FCOS: Fully Convolutional One-Stage Object Detection
For the first time, a much simpler and more flexible detection framework achieving improved detection accuracy is demonstrated, and it is hoped that the proposed FCOS framework can serve as a simple and strong alternative for many other instance-level tasks.