RAMS-Trans: Recurrent Attention Multi-scale Transformer for Fine-grained Image Recognition

  • Yunqing Hu, Xuan Jin, Yin Zhang, Ha Hong, Jingfeng Zhang, Yuan He, Hui Xue
  • Published 2021
  • Computer Science
  • Proceedings of the 29th ACM International Conference on Multimedia
In fine-grained image recognition (FGIR), the localization and amplification of region attention is an important factor, which has been explored extensively in approaches based on convolutional neural networks (CNNs). The recently developed vision transformer (ViT) has achieved promising results in computer vision tasks. Compared with CNNs, image sequentialization is an entirely new approach. However, ViT is limited in its receptive field size and thus lacks local attention like CNNs due to the fixed size…
1 Citation

A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition
  • Yuan Zhang, Jian Cao, +4 authors Weiqian Chen
  • Computer Science
  • ArXiv
  • 2021
This work proposes a novel method named Adaptive attention multi-scale Fusion Transformer (AFTrans), which can achieve SOTA performance on three published fine-grained benchmarks: CUB-200-2011, Stanford Dogs and iNat2017.


Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-Grained Image Recognition
A novel recurrent attention convolutional neural network (RA-CNN) is proposed, which recursively learns discriminative region attention and region-based feature representation at multiple scales in a mutually reinforced way and achieves the best performance on three fine-grained tasks.
TransFG: A Transformer Architecture for Fine-grained Recognition
A novel transformer-based framework TransFG is proposed where all raw attention weights of the transformer are integrated into an attention map for guiding the network to effectively and accurately select discriminative image patches and compute their relations.
Filtration and Distillation: Enhancing Region Attention for Fine-Grained Visual Categorization
A novel “Filtration and Distillation Learning” (FDL) model is proposed to enhance the region attention of discriminative parts for FGVC. It uses the proposing-predicting matchability as the performance metric of the Region Proposal Network (RPN), thus enabling direct optimization of the RPN to filter for the most discriminative regions.
Learning Multi-attention Convolutional Neural Network for Fine-Grained Image Recognition
This paper proposes a novel part learning approach by a multi-attention convolutional neural network (MA-CNN), where part generation and feature learning can reinforce each other, and shows the best performances on three challenging published fine-grained datasets.
Attention Convolutional Binary Neural Tree for Fine-Grained Visual Categorization
An attention convolutional binary neural tree architecture is presented to address problems in weakly supervised fine-grained visual categorization; it uses an attention transformer module to force the network to capture discriminative features.
Looking for the Devil in the Details: Learning Trilinear Attention Sampling Network for Fine-Grained Image Recognition
TASN consists of a trilinear attention module, which generates attention maps by modeling the inter-channel relationships, an attention-based sampler which highlights attended parts with high resolution, and a feature distiller, which distills part features into an object-level feature by weight sharing and feature preserving strategies.
Mask-CNN: Localizing Parts and Selecting Descriptors for Fine-Grained Image Recognition
The proposed Mask-CNN model has the smallest number of parameters, lowest feature dimensionality and highest recognition accuracy when compared with state-of-the-art fine-grained approaches.
Learning a Discriminative Filter Bank Within a CNN for Fine-Grained Recognition
This work shows that mid-level representation learning can be enhanced within the CNN framework, by learning a bank of convolutional filters that capture class-specific discriminative patches without extra part or bounding box annotations.
Fine-Grained Recognition as HSnet Search for Informative Image Parts
This work addresses fine-grained image classification by formulating the problem as a sequential search for informative parts over a deep feature map produced by a deep Convolutional Neural Network (CNN).
Learning Deep Bilinear Transformation for Fine-grained Image Representation
A deep bilinear transformation (DBT) block is proposed, which can be deeply stacked in convolutional neural networks to learn fine-grained image representations and achieves new state-of-the-art results on several fine-grained image recognition benchmarks, including CUB-Bird, Stanford-Car, and FGVC-Aircraft.