Corpus ID: 232233178

TransFG: A Transformer Architecture for Fine-grained Recognition

  • Ju He, Jieneng Chen, Shuai Liu, Adam Kortylewski, Cheng Yang, Yutong Bai, C. Wang, A. Yuille
  • Published 2021
  • Computer Science
  • ArXiv
Fine-grained visual classification (FGVC), which aims to recognize objects from subcategories, is a very challenging task due to the inherently subtle inter-class differences. Recent works mainly tackle this problem by focusing on how to locate the most discriminative image regions and rely on them to improve the capability of networks to capture subtle variances. Most of these works achieve this by reusing the backbone network to extract features of selected regions. However, this strategy…
Feature Fusion Vision Transformer for Fine-Grained Visual Categorization
This work designs a novel token selection module called mutual attention weight selection (MAWS), which guides the network effectively and efficiently towards selecting discriminative tokens without introducing extra parameters, and demonstrates the effectiveness of FFVT on three benchmarks, where it achieves state-of-the-art performance.
RAMS-Trans: Recurrent Attention Multi-scale Transformer for Fine-grained Image Recognition
  • Yunqing Hu, Xuan Jin, +4 authors Hui Xue
  • Computer Science
  • ArXiv
  • 2021
This work proposes the recurrent attention multi-scale transformer (RAMS-Trans), which uses the transformer’s self-attention to recursively learn discriminative region attention in a multi-scale manner and achieves state-of-the-art results on three benchmark datasets.
Rediscovering R-NET: An Improvement and In-Depth Analysis on SQUAD 2.0
Question-answering is a discipline within the fields of information retrieval (IR) and natural language processing (NLP) that is concerned with building systems that automatically answer questions…
It's FLAN time! Summing feature-wise latent representations for interpretability
Inspired by linear models and the Kolmogorov-Arnold representation theorem, this work proposes a novel class of structurally constrained neural networks called FLANs (Feature-wise Latent Additive Networks), which increase the interpretability of deep learning models.


Filtration and Distillation: Enhancing Region Attention for Fine-Grained Visual Categorization
A novel “Filtration and Distillation Learning” (FDL) model is proposed to enhance the region attention of discriminative parts for FGVC. It uses the proposing-predicting matchability as the performance metric of the Region Proposal Network (RPN), thus enabling direct optimization of the RPN to filter the most discriminative regions.
Fine-Grained Visual Classification via Progressive Multi-Granularity Training of Jigsaw Patches
This work proposes a novel framework for fine-grained visual classification with a progressive training strategy that effectively fuses features from different granularities, and a random jigsaw patch generator that encourages the network to learn features at specific granularities.
Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-Grained Image Recognition
A novel recurrent attention convolutional neural network (RA-CNN) is proposed, which recursively learns discriminative region attention and region-based feature representation at multiple scales in a mutually reinforced way and achieves the best performance in three fine-grained tasks.
Attention Convolutional Binary Neural Tree for Fine-Grained Visual Categorization
An attention convolutional binary neural tree architecture is presented to address weakly supervised fine-grained visual categorization; it uses an attention transformer module to force the network to capture discriminative features.
Looking for the Devil in the Details: Learning Trilinear Attention Sampling Network for Fine-Grained Image Recognition
TASN consists of a trilinear attention module, which generates attention maps by modeling the inter-channel relationships; an attention-based sampler, which highlights attended parts with high resolution; and a feature distiller, which distills part features into an object-level feature by weight sharing and feature preserving strategies.
Cross-X Learning for Fine-Grained Visual Categorization
This paper proposes Cross-X learning, a simple yet effective approach that exploits the relationships between different images and between different network layers for robust multi-scale feature learning. It involves two novel components: a cross-category cross-semantic regularizer that guides the extracted features to represent semantic parts, and a cross-layer regularizer that improves the robustness of multi-scale features by matching the prediction distribution across multiple layers.
Learning a Discriminative Filter Bank Within a CNN for Fine-Grained Recognition
This work shows that mid-level representation learning can be enhanced within the CNN framework by learning a bank of convolutional filters that capture class-specific discriminative patches without extra part or bounding box annotations.
Mask-CNN: Localizing Parts and Selecting Descriptors for Fine-Grained Image Recognition
The proposed Mask-CNN model has the smallest number of parameters, lowest feature dimensionality and highest recognition accuracy when compared with state-of-the-art fine-grained approaches.
Re-rank Coarse Classification with Local Region Enhanced Features for Fine-Grained Image Recognition
A retrieval-based coarse-to-fine framework is proposed, in which the top-N classification results are re-ranked using local region enhanced embedding features to improve top-1 accuracy and to obtain the discriminative regions for distinguishing fine-grained images.
Training data-efficient image transformers & distillation through attention
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.