Corpus ID: 245501976

Vision Transformer for Small-Size Datasets

@article{Lee2021VisionTF,
  title={Vision Transformer for Small-Size Datasets},
  author={Seung Hoon Lee and Seunghyun Lee and Byung Cheol Song},
  journal={ArXiv},
  year={2021},
  volume={abs/2112.13492}
}
Recently, the Vision Transformer (ViT), which applied the transformer structure to the image classification task, has outperformed convolutional neural networks. However, the high performance of the ViT results from pre-training on a large-size dataset such as JFT-300M, and its dependence on a large dataset is attributed to its low locality inductive bias. This paper proposes Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA), which effectively solve the lack of locality… 
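
The abstract only names the two components; the PyTorch sketch below is one plausible reading of them, not the authors' code. Shifted Patch Tokenization concatenates the image with four half-patch diagonal shifts before tokenization so each token covers a wider local neighbourhood, and Locality Self-Attention replaces the fixed softmax scaling with a learnable temperature while masking out self-token (diagonal) attention. Class names, default dimensions, and the zero-padded shift are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def shift(x, dx, dy):
    """Shift an image batch by (dx, dy) pixels with zero padding."""
    return F.pad(x, (dx, -dx, dy, -dy))

class ShiftedPatchTokenization(nn.Module):
    """Concatenate the input with four half-patch diagonal shifts along the
    channel axis, then split into non-overlapping patches and project."""
    def __init__(self, in_ch=3, dim=192, patch=16):
        super().__init__()
        self.patch = patch
        feat = 5 * in_ch * patch * patch            # original + 4 shifted views
        self.norm = nn.LayerNorm(feat)
        self.proj = nn.Linear(feat, dim)

    def forward(self, x):                           # x: (B, C, H, W)
        s = self.patch // 2
        views = [x, shift(x, s, s), shift(x, -s, s),
                 shift(x, s, -s), shift(x, -s, -s)]
        x = torch.cat(views, dim=1)                 # (B, 5C, H, W)
        x = F.unfold(x, kernel_size=self.patch, stride=self.patch)  # (B, 5C*p*p, N)
        x = x.transpose(1, 2)                       # (B, N, 5C*p*p)
        return self.proj(self.norm(x))              # (B, N, dim)

class LocalitySelfAttention(nn.Module):
    """Self-attention with a learnable softmax temperature and a diagonal
    mask, so each token must attend to tokens other than itself."""
    def __init__(self, dim, heads=3):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.out = nn.Linear(dim, dim)
        self.temperature = nn.Parameter(torch.tensor((dim // heads) ** 0.5))

    def forward(self, x):                           # x: (B, N, D)
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.heads, D // self.heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)        # each (B, H, N, d)
        attn = (q @ k.transpose(-2, -1)) / self.temperature
        attn = attn.masked_fill(
            torch.eye(N, dtype=torch.bool, device=x.device), float('-inf'))
        attn = attn.softmax(dim=-1)
        return self.out((attn @ v).transpose(1, 2).reshape(B, N, D))
```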

Vision Transformers in 2022: An Update on Tiny ImageNet

TLDR
An update on vision transformers' performance on Tiny ImageNet is offered, and the Swin Transformer beats the current state-of-the-art result with a validation accuracy of 91.35%.

MDMLP: Image Classification from Scratch on Small Datasets with MLP

TLDR
A conceptually simple and lightweight MLP-based architecture that achieves SOTA when trained from scratch on small-size datasets, together with a novel and efficient MLP-based attention mechanism that highlights objects in images, indicating its explanatory power.

EfficientFormer: Vision Transformers at MobileNet Speed

TLDR
This work shows that properly designed transformers can reach extremely low latency on mobile devices while maintaining high performance; a latency-driven analysis of the ViT architecture and the experimental results validate the claim that powerful vision transformers can achieve ultra-fast inference speed on the edge.

Swin transformer for hyperspectral rare sub-pixel target detection

TLDR
This paper adapts the Swin Transformer for hyperspectral classification and rare sub-pixel target detection, applying the new architecture to commonly studied public classification benchmark datasets and to a new, large-scale airborne sub-pixel target detection dataset the authors developed.

Your ViT is Secretly a Hybrid Discriminative-Generative Diffusion Model

TLDR
This work establishes a direct connection between DDPM and ViT by integrating the ViT architecture into DDPM, and introduces a new generative model called Generative ViT (GenViT), which is among the first to explore a single ViT for image generation and classification jointly.

Few-Shot Diffusion Models

TLDR
Few-Shot Diffusion Models (FSDM), a framework for few-shot generation leveraging conditional DDPMs, are introduced, and conditioning the model on patch-based input set information is shown to improve training convergence.

Class-attention Video Transformer for Engagement Intensity Prediction

TLDR
A new end-to-end method, Class Attention in Video Transformer (CavT), uses a single vector to process the class embedding and to uniformly perform end-to-end learning on variable-length long videos and fixed-length short videos.

Multi-Class CNN for Classification of Multispectral and Autofluorescence Skin Lesion Clinical Images

TLDR
It was concluded from saliency maps that the classification performed by the convolutional neural network is based on the distribution of the major skin chromophores and endogenous fluorophores, and the resulting classification confusion matrices have been investigated and discussed.

CV4Code: Sourcecode Understanding via Visual Code Representations

We present CV4Code, a compact and effective computer vision method for sourcecode understanding. Our method leverages the contextual and the structural information available from the code snippet…

AA-TransUNet: Attention Augmented TransUNet For Nowcasting Tasks

TLDR
This paper introduces a novel data-driven predictive model based on TransUNet for the precipitation nowcasting task and shows that the proposed model outperforms other examined models on both tested datasets.

References

Showing 1-10 of 45 references

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

TLDR
A new Tokens-To-Token Vision Transformer (T2T-ViT) incorporates an efficient backbone with a deep-narrow structure, motivated by CNN architecture design after an empirical study, and reduces the parameter count and MACs of vanilla ViT by half.

Training data-efficient image transformers & distillation through attention

TLDR
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.
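
As a rough illustration of the distillation-token idea summarized above (a sketch under my own assumptions, not the DeiT reference implementation): a learnable distillation token is appended alongside the class token, and its output head is supervised by the teacher's prediction, shown here in the hard-label variant; positional embeddings and the encoder itself are left abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillableViT(nn.Module):
    """Patch tokens plus [CLS] and [DIST] tokens feed a transformer encoder;
    two linear heads read out the two special tokens."""
    def __init__(self, embed, encoder, dim, num_classes):
        super().__init__()
        self.embed = embed                    # module producing (B, N, dim) patch tokens
        self.encoder = encoder                # any (B, N, dim) -> (B, N, dim) transformer
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head_cls = nn.Linear(dim, num_classes)
        self.head_dist = nn.Linear(dim, num_classes)

    def forward(self, images):
        tokens = self.embed(images)                                   # (B, N, dim)
        B = tokens.size(0)
        extra = torch.cat([self.cls_token, self.dist_token], dim=1).expand(B, -1, -1)
        out = self.encoder(torch.cat([extra, tokens], dim=1))         # (B, N+2, dim)
        return self.head_cls(out[:, 0]), self.head_dist(out[:, 1])

def deit_hard_distill_loss(logits_cls, logits_dist, labels, teacher_logits):
    """Half the loss uses the true labels (class token), half uses the
    teacher's argmax as a hard pseudo-label (distillation token)."""
    return 0.5 * F.cross_entropy(logits_cls, labels) + \
           0.5 * F.cross_entropy(logits_dist, teacher_logits.argmax(dim=-1))
```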

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

TLDR
Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

Going deeper with Image Transformers

TLDR
This work builds and optimizes deeper transformer networks for image classification and investigates the interplay of architecture and optimization of such dedicated transformers, making two architecture changes that significantly improve the accuracy of deep transformers.

Bottleneck Transformers for Visual Recognition

TLDR
BoTNet, a conceptually simple yet powerful backbone architecture that incorporates self-attention for multiple computer vision tasks including image classification, object detection, and instance segmentation, is presented, along with a simple adaptation of the BoTNet design for image classification.

Rethinking Spatial Dimensions of Vision Transformers

TLDR
A novel Pooling-based Vision Transformer (PiT) is proposed, which achieves improved model capability and generalization performance over ViT and outperforms the baseline on several tasks such as image classification, object detection, and robustness evaluation.

Self-Attention Generative Adversarial Networks

TLDR
The proposed SAGAN achieves state-of-the-art results, boosting the best published Inception score from 36.8 to 52.52 and reducing the Fréchet Inception distance from 27.62 to 18.65 on the challenging ImageNet dataset.

Tiny ImageNet Visual Recognition Challenge

TLDR
This work investigates the effect of convolutional network depth, receptive field size, dropout layers, rectified activation unit type and dataset noise on its accuracy in Tiny-ImageNet Challenge settings and achieves excellent performance even compared to state-of-the-art results.

Squeeze-and-Excitation Networks

TLDR
This work proposes a novel architectural unit, termed the “Squeeze-and-Excitation” (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels, and shows that these blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets.
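
The summary above describes the mechanism completely enough to sketch: a minimal PyTorch rendering of an SE block follows, assuming the paper's default reduction ratio of 16; everything else is a simplification for illustration.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global-average-pool each channel ("squeeze"),
    pass through a small bottleneck MLP with a sigmoid gate ("excitation"),
    and rescale the feature map channel-wise."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                              # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                         # squeeze: (B, C)
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)     # excitation: (B, C, 1, 1)
        return x * w                                   # channel-wise recalibration
```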

CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features

TLDR
Patches are cut and pasted among training images where the ground truth labels are also mixed proportionally to the area of the patches, and CutMix consistently outperforms state-of-the-art augmentation strategies on CIFAR and ImageNet classification tasks, as well as on ImageNet weakly-supervised localization task.
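
A minimal sketch of the augmentation described above, assuming a per-batch box and a Beta(alpha, alpha) mixing ratio as in the paper; the function name and interface are my own, and the official implementation differs in details.

```python
import numpy as np
import torch

def cutmix(images, labels, alpha=1.0):
    """Paste a random box from a shuffled copy of the batch onto each image
    and mix the labels in proportion to the pasted area."""
    B, _, H, W = images.shape
    lam = np.random.beta(alpha, alpha)                 # target mixing ratio
    perm = torch.randperm(B)

    # sample a box whose area is roughly (1 - lam) of the image
    cut_h, cut_w = int(H * np.sqrt(1 - lam)), int(W * np.sqrt(1 - lam))
    cy, cx = np.random.randint(H), np.random.randint(W)
    y1, y2 = np.clip([cy - cut_h // 2, cy + cut_h // 2], 0, H)
    x1, x2 = np.clip([cx - cut_w // 2, cx + cut_w // 2], 0, W)

    mixed = images.clone()
    mixed[:, :, y1:y2, x1:x2] = images[perm, :, y1:y2, x1:x2]
    # recompute lam from the actual (clipped) box area
    lam = 1.0 - ((y2 - y1) * (x2 - x1)) / (H * W)
    # train with: lam * CE(logits, labels) + (1 - lam) * CE(logits, labels[perm])
    return mixed, labels, labels[perm], lam
```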