Corpus ID: 235458262

XCiT: Cross-Covariance Image Transformers

@article{elnouby2021xcit,
  title={XCiT: Cross-Covariance Image Transformers},
  author={Alaaeldin El-Nouby and Hugo Touvron and Mathilde Caron and Piotr Bojanowski and Matthijs Douze and Armand Joulin and Ivan Laptev and Natalia Neverova and Gabriel Synnaeve and Jakob Verbeek and Herv{\'e} J{\'e}gou},
  year={2021}
}
Following tremendous success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens, i.e. words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images…
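To make the "cross-covariance" idea in the title concrete, here is a minimal single-head NumPy sketch of attention computed over the d × d cross-covariance of features rather than the n × n token pairs, so the cost is linear in the number of tokens. The per-feature L2 normalization and temperature follow the paper's description, but the learned projections and multi-head structure are omitted, and the function name is mine:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_covariance_attention(q, k, v, tau=1.0):
    """Attention over the (d, d) feature cross-covariance: cost is
    linear in the number of tokens n, not quadratic."""
    # q, k, v: (n, d) token embeddings after linear projections
    qn = q / (np.linalg.norm(q, axis=0, keepdims=True) + 1e-8)  # L2-normalize each feature column
    kn = k / (np.linalg.norm(k, axis=0, keepdims=True) + 1e-8)
    attn = softmax((qn.T @ kn) / tau, axis=-1)  # (d, d): feature-feature interactions
    return v @ attn.T                           # (n, d): mix along the channel axis
```

Each output channel is a softmax-weighted mixture of the value channels, so the token dimension never enters the score matrix.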
Contextual Transformer Networks for Visual Recognition
This work designs a novel Transformer-style module, the Contextual Transformer (CoT) block, for visual recognition, which can readily replace each 3 × 3 convolution in ResNet architectures, yielding a Transformer-style backbone named Contextual Transformer Networks (CoTNet).
RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision?
It is indicated that MLP-based models have the potential to replace CNNs by adopting inductive bias, and the proposed model, named RaftMLP, strikes a good balance between computational complexity, number of parameters, and actual memory usage.
A Survey on Vision Transformer
Transformer, first applied to the field of natural language processing, is a type of deep neural network mainly based on the self-attention mechanism. Thanks to its strong representation…
Scaled ReLU Matters for Training Vision Transformers
  • Pichao Wang, Xue Wang, +5 authors Rong Jin
  • Computer Science
  • 2021
Vision transformers (ViTs) have been an alternative design paradigm to convolutional neural networks (CNNs). However, the training of ViTs is much harder than that of CNNs, as it is sensitive to the training…
Attention mechanism and mixup data augmentation for classification of COVID-19 Computed Tomography images
A ResNet50 architecture extended with a feature-wise attention layer obtained a 95.57% accuracy score, which, to the best of the authors' knowledge, sets the state of the art on the challenging COVID-CT dataset.
Deep neural networks approach to microbial colony detection - a comparative analysis
Counting microbial colonies is a fundamental task in microbiology and has many applications in numerous industry branches. Despite this, current studies towards automatic microbial counting using…
Phenotyping of Klf14 mouse white adipose tissue enabled by whole slide segmentation with deep neural networks
The deep learning pipeline DeepCytometer and associated exploratory analysis reveal new insights into adipocyte heterogeneity and phenotyping.
CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation
  • Tongkun Xu, Weihua Chen, Pichao Wang, Fan Wang, Hao Li, Rong Jin
  • Computer Science
  • 2021
Unsupervised domain adaptation (UDA) aims to transfer knowledge learned from a labeled source domain to a different unlabeled target domain. Most existing UDA methods focus on learning…
DisCo: Remedy Self-supervised Learning on Lightweight Models with Distilled Contrastive Learning
This work proposes to distill the final embedding to maximally transmit a teacher’s knowledge to a lightweight model by constraining the last embedding of the student to be consistent with that of the teacher, and achieves the state-of-the-art on all lightweight models.


Incorporating Convolution Designs into Visual Transformers
A new Convolution-enhanced image Transformer (CeiT) is proposed which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies.
Image Transformer
This work generalizes a recently proposed model architecture based on self-attention, the Transformer, to a sequence modeling formulation of image generation with a tractable likelihood, and significantly increases the size of images the model can process in practice, despite maintaining significantly larger receptive fields per layer than typical convolutional neural networks.
Training data-efficient image transformers & distillation through attention
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.
Axial Attention in Multidimensional Transformers
Axial Transformers is proposed, a self-attention-based autoregressive model for images and other data organized as high-dimensional tensors, which maintains both full expressiveness over joint distributions and ease of implementation with standard deep learning frameworks, while requiring reasonable memory and computation.
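The core trick the summary above describes can be sketched in a few lines: restrict self-attention to one spatial axis at a time, so an H × W grid costs O(HW·(H+W)) instead of O((HW)²). This is a minimal NumPy illustration with identity Q/K/V projections and without the autoregressive masking the paper uses; the function name is mine:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axial_attention(x, axis):
    """Self-attention along a single spatial axis of a 2-D feature map."""
    # x: (H, W, d); move the attended axis to the front: (L, M, d)
    xt = np.moveaxis(x, axis, 0)
    L, M, d = xt.shape
    out = np.empty_like(xt)
    for m in range(M):                  # each line along the other axis is independent
        seq = xt[:, m, :]               # (L, d) one row or column of tokens
        a = softmax(seq @ seq.T / np.sqrt(d), axis=-1)
        out[:, m, :] = a @ seq
    return np.moveaxis(out, 0, axis)
```

Running the function once per axis (rows, then columns) lets information propagate across the whole grid in two steps.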
Emerging Properties in Self-Supervised Vision Transformers
This paper questions whether self-supervised learning provides new properties to Vision Transformers (ViT) that stand out compared to convolutional networks (convnets), and implements DINO, a simple self-supervised method that takes the form of self-distillation with no labels.
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
A new vision Transformer is presented that capably serves as a general-purpose backbone for computer vision, has the flexibility to model at various scales, and has linear computational complexity with respect to image size.
Training Vision Transformers for Image Retrieval
This work adopts vision transformers for generating image descriptors and trains the resulting model with a metric learning objective, which combines a contrastive loss with a differential entropy regularizer, and shows consistent and significant improvements of transformers over convolution-based approaches.
Squeeze-and-Excitation Networks
This work proposes a novel architectural unit, termed the “Squeeze-and-Excitation” (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels, and shows that these blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets.
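The squeeze-excitation-recalibrate pipeline summarized above is simple enough to sketch in NumPy: global-average-pool each channel, pass the pooled vector through a two-layer bottleneck MLP with a sigmoid, and rescale the channels by the resulting gates. Weight shapes and the reduction ratio r are illustrative, not the paper's exact configuration:

```python
import numpy as np

def se_block(x, w1, w2):
    """Squeeze-and-Excitation: gate each channel by a learned scalar in (0, 1)."""
    # x: (C, H, W) feature maps; w1: (C//r, C) and w2: (C, C//r) FC weights
    z = x.mean(axis=(1, 2))                 # squeeze: global average pool -> (C,)
    s = np.maximum(w1 @ z, 0.0)             # excitation: bottleneck FC + ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))     # FC + sigmoid -> per-channel gates
    return x * s[:, None, None]             # recalibrate: rescale each channel
```

Because the gates live in (0, 1), the block can only attenuate channels, letting the network emphasize informative features at negligible extra cost.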
Linformer: Self-Attention with Linear Complexity
This paper demonstrates that the self-attention mechanism of the Transformer can be approximated by a low-rank matrix, and proposes a new self-attention mechanism which reduces the overall self-attention complexity from $O(n^2)$ to $O(n)$ in both time and space.
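The low-rank approximation described above amounts to projecting the keys and values from n sequence positions down to a fixed proj_k positions before computing attention, so the score matrix is (n, proj_k) rather than (n, n). A minimal NumPy sketch, with the projection matrices E and F taken as given (in the paper they are learned) and the function name mine:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(q, k, v, E, F):
    """Linear-complexity attention via sequence-length projection of K and V."""
    # q, k, v: (n, d); E, F: (proj_k, n) low-rank projections along the sequence
    kp, vp = E @ k, F @ v                                      # (proj_k, d)
    scores = softmax(q @ kp.T / np.sqrt(q.shape[1]), axis=-1)  # (n, proj_k)
    return scores @ vp                                         # (n, d)
```

With proj_k fixed, both the score matrix and the value aggregation scale linearly in n.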
Vision Transformers for Dense Prediction
Dense vision transformers are introduced, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks, and can be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context, where it sets the new state of the art.