Corpus ID: 233210459

Escaping the Big Data Paradigm with Compact Transformers

@article{Hassani2021EscapingTB,
  title={Escaping the Big Data Paradigm with Compact Transformers},
  author={Ali Hassani and Steven Walton and Nikhil Shah and Abulikemu Abuduweili and Jiachen Li and Humphrey Shi},
  journal={ArXiv},
  year={2021},
  volume={abs/2104.05704}
}
With the rise of Transformers as the standard for language processing, and their advancements in computer vision, there has been a corresponding growth in parameter size and amounts of training data. Many have come to believe that, because of this, transformers are not suitable for small sets of data. This trend leads to concerns such as the limited availability of data in certain scientific domains and the exclusion of those with limited resources from research in the field. In this paper, we aim to… 
Aggregating Nested Transformers
TLDR
Beyond image classification, the idea of nesting basic local transformers on non-overlapping image blocks and aggregating them in a hierarchical manner is extended to image generation, and NesT is shown to yield a strong decoder that is 8× faster than previous transformer-based generators.
Vision Xformers: Efficient Attention for Image Classification
TLDR
This work modifies the ViT architecture to work on longer sequence data by replacing the quadratic attention with efficient transformers of linear complexity, such as Performer, Linformer and Nyströmformer, creating Vision X-formers (ViX), and shows that all three versions of ViX can be more accurate than ViT for image classification while using far fewer parameters and computational resources.
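As an illustration of the linear-complexity attention these variants rely on, below is a minimal sketch of kernelized linear attention using the elu(x)+1 feature map (in the style of Katharopoulos et al.), used here as a stand-in; the Performer, Linformer and Nyströmformer approximations differ in detail, and the function name and shapes are assumptions for this example.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized attention with O(n) cost in the sequence length.

    q, k, v: (batch, heads, seq_len, dim). Uses the elu(x) + 1 feature map
    as an illustrative kernel; the papers above use their own approximations.
    """
    q = F.elu(q) + 1                                 # positive feature map phi(q)
    k = F.elu(k) + 1                                 # positive feature map phi(k)
    kv = torch.einsum("bhnd,bhne->bhde", k, v)       # sum_n phi(k_n) v_n^T
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)
```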
Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer
TLDR
A new vision transformer, named Shuffle Transformer, is proposed that is highly efficient and easy to implement by modifying two lines of code; depth-wise convolution is introduced to complement the spatial shuffle by enhancing neighbor-window connections.
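The spatial shuffle itself can be sketched as a reshape-and-permute over the window grid, analogous to ShuffleNet's channel shuffle applied along the spatial axes; the tensor layout and function name below are illustrative assumptions, not the paper's exact code.

```python
import torch

def spatial_shuffle(x, window_size):
    """Shuffle tokens across windows, analogous to channel shuffle but
    applied along the spatial axes.

    x: (B, H, W, C) feature map; H and W must be divisible by window_size.
    After the shuffle, a subsequent window partition groups tokens that came
    from different windows, creating cross-window connections.
    """
    B, H, W, C = x.shape
    w = window_size
    x = x.view(B, H // w, w, W // w, w, C)
    # swap the "which window" and "position inside window" axes
    x = x.permute(0, 2, 1, 4, 3, 5).contiguous()
    return x.view(B, H, W, C)
```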
MSN: Efficient Online Mask Selection Network for Video Instance Segmentation
TLDR
This work presents a novel solution for Video Instance Segmentation (VIS) that automatically generates instance-level segmentation masks along with the object class and tracks them across a video using the Mask Selection Network (MSN).
Test-Time Robust Personalization for Federated Learning
TLDR
This work identifies the pitfalls of existing works under test-time distribution shifts and proposes a novel test-time robust personalization method, Federated Test-time Head Ensemble plus tuning (FedTHE+), demonstrating its advantages over strong competitors.
CV4Code: Sourcecode Understanding via Visual Code Representations
We present CV4Code, a compact and effective computer vision method for source-code understanding. Our method leverages the contextual and structural information available from the code snippet.
Deep Visual Geo-localization Benchmark
TLDR
A new open-source benchmarking framework for Visual Geo-localization is proposed that allows building, training, and testing a wide range of commonly used architectures, with the flexibility to change individual components of a geo-localization pipeline.
Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding
TLDR
This paper explores the idea of nesting basic local transformers on non-overlapping image blocks and aggregating them in a hierarchical way and finds that the block aggregation function plays a critical role in enabling cross-block non-local information communication.
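A rough sketch of such a block-aggregation step between hierarchy levels, under the assumption that it mixes neighbouring blocks with a small convolution and then downsamples; the exact operator choices and their ordering in NesT may differ from this illustration.

```python
import torch
import torch.nn as nn

class BlockAggregation(nn.Module):
    """Illustrative aggregation between hierarchy levels: a small convolution
    mixes information across neighbouring blocks, then pooling halves the
    spatial resolution."""
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.conv = nn.Conv2d(dim_in, dim_out, kernel_size=3, padding=1)
        self.norm = nn.LayerNorm(dim_out)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):            # x: (B, C, H, W) image-shaped feature map
        x = self.conv(x)
        x = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        return self.pool(x)
```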
Couplformer: Rethinking Vision Transformer with Coupling Attention Map
TLDR
A novel memory-efficient attention mechanism named Couplformer is proposed, which decouples the attention map into two sub-matrices and generates the alignment scores from spatial information; it can serve as an efficient backbone for visual tasks and provides a novel perspective on the attention mechanism for researchers.
Hybrid BYOL-ViT: Efficient approach to deal with small datasets
TLDR
This paper investigates how self-supervision with strong and sufficient augmentation of unlabeled data can train the first layers of a neural network effectively, even better than supervised learning, with no need for millions of labeled examples.
...

References

SHOWING 1-10 OF 60 REFERENCES
Incorporating Convolution Designs into Visual Transformers
TLDR
A new Convolution-enhanced image Transformer (CeiT) is proposed which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
TLDR
Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
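For reference, the patch-embedding step that turns an image into "16×16-word" tokens is commonly implemented as a strided convolution; a minimal sketch, with class and argument names chosen for this example:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and linearly project each one,
    which is exactly a strided convolution. A 224x224 input with 16x16
    patches yields 196 tokens."""
    def __init__(self, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # (B, 3, H, W)
        x = self.proj(x)                       # (B, D, H/16, W/16)
        return x.flatten(2).transpose(1, 2)    # (B, num_patches, D)
```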
Training data-efficient image transformers & distillation through attention
TLDR
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.
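A minimal sketch of the hard-label distillation objective described for DeiT: the class-token head is supervised by the ground truth while the distillation-token head is supervised by the teacher's hard predictions. Function and argument names here are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def deit_hard_distillation_loss(cls_logits, dist_logits, labels, teacher_logits):
    """Average of two cross-entropies: class head vs. ground truth, and
    distillation head vs. the (hard) predictions of a convnet teacher."""
    ce_cls = F.cross_entropy(cls_logits, labels)
    teacher_labels = teacher_logits.argmax(dim=-1)
    ce_dist = F.cross_entropy(dist_logits, teacher_labels)
    return 0.5 * ce_cls + 0.5 * ce_dist
```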
CvT: Introducing Convolutions to Vision Transformers
TLDR
A new architecture is presented that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs, and the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in this model.
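A hedged sketch of a convolutional projection in the spirit of CvT: tokens are reshaped back to a 2D map and passed through a depth-wise separable convolution before attention, so local spatial context is baked into the queries, keys and values and explicit positional encodings become unnecessary. Layer choices and names below are approximate, not the official implementation.

```python
import torch
import torch.nn as nn

class ConvProjection(nn.Module):
    """Depth-wise separable convolution applied to the 2D token map before
    attention (sketch)."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2,
                            groups=dim)               # depth-wise
        self.bn = nn.BatchNorm2d(dim)
        self.pw = nn.Conv2d(dim, dim, kernel_size=1)  # point-wise

    def forward(self, tokens, h, w):                  # tokens: (B, h*w, C)
        x = tokens.transpose(1, 2).reshape(tokens.size(0), -1, h, w)
        x = self.pw(self.bn(self.dw(x)))
        return x.flatten(2).transpose(1, 2)           # back to (B, h*w, C)
```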
Revisiting Unreasonable Effectiveness of Data in Deep Learning Era
TLDR
It is found that performance on vision tasks increases logarithmically with the volume of training data, and it is shown that representation learning (or pre-training) still holds a lot of promise.
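Loosely, the claimed relationship says performance gains a roughly constant amount each time the pre-training set size N is multiplied by a fixed factor; a and b below are task-dependent constants introduced here only for illustration:

```latex
\mathrm{performance}(N) \;\approx\; a \log N + b
```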
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
TLDR
A new Tokens-to-Token Vision Transformer (T2T-ViT) is proposed, which incorporates an efficient backbone with a deep-narrow structure for vision transformers, motivated by CNN architecture design after empirical study, and reduces the parameter count and MACs of vanilla ViT by half.
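The "soft split" at the core of the Tokens-to-Token module can be sketched as an overlapping unfold over the 2D token map: neighbouring tokens are concatenated with overlap, shrinking the sequence while growing the per-token dimension. The helper below is an illustrative approximation, not the official implementation.

```python
import torch
import torch.nn.functional as F

def soft_split(tokens, h, w, kernel_size=3, stride=2, padding=1):
    """One 'soft split': unfold the 2D token map with overlapping windows.

    tokens: (B, h*w, C). Returns (B, new_len, C * kernel_size**2).
    Repeating this a few times yields the shorter, wider token sequence
    described in the paper."""
    b, n, c = tokens.shape
    x = tokens.transpose(1, 2).reshape(b, c, h, w)
    x = F.unfold(x, kernel_size=kernel_size, stride=stride, padding=padding)
    return x.transpose(1, 2)
```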
Optimizing Deeper Transformers on Small Datasets: An Application on Text-to-SQL Semantic Parsing
TLDR
This work successfully trains 48 layers of transformers for a semantic parsing task and obtains state-of-the-art performance on the challenging cross-domain Text-to-SQL semantic parsing benchmark Spider.
Attention is All you Need
TLDR
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
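The core operation is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V; a minimal self-contained sketch (the mask argument is an illustrative convention):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.

    q, k, v have shape (..., seq_len, d_k); positions where mask == 0 are
    excluded from attention."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```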
Squeeze-and-Excitation Networks
TLDR
This work proposes a novel architectural unit, termed the “Squeeze-and-Excitation” (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels, and shows that these blocks can be stacked to form SENet architectures that generalise extremely effectively across different datasets.
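A minimal sketch of an SE block as described: squeeze each channel to a scalar by global average pooling, excite per-channel weights with a small bottleneck MLP, then rescale the input channel-wise. Hyperparameters such as the reduction ratio are illustrative.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: pool -> bottleneck MLP -> channel rescaling."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (B, C, H, W)
        s = x.mean(dim=(2, 3))                 # squeeze: (B, C)
        w = self.fc(s).unsqueeze(-1).unsqueeze(-1)
        return x * w                           # channel-wise recalibration
```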
ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
TLDR
GPSA, a form of positional self-attention which can be equipped with a “soft” convolutional inductive bias, is introduced; the resulting ConViT outperforms DeiT on ImageNet while offering much improved sample efficiency.
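The gating idea can be sketched as a per-head blend of a content-based attention map and a purely positional one; the function below is an illustrative approximation of GPSA's combination step, with names and shapes chosen for this example.

```python
import torch

def gated_positional_attention(content_scores, positional_scores, gate):
    """Blend content attention with positional attention via a learned gate.

    content_scores, positional_scores: (B, heads, N, N) raw attention logits.
    gate: (heads,) learnable parameters, one per head; initialising the gate
    towards the positional term makes early training behave like a convolution.
    """
    lam = torch.sigmoid(gate).view(1, -1, 1, 1)
    attn = (1.0 - lam) * torch.softmax(content_scores, dim=-1) \
         + lam * torch.softmax(positional_scores, dim=-1)
    # explicit row renormalisation (redundant here since the mix is convex)
    return attn / attn.sum(dim=-1, keepdim=True)
```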
...