CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance
@article{Zhao2022CodedVTRCS,
  title   = {CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance},
  author  = {Tianchen Zhao and Niansong Zhang and Xuefei Ning and He Wang and Li Yi and Yu Wang},
  journal = {ArXiv},
  year    = {2022},
  volume  = {abs/2203.09887}
}
Transformers have gained much attention by outperforming convolutional neural networks in many 2D vision tasks. However, they are known to have generalization problems and to rely on massive-scale pre-training and sophisticated training techniques. When applied to 3D tasks, the irregular data structure and limited data scale add to the difficulty of applying transformers. We propose CodedVTR (Codebook-based Voxel TRansformer), which improves data efficiency and generalization ability for 3D…
References
Showing 1–10 of 36 references
Training data-efficient image transformers & distillation through attention
- Computer Science · ICML · 2021
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token, ensuring that the student learns from the teacher through attention.
DeepViT: Towards Deeper Vision Transformer
- Computer Science · ArXiv · 2021
This paper proposes a simple yet effective method, named Re-attention, to regenerate the attention maps and increase their diversity at different layers with negligible computation and memory cost, making it feasible to train deeper ViTs with consistent performance improvements via minor modifications to existing ViT models.
CvT: Introducing Convolutions to Vision Transformers
- Computer Science · 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
A new architecture is presented that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs, and shows that the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in this model.
When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations
- Computer Science · ArXiv · 2021
By promoting smoothness with a recently proposed sharpness-aware optimizer, this paper substantially improves the accuracy and robustness of ViTs and MLP-Mixers on various tasks spanning supervised, adversarial, contrastive, and transfer learning.
Voxel Transformer for 3D Object Detection
- Computer Science · 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
Voxel Transformer is presented, a novel and effective voxel-based Transformer backbone for 3D object detection from point clouds that shows consistent improvement over convolutional baselines while maintaining computational efficiency on the KITTI dataset and the Waymo Open Dataset.
4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks
- Computer Science · 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
This work creates an open-source auto-differentiation library for sparse tensors that provides extensive functions for high-dimensional convolutional neural networks, and proposes the hybrid kernel, a special case of the generalized sparse convolution, as well as trilateral-stationary conditional random fields that enforce spatio-temporal consistency in the 7D space-time-chroma space.
ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
- Computer Science · ICML · 2021
GPSA is introduced, a form of positional self-attention that can be equipped with a "soft" convolutional inductive bias; it outperforms DeiT on ImageNet while offering much improved sample efficiency.
PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space
- Computer Science · NIPS · 2017
A hierarchical neural network that applies PointNet recursively on a nested partitioning of the input point set and proposes novel set learning layers to adaptively combine features from multiple scales to learn deep point set features efficiently and robustly.
Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution
- Computer Science · ECCV · 2020
This work proposes Sparse Point-Voxel Convolution (SPVConv), a lightweight 3D module that equips the vanilla Sparse Convolution with the high-resolution point-based branch, and presents 3D Neural Architecture Search (3D-NAS) to search the optimal network architecture over this diverse design space efficiently and effectively.
Can Vision Transformers Perform Convolution?
- Computer Science · ArXiv · 2021
This work proves that a single ViT layer with image patches as the input can perform any convolution operation constructively, where the multi-head attention mechanism and the relative positional encoding play essential roles.