An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- A. Dosovitskiy, L. Beyer, N. Houlsby
- Computer Science, International Conference on Learning…
- 22 October 2020
Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
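A minimal sketch of the core idea in the title, assuming PyTorch; the model sizes, class count, and the `TinyViT` name are illustrative rather than the paper's configuration. The image is cut into 16x16 patches, each patch is linearly embedded into a token, and the resulting sequence (plus a class token and position embeddings) is processed by a standard Transformer encoder:

```python
# Illustrative ViT-style model (PyTorch assumed; hyperparameters are not the paper's).
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=192, depth=4, heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding as a strided convolution: one 16x16 patch -> one token.
        self.to_tokens = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):                      # images: (B, 3, H, W)
        tokens = self.to_tokens(images)             # (B, dim, H/16, W/16)
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, num_patches, dim)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        x = torch.cat([cls, tokens], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])                   # classify from the class token

logits = TinyViT()(torch.randn(2, 3, 224, 224))     # -> shape (2, 1000)
```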
CARLA: An Open Urban Driving Simulator
- A. Dosovitskiy, G. Ros, Felipe Codevilla, Antonio M. López, V. Koltun
- Computer Science, Conference on Robot Learning
- 18 October 2017
This work introduces CARLA, an open-source simulator for autonomous driving research, and uses it to study the performance of three approaches to autonomous driving: a classic modular pipeline, an end-to-end model trained via imitation learning, and an end-to-end model trained via reinforcement learning.
FlowNet: Learning Optical Flow with Convolutional Networks
- A. Dosovitskiy, P. Fischer, T. Brox
- Computer Science, IEEE International Conference on Computer Vision
- 26 April 2015
This paper constructs CNNs which are capable of solving the optical flow estimation problem as a supervised learning task, and proposes and compares two architectures: a generic architecture and another one including a layer that correlates feature vectors at different image locations.
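A rough sketch of what such a correlation layer computes, assuming PyTorch; the `correlation` function, displacement range, and feature shapes are illustrative rather than the paper's exact operator. Each feature vector in the first map is compared, via a channel-wise dot product, with feature vectors of the second map over a small neighborhood of displacements:

```python
# Sketch of a correlation (cost-volume) layer: compare feature vectors of two maps
# at nearby displacements. PyTorch assumed; max_disp and shapes are illustrative.
import torch
import torch.nn.functional as F

def correlation(f1, f2, max_disp=4):
    # f1, f2: (B, C, H, W) feature maps extracted from the two input images.
    B, C, H, W = f1.shape
    f2 = F.pad(f2, [max_disp] * 4)            # pad so shifted windows stay in bounds
    volumes = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = f2[:, :, dy:dy + H, dx:dx + W]
            # Dot product over channels = similarity score for this displacement.
            volumes.append((f1 * shifted).sum(dim=1, keepdim=True) / C)
    return torch.cat(volumes, dim=1)           # (B, (2*max_disp+1)**2, H, W)

cost = correlation(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(cost.shape)  # torch.Size([1, 81, 32, 32])
```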
A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation
This paper proposes three synthetic stereo video datasets with sufficient realism, variation, and size to successfully train large networks and presents a convolutional network for real-time disparity estimation that provides state-of-the-art results.
Striving for Simplicity: The All Convolutional Net
- J. T. Springenberg, A. Dosovitskiy, T. Brox, Martin A. Riedmiller
- Computer Science, International Conference on Learning…
- 21 December 2014
It is found that max-pooling can simply be replaced by a convolutional layer with increased stride without loss in accuracy on several image recognition benchmarks.
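A small illustration of that observation, assuming PyTorch; the channel counts and kernel sizes are arbitrary. Both branches downsample a feature map by a factor of two, but the second does so with a learned strided convolution instead of max-pooling:

```python
# Max-pooling vs. strided convolution as the downsampling step (PyTorch assumed).
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

pooled = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)(x)

all_conv = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # learned downsampling
)(x)

print(pooled.shape, all_conv.shape)  # both torch.Size([1, 64, 16, 16])
```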
FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks
- Eddy Ilg, N. Mayer, Tonmoy Saikia, M. Keuper, A. Dosovitskiy, T. Brox
- Computer Science, Computer Vision and Pattern Recognition
- 6 December 2016
The concept of end-to-end learning of optical flow is advanced and shown to work well, and faster variants that allow optical flow computation at up to 140 fps with accuracy matching the original FlowNet are presented.
MLP-Mixer: An all-MLP Architecture for Vision
- I. Tolstikhin, N. Houlsby, A. Dosovitskiy
- Computer Science, Neural Information Processing Systems
- 4 May 2021
It is shown that while convolutions and attention are both sufficient for good performance, neither of them is necessary, and MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs), attains competitive scores on image classification benchmarks.
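A minimal sketch of one Mixer block, assuming PyTorch; the token count and hidden sizes are illustrative. A token-mixing MLP operates across patch positions and a channel-mixing MLP operates across features, with no convolutions or attention involved:

```python
# One Mixer block: token mixing across patches, channel mixing across features
# (PyTorch assumed; dimensions are illustrative).
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    def __init__(self, num_tokens=196, dim=512, token_hidden=256, channel_hidden=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_tokens, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_tokens))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

    def forward(self, x):                      # x: (B, num_tokens, dim)
        # Mix information across tokens (transpose so the Linear runs over the token axis).
        y = self.norm1(x).transpose(1, 2)      # (B, dim, num_tokens)
        x = x + self.token_mlp(y).transpose(1, 2)
        # Mix information across channels, independently per token.
        return x + self.channel_mlp(self.norm2(x))

out = MixerBlock()(torch.randn(2, 196, 512))   # -> shape (2, 196, 512)
```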
On Evaluation of Embodied Navigation Agents
- Peter Anderson, Angel X. Chang, A. Zamir
- Computer Science, ArXiv
- 18 July 2018
The present document summarizes the consensus recommendations of a working group convened to study empirical methodology in navigation research; it discusses different problem statements and the role of generalization, presents evaluation measures, and provides standard scenarios that can be used for benchmarking.
End-to-End Driving Via Conditional Imitation Learning
- Felipe Codevilla, Matthias Müller, A. Dosovitskiy, Antonio M. López, V. Koltun
- Computer Science, IEEE International Conference on Robotics and…
- 6 October 2017
This work evaluates different architectures for conditional imitation learning in vision-based driving and conducts experiments in realistic three-dimensional simulations of urban driving and on a 1/5 scale robotic truck that is trained to drive in a residential area.
Object-Centric Learning with Slot Attention
- Francesco Locatello, Dirk Weissenborn, Thomas Kipf
- Computer ScienceNeural Information Processing Systems
- 26 June 2020
Slot Attention is presented: an architectural component that interfaces with perceptual representations, such as the output of a convolutional neural network, and produces a set of task-dependent abstract representations (slots) that are exchangeable and can bind to any object in the input by specializing through a competitive procedure over multiple rounds of attention.
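A compact sketch of such a competitive attention procedure, assuming PyTorch; it follows the spirit of the module described above but simplifies the refinement step, and the slot count, dimensions, and iteration count are arbitrary. Slots compete for input features because the attention softmax is taken over the slot axis, and each slot is then updated from its attention-weighted mean of the inputs:

```python
# Simplified slot-style competitive attention (PyTorch assumed; not the exact published module).
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    def __init__(self, num_slots=7, dim=64, iters=3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_sigma = nn.Parameter(torch.ones(1, 1, dim))
        self.to_q, self.to_k, self.to_v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in, self.norm_slots = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, inputs):                     # inputs: (B, N, dim), e.g. flattened CNN features
        B, N, D = inputs.shape
        inputs = self.norm_in(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        slots = self.slots_mu + self.slots_sigma * torch.randn(B, self.num_slots, D)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            # Softmax over the slot axis: slots compete for each input feature.
            attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=1)
            attn = attn / attn.sum(dim=-1, keepdim=True)     # weighted mean over inputs
            updates = attn @ v                                # (B, num_slots, D)
            slots = self.gru(updates.reshape(-1, D), slots.reshape(-1, D)).view(B, self.num_slots, D)
        return slots

slots = SlotAttention()(torch.randn(2, 32 * 32, 64))   # -> shape (2, 7, 64)
```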
...