Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
TLDR
This paper proposes SwAV, an online algorithm that takes advantage of contrastive methods without requiring pairwise comparisons to be computed. It uses a swapped prediction mechanism, predicting the cluster assignment of one view from the representation of another view.
Emerging Properties in Self-Supervised Vision Transformers
TLDR
This paper asks whether self-supervised learning gives Vision Transformers (ViTs) new properties that stand out compared to convolutional networks (convnets). It implements DINO, a form of self-distillation with no labels, and shows the synergy between DINO and ViTs.
Cross-Stitch Networks for Multi-task Learning
TLDR
This paper proposes a principled approach to learning shared representations in convolutional networks through multi-task learning. It introduces a new sharing unit, the "cross-stitch" unit, which combines the activations from multiple networks and can be trained end-to-end.
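As a rough illustration of the sharing unit described in this summary, the sketch below linearly mixes the activations of two task networks at one layer. The function name `cross_stitch` and the 2x2 `alpha` layout are assumptions made for this example; in a real model the mixing weights would be learned by backpropagation and applied to feature maps, not plain lists.

```python
def cross_stitch(x_a, x_b, alpha):
    """Mix same-shaped activations from two task networks.

    alpha is a 2x2 matrix ((a_aa, a_ab), (a_ba, a_bb)) of mixing weights.
    Here the weights are fixed numbers for illustration only; they are
    assumed to be trainable parameters in an actual network.
    """
    (a_aa, a_ab), (a_ba, a_bb) = alpha
    # Each output is a linear combination of both inputs, element by element.
    out_a = [a_aa * u + a_ab * v for u, v in zip(x_a, x_b)]
    out_b = [a_ba * u + a_bb * v for u, v in zip(x_a, x_b)]
    return out_a, out_b


# Hypothetical activations from task A and task B at the same layer.
act_a = [1.0, 2.0]
act_b = [3.0, 4.0]
mixed_a, mixed_b = cross_stitch(act_a, act_b, ((0.9, 0.1), (0.1, 0.9)))
```

With the identity matrix as `alpha` the two networks stay fully task-specific; larger off-diagonal weights correspond to more sharing between tasks.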
Barlow Twins: Self-Supervised Learning via Redundancy Reduction
TLDR
This work proposes an objective function that naturally avoids collapse by measuring the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of a sample, and making it as close to the identity matrix as possible.
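The objective summarized here can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the helper names, the list-of-lists batch layout, and the default `off_diag_weight` value are assumptions chosen for the example.

```python
import math


def standardize_columns(batch):
    """Return per-feature columns, each with zero mean and unit variance."""
    n = len(batch)
    columns = []
    for d in range(len(batch[0])):
        col = [row[d] for row in batch]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        std = std if std > 0 else 1.0  # guard against constant features
        columns.append([(v - mean) / std for v in col])
    return columns


def cross_correlation(z_a, z_b):
    """Cross-correlation matrix between two batches of embeddings."""
    cols_a = standardize_columns(z_a)
    cols_b = standardize_columns(z_b)
    n = len(z_a)
    return [[sum(a * b for a, b in zip(ca, cb)) / n for cb in cols_b]
            for ca in cols_a]


def barlow_twins_loss(z_a, z_b, off_diag_weight=5e-3):
    """Penalize deviation of the cross-correlation matrix from identity.

    off_diag_weight is an assumed trade-off value for this sketch.
    """
    c = cross_correlation(z_a, z_b)
    dims = range(len(c))
    on_diag = sum((1.0 - c[i][i]) ** 2 for i in dims)
    off_diag = sum(c[i][j] ** 2 for i in dims for j in dims if i != j)
    return on_diag + off_diag_weight * off_diag


# Two hypothetical batches of 2-D embeddings (four samples each).
decorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
redundant = [[1.0, 1.0], [-1.0, -1.0], [2.0, 2.0], [-2.0, -2.0]]
loss_decorrelated = barlow_twins_loss(decorrelated, decorrelated)
loss_redundant = barlow_twins_loss(redundant, redundant)
```

The decorrelated batch already has an identity cross-correlation matrix, so its loss is essentially zero. The redundant batch, whose two features carry the same information, is penalized through the off-diagonal term.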
Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification
TLDR
This paper presents an approach for learning a visual representation from the raw spatiotemporal signals in videos using a Convolutional Neural Network, and shows that this method captures information that is temporally varying, such as human pose.
Self-Supervised Learning of Pretext-Invariant Representations
TLDR
This work develops Pretext-Invariant Representation Learning (PIRL), which learns invariant representations based on pretext tasks and substantially improves the semantic quality of the learned image representations, setting a new state of the art in self-supervised learning from images.
Cut, Paste and Learn: Surprisingly Easy Synthesis for Instance Detection
TLDR
This paper proposes a simple approach for generating large annotated instance datasets with minimal effort. The approach outperforms existing synthesis approaches and, when combined with real images, improves relative performance by more than 21% on benchmark datasets.
Visual Storytelling
TLDR
Modelling concrete description as well as figurative and social language, as provided in this dataset and the storytelling task, has the potential to move artificial intelligence from basic understandings of typical visual scenes towards more and more human-like understanding of grounded event structure and subjective expression.
From Red Wine to Red Tomato: Composition with Context
TLDR
This paper presents a simple method that respects contextuality in order to compose classifiers of known visual concepts and builds upon the intuition that classifiers lie in a smooth space where compositional transforms can be modeled.
In Defense of Grid Features for Visual Question Answering
TLDR
This paper revisits grid features for VQA, and finds they can work surprisingly well -- running more than an order of magnitude faster with the same accuracy (e.g. if pre-trained in a similar fashion).
...