Publications
Cross-Stitch Networks for Multi-task Learning
Multi-task learning in Convolutional Networks has displayed remarkable success in the field of recognition. This success can be largely attributed to learning shared representations from multiple …
  • 362
  • 63
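As a rough illustration of the cross-stitch idea, the sketch below (PyTorch, with illustrative shapes and an assumed near-identity initialisation) shows a unit that linearly recombines same-layer activations from two task-specific networks using learned weights:

```python
import torch
import torch.nn as nn

class CrossStitchUnit(nn.Module):
    """Learned linear recombination of activations from two task networks.

    For activations x_a, x_b at the same layer, the unit outputs
        x_a' = alpha_aa * x_a + alpha_ab * x_b
        x_b' = alpha_ba * x_a + alpha_bb * x_b
    where the four alphas are learned jointly with both networks.
    """
    def __init__(self):
        super().__init__()
        # Initialised close to identity so each task starts mostly task-specific
        # (the 0.9/0.1 values are an illustrative choice).
        self.alpha = nn.Parameter(torch.tensor([[0.9, 0.1],
                                                [0.1, 0.9]]))

    def forward(self, x_a, x_b):
        out_a = self.alpha[0, 0] * x_a + self.alpha[0, 1] * x_b
        out_b = self.alpha[1, 0] * x_a + self.alpha[1, 1] * x_b
        return out_a, out_b

# Example: recombine conv feature maps from two single-task backbones.
unit = CrossStitchUnit()
x_a = torch.randn(4, 64, 32, 32)   # task-A activations
x_b = torch.randn(4, 64, 32, 32)   # task-B activations
x_a2, x_b2 = unit(x_a, x_b)
```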
Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification
In this paper, we present an approach for learning a visual representation from the raw spatiotemporal signals in videos. Our representation is learned without supervision from semantic labels. We …
  • 297
  • 40
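A minimal sketch of the temporal order verification pretext task follows; the tuple sampling is simplified relative to the paper, and the frame features and verification head are illustrative:

```python
import random
import torch
import torch.nn as nn

def sample_tuple(num_frames, ordered=True):
    """Sample a 3-frame index tuple from a video with `num_frames` frames.

    Positives keep temporal order (a < b < c); negatives replace the middle
    frame with one from outside the (a, c) window, so the tuple can only be
    rejected by reasoning about temporal order.
    """
    assert num_frames >= 5
    a, b, c = sorted(random.sample(range(1, num_frames - 1), 3))
    if ordered:
        return (a, b, c), 1
    wrong_mid = random.choice([i for i in range(num_frames) if i < a or i > c])
    return (a, wrong_mid, c), 0

class OrderVerifier(nn.Module):
    """Binary head over concatenated per-frame features (backbone omitted)."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(3 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 2))

    def forward(self, f1, f2, f3):
        return self.head(torch.cat([f1, f2, f3], dim=1))

# Toy usage: random vectors stand in for per-frame CNN embeddings.
idx, label = sample_tuple(num_frames=30, ordered=False)
verifier = OrderVerifier()
logits = verifier(torch.randn(1, 128), torch.randn(1, 128), torch.randn(1, 128))
```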
Visual Storytelling
We introduce the first dataset for sequential vision-to-language, and explore how this data may be used for the task of visual storytelling. The first release of this dataset, SIND v.1, includes …
  • 146
  • 22
Generating Natural Questions About an Image
There has been an explosion of work in the vision & language community during the past few years, from image captioning to video transcription, and answering questions about images. These tasks have …
  • 158
  • 18
Cut, Paste and Learn: Surprisingly Easy Synthesis for Instance Detection
A major impediment in rapidly deploying object detection models for instance detection is the lack of large annotated datasets. For example, finding a large labeled dataset containing instances in a …
  • 140
  • 18
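A toy sketch of the cut-and-paste synthesis step, using synthetic NumPy arrays in place of real object crops and background scenes; the paper's blending strategies, which matter for detector quality, are omitted here:

```python
import numpy as np

def paste_instance(background, obj, mask, top, left):
    """Composite a cropped object onto a background using its binary mask.

    `background` is HxWx3, `obj` is hxwx3, `mask` is hxw, and (top, left)
    is the paste location. The pasted region's box doubles as a free
    bounding-box annotation for instance detection.
    """
    out = background.copy()
    h, w = mask.shape
    out[top:top + h, left:left + w][mask > 0] = obj[mask > 0]
    bbox = (left, top, left + w, top + h)   # (x1, y1, x2, y2)
    return out, bbox

# Toy usage with synthetic arrays standing in for a real crop and scene.
bg = np.zeros((240, 320, 3), dtype=np.uint8)
obj = np.full((50, 40, 3), 200, dtype=np.uint8)
mask = np.ones((50, 40), dtype=np.uint8)
scene, box = paste_instance(bg, obj, mask, top=100, left=150)
```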
Self-Supervised Learning of Pretext-Invariant Representations
The goal of self-supervised learning from images is to construct image representations that are semantically meaningful via pretext tasks that do not require semantic annotations. Many pretext tasks …
  • 65
  • 13
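The sketch below conveys the pretext-invariance objective with a simplified in-batch contrastive loss; the paper itself uses a memory bank and noise-contrastive estimation, so treat this only as an illustration of the idea:

```python
import torch
import torch.nn.functional as F

def invariance_contrastive_loss(z_img, z_transformed, temperature=0.1):
    """Simplified contrastive objective in the spirit of pretext invariance.

    z_img:         (N, D) embeddings of the original images
    z_transformed: (N, D) embeddings of their pretext-transformed versions
                   (e.g. jigsaw-shuffled patches re-encoded)

    Each image's embedding should be close to that of its own transformed
    copy and far from the other images in the batch.
    """
    z_img = F.normalize(z_img, dim=1)
    z_transformed = F.normalize(z_transformed, dim=1)
    logits = z_img @ z_transformed.t() / temperature   # (N, N) similarities
    targets = torch.arange(z_img.size(0))               # positives on the diagonal
    return F.cross_entropy(logits, targets)

loss = invariance_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```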
From Red Wine to Red Tomato: Composition with Context
Compositionality and contextuality are key building blocks of intelligence. They allow us to compose known concepts to generate new and complex ones. However, traditional learning methods do not …
  • 71
  • 12
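One way to picture composition, assuming pretrained linear classifiers for the primitive concepts: a small transformation network maps the primitive weight vectors to a classifier for the composite, so the same modifier can behave differently depending on what it modifies. Names and dimensions below are illustrative:

```python
import torch
import torch.nn as nn

class CompositionNet(nn.Module):
    """Produce a classifier for a (modifier, noun) composite, e.g. "red wine",
    from the weight vectors of its primitive classifiers.

    Inputs are assumed to be pretrained linear-classifier weights for the
    modifier ("red") and the noun ("wine"); the output is the weight vector
    of a new linear classifier for the composite concept.
    """
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, 1024), nn.ReLU(),
            nn.Linear(1024, dim))

    def forward(self, w_modifier, w_noun):
        return self.net(torch.cat([w_modifier, w_noun], dim=-1))

# Score an image feature with the composed classifier.
composer = CompositionNet()
w_red, w_wine = torch.randn(512), torch.randn(512)
w_red_wine = composer(w_red, w_wine)
image_feat = torch.randn(512)
score = image_feat @ w_red_wine
```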
Scaling and Benchmarking Self-Supervised Visual Representation Learning
Self-supervised learning aims to learn representations from the data itself without explicit manual supervision. Existing efforts ignore a crucial aspect of self-supervised learning - the ability to …
  • 65
  • 8
Unsupervised Learning using Sequential Verification for Action Recognition
In this paper, we consider the problem of learning a visual representation from the raw spatiotemporal signals in videos for use in action recognition. Our representation is learned without …
  • 44
  • 8
Seeing through the Human Reporting Bias: Visual Classifiers from Noisy Human-Centric Labels
When human annotators are given a choice about what to label in an image, they apply their own subjective judgments on what to ignore and what to mention. We refer to these noisy "human-centric" …
  • 104
  • 4
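A simplified sketch of the decoupling idea, assuming a two-head model whose product is matched to the noisy human label (this is not the paper's exact parameterisation):

```python
import torch
import torch.nn as nn

class ReportingBiasModel(nn.Module):
    """Decouple "is it visually present?" from "would a human mention it?".

    Two heads share an image backbone: one predicts visual presence of a
    concept, the other predicts whether an annotator would report it given
    that it is present. Their product is trained against the noisy human
    label; at test time only the visual-presence head is used.
    """
    def __init__(self, feat_dim=2048, num_concepts=1000):
        super().__init__()
        self.visual_head = nn.Linear(feat_dim, num_concepts)
        self.relevance_head = nn.Linear(feat_dim, num_concepts)

    def forward(self, feats):
        p_visual = torch.sigmoid(self.visual_head(feats))
        p_mentioned_given_visual = torch.sigmoid(self.relevance_head(feats))
        p_human_label = p_visual * p_mentioned_given_visual
        return p_visual, p_human_label

model = ReportingBiasModel()
feats = torch.randn(4, 2048)                 # backbone features (illustrative)
p_visual, p_label = model(feats)
# Train with BCE between p_label and the noisy human-centric annotations.
```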