• Publications
  • Influence
Skip-Thought Vectors
We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the…
  • 1,500
  • 232
Scene Parsing through ADE20K Dataset
Scene parsing, or recognizing and segmenting objects and stuff in an image, is one of the key problems in computer vision. Despite the community's efforts in data collection, there are still few image…
  • 606
  • 130
Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books
Books are a rich source of both fine-grained information, how a character, an object or a scene looks like, as well as high-level semantics, what someone is thinking, feeling and how these states…
  • 726
  • 105
The Role of Context for Object Detection and Semantic Segmentation in the Wild
In this paper we study the role of context in existing state-of-the-art detection and segmentation approaches. Towards this goal, we label every pixel of PASCAL VOC 2010 detection challenge with a…
  • 624
  • 104
Detect What You Can: Detecting and Representing Objects Using Holistic Models and Body Parts
Detecting objects becomes difficult when we need to deal with large shape deformation, occlusion and low resolution. We propose a novel approach to i) handle large deformations and partial occlusions…
  • 331
  • 74
Monocular 3D Object Detection for Autonomous Driving
The goal of this paper is to perform 3D object detection from a single monocular image in the domain of autonomous driving. Our method first aims to generate a set of candidate class-specific object…
  • 366
  • 66
Order-Embeddings of Images and Language
Hypernymy, textual entailment, and image captioning can be seen as special cases of a single visual-semantic hierarchy over words, sentences, and images. In this paper we advocate for explicitly…
  • 338
  • 64
VSE++: Improving Visual-Semantic Embeddings with Hard Negatives
We present a new technique for learning visual-semantic embeddings for cross-modal retrieval. Inspired by the use of hard negatives in structured prediction, and ranking loss functions used in…
  • 236
  • 58
MovieQA: Understanding Stories in Movies through Question-Answering
We introduce the MovieQA dataset which aims to evaluate automatic story comprehension from both video and text. The dataset consists of 14,944 questions about 408 movies with high semantic diversity.
  • 316
  • 53
Semantic Understanding of Scenes Through the ADE20K Dataset
Semantic understanding of visual scenes is one of the holy grails of computer vision. Despite efforts of the community in data collection, there are still few image datasets covering a wide range of…
  • 321
  • 49