Skip-Thought Vectors
We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the…
Scene Parsing through ADE20K Dataset
The ADE20K dataset, spanning diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts, is introduced, and it is shown that the trained scene parsing networks can lead to applications such as image content removal and scene synthesis.
Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books
To align movies and books, this work proposes a neural sentence embedding trained in an unsupervised way from a large corpus of books, as well as a video-text neural embedding for computing similarities between movie clips and sentences in the book.
The Role of Context for Object Detection and Semantic Segmentation in the Wild
A novel deformable part-based model is proposed that exploits both local context around each candidate detection and global context at the level of the scene, which significantly helps in detecting objects at all scales.
VSE++: Improving Visual-Semantic Embeddings with Hard Negatives
A simple change to common loss functions used for multi-modal embeddings is introduced. Inspired by hard negative mining, the use of hard negatives in structured prediction, and ranking loss functions, it yields significant gains in retrieval performance.
Detect What You Can: Detecting and Representing Objects Using Holistic Models and Body Parts
This work proposes a novel approach to handle large deformations and partial occlusions in animals in terms of body parts, applies it to the six animal categories in the PASCAL VOC dataset, and shows that it significantly improves on the state of the art (by 4.1% AP) and provides a richer representation for objects.
Monocular 3D Object Detection for Autonomous Driving
This work proposes an energy minimization approach that places object candidates in 3D using the fact that objects should be on the ground-plane, and achieves the best detection performance on the challenging KITTI benchmark, among published monocular competitors.
Semantic Understanding of Scenes Through the ADE20K Dataset
This work presents ADE20K, a densely annotated dataset that spans diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts, and shows that networks trained on this dataset are able to segment a wide variety of scenes and objects.
Order-Embeddings of Images and Language
A general method for learning ordered representations is introduced, and it is shown that the resulting representations improve performance over current approaches for hypernym prediction and image-caption retrieval.
Scaling Egocentric Vision: The EPIC-KITCHENS Dataset
This paper introduces EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen environments. The participants narrated their own videos after recording, thus reflecting true intention, and ground truths were crowd-sourced based on these narrations.