Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
TLDR
The Visual Genome dataset is presented. It contains over 108K images, each with an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects, and represents the densest and largest dataset of image descriptions, objects, attributes, relationships, and question-answer pairs.
Scene Graph Generation by Iterative Message Passing
TLDR
This work explicitly models objects and their relationships using scene graphs, a visually grounded graphical structure of an image, and proposes a novel end-to-end model that generates such a structured scene representation from an input image.
DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion
TLDR
DenseFusion is a generic framework for estimating the 6D pose of a set of known objects from RGB-D images. It processes the two data sources individually and uses a novel dense fusion network to extract a pixel-wise dense feature embedding, from which the pose is estimated.
Visual7W: Grounded Question Answering in Images
TLDR
This work establishes a semantic link between textual descriptions and image regions through object-level grounding, which enables a new type of QA with visual answers in addition to the textual answers used in previous work, and proposes a novel LSTM model with spatial attention to tackle the 7W QA tasks.
AI2-THOR: An Interactive 3D Environment for Visual AI
TLDR
AI2-THOR consists of near photo-realistic 3D indoor scenes in which AI agents can navigate and interact with objects to perform tasks, facilitating the building of visually intelligent models.
Target-driven visual navigation in indoor scenes using deep reinforcement learning
TLDR
This paper proposes an actor-critic model whose policy is a function of the goal as well as the current state, which allows better generalization, and introduces the AI2-THOR framework, which provides an environment with high-quality 3D scenes and a physics engine.
Reasoning about Object Affordances in a Knowledge Base Representation
TLDR
This work learns a knowledge base (KB) using a Markov Logic Network (MLN) and shows that a diverse set of visual inference tasks can be done in this unified framework without training separate classifiers, including zero-shot affordance prediction and object recognition given human poses.
Reinforcement and Imitation Learning for Diverse Visuomotor Skills
TLDR
This work proposes a model-free deep reinforcement learning method that leverages a small amount of demonstration data to assist a reinforcement learning agent and trains end-to-end visuomotor policies that map directly from RGB camera inputs to joint velocities.
Neural Task Programming: Learning to Generalize Across Hierarchical Tasks
TLDR
This work presents a novel robot learning framework called Neural Task Programming (NTP), which bridges the ideas of few-shot learning from demonstration and neural program induction and achieves strong generalization across sequential tasks that exhibit hierarchical and compositional structures.
SURREAL: Open-Source Reinforcement Learning Framework and Robot Manipulation Benchmark
TLDR
SURREAL, an open-source scalable framework that supports state-of-the-art distributed reinforcement learning algorithms, is introduced, and it is demonstrated that SURREAL algorithms outperform existing open-source implementations in both agent performance and learning efficiency.