AI2-THOR: An Interactive 3D Environment for Visual AI
TLDR
AI2-THOR consists of near photo-realistic 3D indoor scenes in which AI agents can navigate and interact with objects to perform tasks, with the goal of facilitating the building of visually intelligent models.
ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
TLDR
It is shown that a baseline model based on recent embodied vision-and-language tasks performs poorly on ALFRED, suggesting that there is significant room for developing innovative grounded visual language understanding models with this benchmark.
IQA: Visual Question Answering in Interactive Environments
TLDR
The Hierarchical Interactive Memory Network (HIMN) is proposed. It consists of a factorized set of controllers that allow the system to operate at multiple levels of temporal abstraction, and it outperforms popular single-controller methods on IQUAD V1.
Re$^3$: Real-Time Recurrent Regression Networks for Visual Tracking of Generic Objects
TLDR
This paper presents a real-time deep object tracker capable of incorporating temporal information into its model, and shows that the method handles temporary occlusion better than other comparable trackers, using experiments that directly measure performance on sequences with occlusion.
Visual Semantic Planning Using Deep Successor Representations
TLDR
This work addresses the problem of visual semantic planning: the task of predicting a sequence of actions from visual observations that transform a dynamic environment from an initial state to a goal state, and develops a deep predictive model based on successor representations.
Watching the World Go By: Representation Learning from Unlabeled Videos
TLDR
Video Noise Contrastive Estimation is proposed, a method for using unlabeled video to learn strong, transferable single image representations that demonstrate improvements over recent unsupervised single image techniques, as well as over fully supervised ImageNet pretraining, across a variety of temporal and non-temporal tasks.
Re$^3$: Real-Time Recurrent Regression Networks for Object Tracking
TLDR
Re$^3$ is presented, a real-time deep object tracker capable of incorporating long-term temporal information into its model, using a recurrent neural network to represent the appearance and motion of the object.
SplitNet: Sim2Sim and Task2Task Transfer for Embodied Visual Navigation
We propose SplitNet, a method for decoupling visual perception and policy learning. By incorporating auxiliary tasks and selective learning of portions of the model, we explicitly decompose the
Shifting the Baseline: Single Modality Performance on Visual Navigation & QA
TLDR
It is argued that unimodal approaches better capture and reflect dataset biases and therefore provide an important comparison when assessing the performance of multimodal techniques.
Collaborative rephotography
TLDR
The ability of smartphone apps to guide a user to the correct viewpoint is promoted by enabling collaborative projects that allow multiple users to re-photograph multiple sites over time.