• Publications
Are You Smarter Than a Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension
TLDR
We introduce the task of Multi-Modal Machine Comprehension (M3C), which aims at answering multimodal questions given a context of text, diagrams and images.
  • 104
  • 16
RoboTHOR: An Open Simulation-to-Real Embodied AI Platform
TLDR
We introduce RoboTHOR to democratize research in interactive and embodied visual AI.
  • 15
  • 3
Imagine This! Scripts to Compositions to Videos
TLDR
We present the Composition, Retrieval, and Fusion Network (CRAFT), a model capable of learning this knowledge from video-caption data and applying it while generating videos from novel captions.
  • 21
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers
TLDR
We introduce X-LXMERT, an extension to LXMERT with training refinements including discretizing visual representations, using uniform masking with a large range of masking ratios, and aligning the right pre-training datasets to the right objectives, which enables it to paint.
  • 8
Artificial Agents Learn Flexible Visual Representations by Playing a Hiding Game
TLDR
We show that embodied adversarial reinforcement learning agents playing cache, a variant of hide-and-seek, in a high-fidelity, interactive environment learn representations of their observations that encode information such as occlusion, object permanence, free space, and containment. These representations are on par with those learnt by the most popular modern paradigm for visual representation learning, which requires large datasets independently labeled for each new task.
  • 7
LEARNING GENERALIZABLE VISUAL REPRESENTA-
A growing body of research suggests that embodied gameplay, prevalent not just in human cultures but across a variety of animal species including turtles and ravens, is critical in developing the ...