Are You Smarter Than a Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension
TLDR
The task of Multi-Modal Machine Comprehension (M3C), which aims at answering multimodal questions given a context of text, diagrams and images, is introduced and state-of-the-art methods for textual machine comprehension and visual question answering are extended to the TQA dataset.
RoboTHOR: An Open Simulation-to-Real Embodied AI Platform
TLDR
RoboTHOR offers a framework of simulated environments paired with physical counterparts to systematically explore and overcome the challenges of simulation-to-real transfer, and a platform where researchers across the globe can remotely test their embodied models in the physical world.
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers
TLDR
X-LXMERT is introduced as an extension to LXMERT with training refinements that enable it to paint: discretizing visual representations, using uniform masking with a large range of masking ratios, and aligning the right pre-training datasets to the right objectives.
Imagine This! Scripts to Compositions to Videos
TLDR
This work presents the Composition, Retrieval, and Fusion Network (CRAFT), a model capable of learning knowledge from video-caption data and applying it while generating videos from novel captions, and evaluates CRAFT on semantic fidelity to caption, composition consistency, and visual quality.
Learning Generalizable Visual Representations via Interactive Gameplay
TLDR
This work is the first to show that embodied adversarial reinforcement learning agents playing cache, a variant of hide-and-seek, in a high-fidelity, interactive environment learn representations of their observations that encode information such as occlusion, object permanence, free space, and containment.
Iconary: A Pictionary-Based Game for Testing Multimodal Communication with Drawings and Text
Communicating with humans is challenging for AIs because it requires a shared understanding of the world, complex semantics (e.g., metaphors or analogies), and at times multimodal gestures (e.g., …