• Publications
  • Influence
Are You Smarter Than a Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension
The task of Multi-Modal Machine Comprehension (M3C), which aims at answering multimodal questions given a context of text, diagrams and images, is introduced and state-of-the-art methods for textual machine comprehension and visual question answering are extended to the TQA dataset.
RoboTHOR: An Open Simulation-to-Real Embodied AI Platform
RoboTHOR offers a framework of simulated environments paired with physical counterparts to systematically explore and overcome the challenges of simulation-to-real transfer, and a platform where researchers across the globe can remotely test their embodied models in the physical world.
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers
X-LXMERT is introduced, an extension to LXMERT with training refinements including: discretizing visual representations, using uniform masking with a large range of masking ratios and aligning the right pre-training datasets to the right objectives which enables it to paint.
Imagine This! Scripts to Compositions to Videos
This work presents the Composition, Retrieval, and Fusion Network (CRAFT), a model capable of learning knowledge from video-caption data and applying it while generating videos from novel captions, and evaluates CRAFT on semantic fidelity to caption, composition consistency, and visual quality.
Artificial Agents Learn Flexible Visual Representations by Playing a Hiding Game
This work is the first to show that embodied adversarial reinforcement learning agents playing cache, a variant of hide-and-seek, in a high fidelity, interactive, environment, learn representations of their observations encoding information such as occlusion, object permanence, free space, and containment.
Iconary: A Pictionary-Based Game for Testing Multimodal Communication with Drawings and Text
This work proposes models to play Iconary, a collaborative game of drawing and guessing based on Pictionary, that poses a novel challenge for the research community and proposes models that are skillful players and able to employ world knowledge in language models toplay with words unseen during training.