ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a
Graph R-CNN for Scene Graph Generation
This work proposes Graph R-CNN, a scene graph generation model that is both effective and efficient at detecting objects and their relations in images, and introduces a new evaluation metric that is more holistic and realistic than existing metrics.
Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning
This work poses a cooperative ‘image guessing’ game between two agents, Q-BOT and A-BOT, who communicate in natural language dialog so that Q-BOT can select an unseen image from a lineup of images, and shows the emergence of grounded language and communication among ‘visual’ dialog agents with no human supervision.
Embodied Question Answering
A new AI task where an agent is spawned at a random location in a 3D environment and asked a question ('What color is the car?'), and the agent must first intelligently navigate to explore the environment, gather necessary visual information through first-person (egocentric) vision, and then answer the question.
Lending A Hand: Detecting Hands and Recognizing Activities in Complex Egocentric Interactions
This work develops methods to locate and distinguish between hands in egocentric video using strong appearance models with Convolutional Neural Networks, and introduces a simple candidate region generation approach that outperforms existing techniques at a fraction of the computational cost.
Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models
Diverse Beam Search is proposed, an alternative to beam search (BS) that decodes a list of diverse outputs by optimizing a diversity-augmented objective; it consistently outperforms BS and previously proposed techniques for diverse decoding from neural sequence models.
Overcoming Language Priors in Visual Question Answering with Adversarial Regularization
This work introduces a question-only model that takes as input the question encoding from the VQA model and must leverage language biases in order to succeed, and poses training as an adversarial game between the VQA model and this question-only adversary, discouraging the VQA model from capturing language bias in its question encoding.
Stochastic Multiple Choice Learning for Training Diverse Deep Ensembles
This work poses the task of producing multiple outputs as a learning problem over an ensemble of deep networks -- introducing a novel stochastic gradient descent based approach to minimize the loss with respect to an oracle.
Diverse Beam Search for Improved Description of Complex Scenes
Diverse Beam Search is proposed, a diversity-promoting alternative to beam search (BS) for approximate inference that produces sequences that are significantly different from each other by incorporating diversity constraints within groups of candidate sequences during decoding; moreover, it achieves this with minimal computational or memory overhead.
Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded
This work proposes a generic approach called Human Importance-aware Network Tuning (HINT), which effectively leverages human demonstrations to improve visual grounding and encourages deep networks to be sensitive to the same input regions as humans.