• Publications
VQA: Visual Question Answering
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.
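The task's input/output interface is easy to probe in code. As a minimal sketch (using an off-the-shelf ViLT model through the Hugging Face transformers pipeline, not the models from this paper; the image path and question are placeholder examples):

    from transformers import pipeline

    # Off-the-shelf VQA model, used only to illustrate the task interface:
    # (image, natural language question) -> short natural language answer.
    vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

    # Placeholder inputs for illustration.
    result = vqa(image="street_scene.jpg", question="How many people are in the picture?")
    print(result[0]["answer"], result[0]["score"])

Because open-ended answers are typically short phrases, a predicted answer can be evaluated by comparing it against multiple human-provided answers.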
Zero-Shot Learning via Visual Abstraction
TLDR
This paper proposes visual abstraction as a new modality for zero-shot learning (ZSL), using it to learn difficult-to-describe concepts about people and their interactions with others, and learns an explicit mapping between the abstract and real worlds.
We are Humor Beings: Understanding and Predicting Visual Humor
TLDR
This work analyzes the humor manifested in abstract scenes, designs computational models for it, and models two tasks that are believed to demonstrate an understanding of some aspects of visual humor.
Resolving Language and Vision Ambiguities Together: Joint Segmentation & Prepositional Attachment Resolution in Captioned Scenes
TLDR
This work presents an approach to simultaneously perform semantic segmentation and prepositional phrase attachment resolution for captioned images and shows that joint reasoning produces more accurate results than any module operating in isolation.
Measuring Machine Intelligence Through Visual Question Answering
TLDR
A case study explores the recently popular task of image captioning and its limitations as a task for measuring machine intelligence, and argues for visual question answering as an alternative, more promising task that tests a machine's ability to reason about language and vision.
A New Approach for Measuring Terrain Profiles
This paper presents a new approach for measuring terrain profiles. The proposed approach uses RGB-D sensors to measure the terrain surface relative to the vehicle. Since the RGB-D sensor is an area …
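As a rough illustration of the geometry involved, rather than the authors' method, the sketch below back-projects a single RGB-D depth pixel into a (forward distance, height) point in the vehicle frame; the pinhole intrinsics and the sensor mounting height and pitch are assumed parameters.

    import numpy as np

    def depth_pixel_to_profile(v, z, fy, cy, cam_height, cam_pitch):
        # Back-project one depth pixel (image row v, depth z) into a point
        # (forward distance, terrain height) in the vehicle frame, ignoring the
        # lateral offset so that one image column yields a profile straight ahead.
        # fy, cy: pinhole intrinsics; cam_height: sensor height above ground;
        # cam_pitch: downward mounting pitch in radians. All are assumed values.
        y_c = (v - cy) * z / fy          # camera frame: y points down, z forward
        z_c = z
        forward = z_c * np.cos(cam_pitch) - y_c * np.sin(cam_pitch)
        height = cam_height - (y_c * np.cos(cam_pitch) + z_c * np.sin(cam_pitch))
        return forward, height

Sweeping v over one column of the depth image then yields a terrain profile ahead of the vehicle.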
Internship: probing joint vision-and-language representations
TLDR
This internship is focused on exploring representations trained to perform multimodal processing involving images and text, including visual question answering, visual dialog, image captioning, and text understanding.
Grid-Based Scan-to-Map Matching for Accurate Simultaneous Localization and Mapping: Theory and Preliminary Numerical Study
TLDR
Experimental results first show that the proposed technique successfully matches new scans to the map, generating very small position and orientation errors, and then demonstrate the effectiveness of the multi-ND representation in comparison to the single-ND representation.
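For intuition about the normal-distributions (ND) grid representation that this kind of scan-to-map matching scores scans against, here is a minimal sketch; the function names, cell size, minimum point count, and regularization are illustrative assumptions, not the paper's implementation.

    import numpy as np
    from collections import defaultdict

    def build_nd_grid(points, cell_size=1.0, min_points=3):
        # Fit one Gaussian (mean, covariance) per occupied grid cell of 2D scan
        # points: the normal-distributions map that new scans are matched against.
        cells = defaultdict(list)
        for p in points:
            key = (int(np.floor(p[0] / cell_size)), int(np.floor(p[1] / cell_size)))
            cells[key].append(p)
        grid = {}
        for key, pts in cells.items():
            if len(pts) < min_points:
                continue  # too few points to estimate a covariance reliably
            pts = np.asarray(pts, dtype=float)
            mean = pts.mean(axis=0)
            cov = np.cov(pts.T) + 1e-6 * np.eye(2)  # regularize degenerate cells
            grid[key] = (mean, cov)
        return grid

    def score_scan(grid, scan_points, cell_size=1.0):
        # Score an already-transformed scan against the ND map: sum of Gaussian
        # likelihood terms of each point under its cell's distribution.
        score = 0.0
        for p in scan_points:
            key = (int(np.floor(p[0] / cell_size)), int(np.floor(p[1] / cell_size)))
            if key in grid:
                mean, cov = grid[key]
                d = np.asarray(p, dtype=float) - mean
                score += np.exp(-0.5 * d @ np.linalg.solve(cov, d))
        return score

Scan-to-map matching then amounts to searching over candidate position and orientation offsets of the new scan and keeping the transform that maximizes this score.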