Human Attention in Visual Question Answering: Do Humans and Deep Networks look at the same regions?
TLDR
The VQA-HAT (Human ATtention) dataset is introduced, and attention maps generated by state-of-the-art VQA models are evaluated against human attention both qualitatively and quantitatively.
nocaps: novel object captioning at scale
TLDR
This work presents the first large-scale benchmark for novel object captioning at scale, ‘nocaps’, consisting of 166,100 human-generated captions describing 15,100 images from the Open Images validation and test sets, and provides analysis to guide future work.
Spatially Aware Multimodal Transformers for TextVQA
TLDR
A novel spatially aware self-attention layer is proposed such that each visual entity only looks at neighboring entities defined by a spatial graph, and each head in this multi-head self-attention layer focuses on a different subset of relations.
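The masking idea in this TLDR can be illustrated with a minimal sketch, assuming a toy setup: the function name, shapes, and the use of one boolean relation mask per head are hypothetical, and learned query/key/value projections are omitted for brevity (the paper's actual architecture is more involved).

```python
import numpy as np

def spatially_masked_attention(x, rel_masks):
    """Toy sketch: each head attends only over one spatial relation.

    x         : (n, d) array of visual-entity features (hypothetical shapes)
    rel_masks : list of (n, n) boolean masks, one per head;
                mask[i, j] = True means entity i may attend to entity j.
    """
    n, d = x.shape
    heads = []
    for mask in rel_masks:
        # Plain scaled dot-product scores (no learned projections here).
        scores = x @ x.T / np.sqrt(d)
        # Restrict this head to its own subset of spatial relations.
        scores = np.where(mask, scores, -1e9)
        # Row-wise softmax over the permitted neighbors.
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)
        heads.append(weights @ x)
    # Concatenate head outputs, as in standard multi-head attention.
    return np.concatenate(heads, axis=1)
```

With an identity mask, a head reduces to passing each entity's own features through, which is a quick sanity check that the masking behaves as intended.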
Object-Proposal Evaluation Protocol is ‘Gameable’
TLDR
It is argued that the choice of using a partially annotated dataset for evaluation of object proposals is problematic, and the evaluation protocol is 'gameable', in the sense that progress under this protocol does not necessarily correspond to a "better" category-independent object proposal algorithm.
Sort Story: Sorting Jumbled Images and Captions into Stories
TLDR
This work proposes the task of sequencing: given a jumbled set of aligned image-caption pairs that belong to a story, sort them such that the output sequence forms a coherent story. Multiple approaches are presented, via unary and pairwise predictions and their ensemble-based combinations, achieving strong results.
EvalAI: Towards Better Evaluation Systems for AI Agents
TLDR
EvalAI is built to provide a scalable solution to the research community to fulfill the critical need of evaluating machine learning models and agents acting in an environment against annotations or with a human-in-the-loop.
Sequential Latent Spaces for Modeling the Intention During Diverse Image Captioning
TLDR
This work proposes Seq-CVAE, which learns a latent space for every word and encourages this temporal latent space to capture the 'intention' of how to complete the sentence by mimicking a representation that summarizes the future.
CloudCV: Large-Scale Distributed Computer Vision as a Cloud Service
TLDR
The goal is to democratize computer vision; one should not have to be a computer vision, big data, and distributed computing expert to have access to state-of-the-art distributed computer vision algorithms.
Fabrik: An Online Collaborative Neural Network Editor
TLDR
Fabrik provides a simple and intuitive GUI to import neural networks written in popular deep learning frameworks such as Caffe, Keras, and TensorFlow, and allows users to interact with, build, and edit models via simple drag and drop.