• Publications
Spatially Aware Multimodal Transformers for TextVQA
A novel spatially aware self-attention layer in which each visual entity attends only to neighboring entities defined by a spatial graph, and each head of the multi-head self-attention layer focuses on a different subset of relations.
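The masking idea in this summary can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the relation labels (`"self"`, `"left"`, `"right"`) and the helper names `head_mask` and `masked_attention` are hypothetical, and learned projections are omitted.

```python
import math

NEG_INF = float("-inf")

def head_mask(relations, head_relations):
    """allowed[i][j] is True when the relation from entity i to entity j
    belongs to the subset handled by this head (labels are hypothetical)."""
    return [[rel in head_relations for rel in row] for row in relations]

def masked_attention(queries, keys, values, allowed):
    """Scaled dot-product attention where entity i may only attend to
    entities j with allowed[i][j] == True."""
    dim = len(queries[0])
    out_dim = len(values[0])
    outputs = []
    for i, q in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            if allowed[i][j] else NEG_INF
            for j, k in enumerate(keys)
        ]
        top = max(scores)
        if top == NEG_INF:
            # No permitted neighbor: emit a zero vector for this entity.
            outputs.append([0.0] * out_dim)
            continue
        # Masked positions get weight exp(-inf) == 0.0.
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values)) / total
            for c in range(out_dim)
        ])
    return outputs

# Two entities; a head restricted to the "self" relation attends only to itself.
relations = [["self", "left"], ["right", "self"]]
features = [[1.0, 0.0], [0.0, 1.0]]
self_only = masked_attention(
    features, features, features, head_mask(relations, {"self"})
)
```

Giving each head a different `head_relations` subset is one way to make heads specialize in different spatial relations, which is the behavior the summary describes.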
Contrast and Classify: Alternate Training for Robust VQA
A novel training paradigm (ConCAT) that alternately optimizes cross-entropy and contrastive losses is proposed. It achieves higher consensus scores on the VQA-Rephrasings dataset and higher VQA accuracy on the VQA 2.0 dataset than existing approaches across a variety of data augmentation strategies.
ICLR Reproducibility Challenge Report (Padam : Closing The Generalization Gap Of Adaptive Gradient Methods in Training Deep Neural Networks)
A new optimization algorithm is designed that bridges the gap between adaptive gradient algorithms and SGD with momentum. It introduces a new tunable hyperparameter, the partially adaptive parameter p, which takes values in [0, 0.5].
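A partially adaptive update of this kind can be sketched as follows. This is a minimal sketch under stated assumptions, not the report's code: the function names, the default hyperparameters, and the small `eps` stabilizer are illustrative choices, and the running maximum of the second moment follows the AMSGrad convention.

```python
def new_state(n):
    """Zero-initialized optimizer state for n parameters."""
    return {"m": [0.0] * n, "v": [0.0] * n, "v_hat": [0.0] * n}

def padam_step(theta, grad, state, lr=0.1, beta1=0.9, beta2=0.999,
               p=0.125, eps=1e-8):
    """Apply one partially adaptive step to theta in place and return it.

    The step for each coordinate is lr * m / (v_hat ** p + eps), so p
    controls how strongly the second-moment estimate rescales the update.
    """
    if not 0.0 <= p <= 0.5:
        raise ValueError("p is expected to lie in [0, 0.5]")
    for i, g in enumerate(grad):
        # Exponential moving averages of the gradient and its square.
        state["m"][i] = beta1 * state["m"][i] + (1.0 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1.0 - beta2) * g * g
        # Keep the running maximum of the second moment (AMSGrad style).
        state["v_hat"][i] = max(state["v_hat"][i], state["v"][i])
        theta[i] -= lr * state["m"][i] / (state["v_hat"][i] ** p + eps)
    return theta

# Minimize f(x) = x^2 (gradient 2x) starting from x = 1.0.
params, state = [1.0], new_state(1)
for _ in range(200):
    padam_step(params, [2.0 * params[0]], state, p=0.0)
```

With p = 0 the denominator is 1 and the step reduces to SGD with momentum; with p = 0.5 the denominator is the square root of the second-moment estimate, as in fully adaptive methods. Values in between interpolate between the two.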
Contrast and Classify: Training Robust VQA Models
Recent Visual Question Answering (VQA) models have shown impressive performance on the VQA benchmark but remain sensitive to small linguistic variations in input questions. Existing approaches…
Automated Video Description for Blind and Low Vision Users
Results from a pilot study with eight blind video aficionados indicate the promise of this system for meeting needs for immediate access to videos and validate the efforts in developing tools in partnership with the individuals the authors aim to benefit.
Contrast and Classify: Training Robust VQA Models (Supplementary)
To determine whether the gradients of the two losses (L_SSC and L_CE) are aligned with each other during training, we follow the gradient surgery setup of [8] for multi-task learning. During…
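One common way to check whether two loss gradients are aligned is their cosine similarity, with a negative value usually read as a conflict. The sketch below assumes that convention; the function names and the example vectors are illustrative and are not taken from the supplementary material.

```python
import math

def cosine_similarity(g1, g2):
    """Cosine of the angle between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    norm1 = math.sqrt(sum(a * a for a in g1))
    norm2 = math.sqrt(sum(b * b for b in g2))
    if norm1 == 0.0 or norm2 == 0.0:
        # A zero gradient has no direction; report it as neutral.
        return 0.0
    return dot / (norm1 * norm2)

def gradients_conflict(g1, g2):
    """True when the two gradients point in opposing directions."""
    return cosine_similarity(g1, g2) < 0.0

# Hypothetical gradients of the two losses for the same parameters.
grad_contrastive = [1.0, 2.0]
grad_cross_entropy = [2.0, 4.0]
aligned = not gradients_conflict(grad_contrastive, grad_cross_entropy)
```

Tracking this value over training steps shows how often the two objectives pull the shared parameters in opposing directions.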
NarrationBot and InfoBot: A Hybrid System for Automated Video Description
Results from a mixed-methods study with 26 blind and low vision individuals show that the developed hybrid system significantly improved user comprehension and enjoyment of selected videos when both tools were used in tandem.