Publications
Zero-Shot Grounding of Objects From Natural Language Queries
TLDR
A new single-stage model, ZSGNet, combines the detector network and the grounding system, jointly predicting classification scores and box-regression parameters; it achieves state-of-the-art performance on Flickr30k and ReferIt under the usual "seen" settings and performs significantly better than the baselines in the zero-shot setting.
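For context, a minimal PyTorch-style sketch of the kind of single-stage grounding head described above: each anchor's visual features are fused with a query encoding, and a matching score plus box-regression parameters are predicted per anchor. The class, dimension, and variable names here are illustrative assumptions, not the ZSGNet release code.

```python
import torch
import torch.nn as nn

class SingleStageGroundingHead(nn.Module):
    """Toy sketch: for every anchor, fuse visual features with a query
    (sentence) encoding and predict a matching score plus box offsets."""

    def __init__(self, vis_dim=256, lang_dim=300, hidden=256):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(vis_dim + lang_dim, hidden), nn.ReLU())
        self.score = nn.Linear(hidden, 1)   # classification score per anchor
        self.delta = nn.Linear(hidden, 4)   # box-regression parameters per anchor

    def forward(self, anchor_feats, query_emb):
        # anchor_feats: (B, num_anchors, vis_dim); query_emb: (B, lang_dim)
        q = query_emb.unsqueeze(1).expand(-1, anchor_feats.size(1), -1)
        h = self.fuse(torch.cat([anchor_feats, q], dim=-1))
        return self.score(h).squeeze(-1), self.delta(h)

# Usage with random tensors:
head = SingleStageGroundingHead()
scores, deltas = head(torch.randn(2, 100, 256), torch.randn(2, 300))
print(scores.shape, deltas.shape)  # (2, 100) and (2, 100, 4)
```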
Gradient Based Memory Editing for Task-Free Continual Learning
TLDR
This paper proposes a principled approach to "edit" stored examples so that the replay memory carries more up-to-date information from the data stream: gradient updates are applied to the stored examples themselves so that they become more likely to be forgotten in future model updates.
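A rough sketch of the gradient-based editing idea, under stated assumptions: a look-ahead SGD step on the incoming data estimates how much a stored example would be forgotten, and the example is then nudged in the direction that increases that estimated forgetting. Function and argument names (edit_memory_example, edit_stride) are hypothetical, and the paper's regularization details are omitted.

```python
import copy
import torch
import torch.nn.functional as F

def edit_memory_example(model, x_mem, y_mem, x_new, y_new, lr=0.1, edit_stride=0.01):
    """Sketch of gradient-based memory editing (illustrative, not the paper's code)."""
    x_mem = x_mem.clone().requires_grad_(True)

    # Look-ahead copy of the model, updated by one SGD step on the incoming data.
    lookahead = copy.deepcopy(model)
    opt = torch.optim.SGD(lookahead.parameters(), lr=lr)
    opt.zero_grad()
    F.cross_entropy(lookahead(x_new), y_new).backward()
    opt.step()

    # Estimated forgetting = loss increase on the stored example after the look-ahead step.
    forgetting = (F.cross_entropy(lookahead(x_mem), y_mem)
                  - F.cross_entropy(model(x_mem), y_mem))
    grad_x, = torch.autograd.grad(forgetting, x_mem)

    # Edit: move the stored example toward higher estimated forgetting.
    return (x_mem + edit_stride * grad_x).detach()
```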
Video Object Grounding Using Semantic Roles in Language Description
TLDR
This work investigates the role of object relations in video object grounding (VOG) and proposes VOGNet, a framework that encodes multi-modal object relations via self-attention with relative position encoding, along with novel contrastive sampling methods that generate more challenging grounding inputs.
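Below is a minimal sketch of the kind of mechanism mentioned above: self-attention over object proposals with a relative-position bias derived from box geometry. The module name and the simple MLP over box deltas are assumptions; the actual VOGNet encoding differs in its details.

```python
import torch
import torch.nn as nn

class RelPosSelfAttention(nn.Module):
    """Sketch: self-attention over object features, biased by relative box positions."""

    def __init__(self, dim=256):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.pos_mlp = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
        self.scale = dim ** -0.5

    def forward(self, feats, boxes):
        # feats: (B, N, dim) object features; boxes: (B, N, 4) as (cx, cy, w, h)
        q, k, v = self.qkv(feats).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale        # (B, N, N)

        # Relative geometry between every pair of boxes -> scalar attention bias.
        delta = boxes.unsqueeze(2) - boxes.unsqueeze(1)      # (B, N, N, 4)
        attn = attn + self.pos_mlp(delta).squeeze(-1)

        return torch.softmax(attn, dim=-1) @ v
```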
Visually Grounded Continual Learning of Compositional Phrases
TLDR
This work introduces VisCOLL, a visually grounded language learning task that simulates the continual acquisition of compositional phrases from streaming visual scenes, and reveals that state-of-the-art continual learning approaches provide little to no improvement on it.
Visual Semantic Role Labeling for Video Understanding
TLDR
This work introduces the VidSitu benchmark, a large-scale video understanding data source of 29K 10-second movie clips richly annotated with a verb and semantic roles every 2 seconds; it provides a comprehensive analysis of the dataset relative to other publicly available video understanding benchmarks, presents several illustrative baselines, and evaluates a range of standard video recognition models.
Utilizing Every Image Object for Semi-supervised Phrase Grounding
TLDR
This paper proposes learned location and subject embedding predictors (LSEP) that generate the corresponding language embeddings for objects lacking annotated queries in the training set, and demonstrates that these predictors allow the grounding system to learn from objects without labeled queries and to improve accuracy.
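A small sketch of the idea under assumptions: a predictor maps an object's box location plus a subject (category) embedding to a pseudo language embedding, so objects without annotated queries can still supervise the grounding model. Names and dimensions are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class LocationSubjectEmbeddingPredictor(nn.Module):
    """Sketch: predict a pseudo query embedding from box location + subject embedding."""

    def __init__(self, subj_dim=300, lang_dim=300, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4 + subj_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, lang_dim))

    def forward(self, boxes, subject_emb):
        # boxes: (N, 4) normalized (x1, y1, x2, y2); subject_emb: (N, subj_dim)
        return self.net(torch.cat([boxes, subject_emb], dim=-1))

# Pseudo query embeddings for two unannotated objects:
pred = LocationSubjectEmbeddingPredictor()
pseudo_q = pred(torch.rand(2, 4), torch.randn(2, 300))  # shape (2, 300)
```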
Video Question Answering with Phrases via Semantic Roles
Video Question Answering (VidQA) evaluation metrics have been limited to a single-word answer or to selecting a phrase from a fixed set of phrases. These metrics limit the VidQA models' application …
Improving Object Detection and Attribute Recognition by Feature Entanglement Reduction
TLDR
This work uses a two-stream model in which category and attribute features are computed independently while the classification heads share Regions of Interest (RoIs), and shows significant improvements on VG-20, a subset of Visual Genome, on both supervised and attribute-transfer tasks.
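A toy sketch of the two-stream idea, with assumed names and dimensions: category and attribute features come from separate convolutional streams (reducing feature entanglement), while both classification heads pool from the same set of RoIs, here via torchvision's roi_align.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class TwoStreamRoIClassifier(nn.Module):
    """Sketch: independent category/attribute streams, shared RoIs for both heads."""

    def __init__(self, in_ch=3, feat_ch=64, n_categories=20, n_attributes=30):
        super().__init__()
        def stream():
            return nn.Sequential(nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU())
        self.cat_stream, self.att_stream = stream(), stream()
        self.cat_head = nn.Linear(feat_ch * 7 * 7, n_categories)
        self.att_head = nn.Linear(feat_ch * 7 * 7, n_attributes)

    def forward(self, images, rois):
        # images: (B, 3, H, W); rois: (K, 5) as (batch_idx, x1, y1, x2, y2)
        cat_pool = roi_align(self.cat_stream(images), rois, output_size=(7, 7))
        att_pool = roi_align(self.att_stream(images), rois, output_size=(7, 7))
        return (self.cat_head(cat_pool.flatten(1)),
                self.att_head(att_pool.flatten(1)))

# Usage: two RoIs from a single image
model = TwoStreamRoIClassifier()
imgs = torch.randn(1, 3, 64, 64)
rois = torch.tensor([[0, 0, 0, 32, 32], [0, 16, 16, 48, 48]], dtype=torch.float32)
cat_logits, att_logits = model(imgs, rois)
```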