Non-local Neural Networks
This paper presents non-local operations as a generic family of building blocks for capturing long-range dependencies in computer vision, and reports improved object detection/segmentation and pose estimation on the COCO suite of tasks.
Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding
This work proposes Hollywood in Homes, a novel crowdsourcing approach to data collection, and uses it to collect a new dataset, Charades, in which hundreds of people record videos in their own homes, acting out casual everyday activities. It also evaluates and provides baseline results for several tasks, including action recognition and automatic description generation.
Videos as Space-Time Region Graphs
The proposed graph representation achieves state-of-the-art results on the Charades and Something-Something datasets and obtains a huge gain when the model is applied in complex environments.
Zero-Shot Recognition via Semantic Embeddings and Knowledge Graphs
This paper builds upon the recently introduced Graph Convolutional Network (GCN) and proposes an approach that uses both semantic embeddings and categorical relationships to predict the classifiers, and shows that the approach is robust to noise in the knowledge graph (KG).
Unsupervised Learning of Visual Representations Using Videos
  • X. Wang, A. Gupta
  • Computer Science
    IEEE International Conference on Computer Vision…
  • 4 May 2015
A simple yet surprisingly powerful approach to unsupervised learning of CNNs that uses hundreds of thousands of unlabeled videos from the web to learn visual representations, training the CNN representation with a Siamese-triplet network and a ranking loss function.
A-Fast-RCNN: Hard Positive Generation via Adversary for Object Detection
This paper proposes to learn an adversarial network that generates examples with occlusions and deformations. The goal of the adversary is to generate examples that are difficult for the object detector to classify, and the original detector and the adversary are learned jointly.
Actions ~ Transformations
A novel representation for actions is proposed by modeling an action as a transformation that changes the environment from its state before the action (precondition) to its state after the action (effect).
Learning Correspondence From the Cycle-Consistency of Time
A self-supervised method that uses cycle-consistency in time as a free supervisory signal for learning visual representations from scratch. The representation is shown to generalize, without finetuning, across a range of visual correspondence tasks, including video object segmentation, keypoint tracking, and optical flow.
Designing deep networks for surface normal estimation
This paper proposes to build upon the decades of hard work in 3D scene understanding to design a new CNN architecture for the task of surface normal estimation, and shows that incorporating several constraints and meaningful intermediate representations in the architecture leads to state-of-the-art performance on surface normal estimation.
Visual Semantic Navigation using Scene Priors
This work proposes to use Graph Convolutional Networks to incorporate prior knowledge into a deep reinforcement learning framework, and shows that semantic knowledge significantly improves performance as well as generalization to unseen scenes and/or objects.
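
Several of the entries above build on Graph Convolutional Networks. As a rough illustration only, the sketch below shows one propagation step of the commonly used GCN formulation, ReLU(D^{-1/2}(A + I)D^{-1/2} H W), in plain Python. It is not taken from any of the listed papers: the toy graph, the feature values, the weight matrix, and the helper names (`normalize_adjacency`, `matmul`, `gcn_layer`) are all assumptions made for the example.

```python
import math


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square adjacency matrix."""
    n = len(adj)
    # Add self-loops so each node also keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    # Node degrees of the self-looped graph.
    inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    return [[inv_sqrt[i] * a_hat[i][j] * inv_sqrt[j] for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(adj, features, weights):
    """One propagation step: ReLU(A_norm @ H @ W)."""
    a_norm = normalize_adjacency(adj)
    hidden = matmul(matmul(a_norm, features), weights)
    return [[max(0.0, v) for v in row] for row in hidden]


# Hypothetical toy graph: three nodes in a chain 0 - 1 - 2.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
features = [[1.0, 0.0],
            [0.0, 1.0],
            [1.0, 1.0]]          # one 2-d feature vector per node
weights = [[1.0], [1.0]]         # hypothetical 2 -> 1 projection
out = gcn_layer(adj, features, weights)
print(out)
```

Each output row mixes a node's own features with those of its neighbours, weighted by the normalized adjacency. In practice the weights are learned, and a library implementation would replace the nested-list arithmetic.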