Abhinav Gupta

Learn More
This paper proposes a conceptually simple but surprisingly powerful method which combines the effectiveness of a discriminative object detector with the explicit correspondence offered by a nearest-neighbor approach. The method is based on training a separate linear SVM classifier for every exemplar in the training set. Each of these Exemplar-SVMs is thus(More)
The goal of this paper is to discover a set of discriminative patches which can serve as a fully unsupervised mid-level visual representation. The desired patches need to satisfy two requirements: 1) to be representative, they need to occur frequently enough in the visual world; 2) to be discriminative, they need to be different enough from the rest of the(More)
Interpretation of images and videos containing humans interacting with different objects is a daunting task. It involves understanding scene or event, analyzing human movements, recognizing manipulable objects, and observing the effect of the human movement on those objects. While each of these perceptual tasks can be conducted independently, recognition(More)
Recent work on mid-level visual representations aims to capture information at the level of complexity higher than typical “visual words”, but lower than full-blown semantic objects. Several approaches [5, 6, 12, 23] have been proposed to discover mid-level visual elements, that are both 1) representative, i.e., frequently occurring within a visual dataset,(More)
This work explores the use of spatial context as a source of free and plentiful supervisory signal for training a rich visual representation. Given only a large, unlabeled image collection, we extract random pairs of patches from each image and train a convolutional neural net to predict the position of the second patch relative to the first. We argue that(More)
We propose NEIL (Never Ending Image Learner), a computer program that runs 24 hours per day and 7 days per week to automatically extract visual knowledge from Internet data. NEIL uses a semi-supervised learning algorithm that jointly discovers common sense relationships (e.g., "Corolla is a kind of/looks similar to Car", "Wheel is a part of Car") and labels(More)
The Visual Object Tracking challenge 2015, VOT2015, aims at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance. Results of 62 trackers are presented. The number of tested trackers makes VOT 2015 the largest benchmark on shortterm tracking to date. For each participating tracker, a short description(More)
Since most current scene understanding approaches operate either on the 2D image or using a surface-based representation, they do not allow reasoning about the physical constraints within the 3D scene. Inspired by the “Blocks World” work in the 1960’s, we present a qualitative physical representation of an outdoor scene where objects have volume and mass,(More)
There has been a recent push in extraction of 3D spatial layout of scenes. However, none of these approaches model the 3D interaction between objects and the spatial layout. In this paper, we argue for a parametric representation of objects in 3D, which allows us to incorporate volumetric constraints of the physical world. We show that augmenting current(More)
The goal of this work is to find <i>visually similar</i> images even if they appear quite different at the raw pixel level. This task is particularly important for matching images across visual domains, such as photos taken over different seasons or lighting conditions, paintings, hand-drawn sketches, etc. We propose a surprisingly simple method that(More)