In this work we introduce a fully end-to-end approach for action detection in videos that learns to directly predict the temporal bounds of actions. Our intuition is that the process of detecting actions is naturally one of observation and refinement: observing moments in video, and refining hypotheses about when an action is occurring. Based on this…
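The observe-and-refine intuition above can be made concrete with a toy loop: glimpse one frame, update a recurrent hypothesis state, and re-predict the action's temporal bounds. The sketch below is illustrative only; random weights and features stand in for a trained model, all names are hypothetical, and it is not the paper's architecture or training procedure.

```python
# Toy observe-and-refine loop: glimpse a frame, update state, refine bounds.
import numpy as np

rng = np.random.default_rng(0)
T, D, H, STEPS = 300, 128, 64, 6           # frames, feature dim, hidden dim, glimpses
frames = rng.standard_normal((T, D))       # stand-in for per-frame features

W_h = rng.standard_normal((H, D + H)) * 0.01
W_b = rng.standard_normal((2, H)) * 0.01   # predicts (start, end) as fractions of T

h = np.zeros(H)
t = T // 2                                 # begin observing mid-video
for step in range(STEPS):
    x = frames[t]                                       # observe one frame
    h = np.tanh(W_h @ np.concatenate([x, h]))           # update hypothesis state
    start, end = np.sort(1 / (1 + np.exp(-(W_b @ h))))  # refined bounds in [0, 1]
    t = int(np.clip((start + end) / 2 * T, 0, T - 1))   # move the next glimpse
    print(f"step {step}: bounds ~ [{start*T:.0f}, {end*T:.0f}] frames")
```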
Every moment counts in action recognition. A comprehensive understanding of human activity in video requires labeling every frame according to the actions occurring, placing multiple labels densely over a video sequence. To study this problem we extend the existing THUMOS dataset and introduce MultiTHUMOS, a new dataset of dense labels over unconstrained…
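Dense labeling of the kind MultiTHUMOS targets is commonly cast as per-frame multi-label classification: each frame gets an independent binary decision per action class, so several actions can be active at once. A minimal sketch of that standard formulation, with random scores standing in for a model's outputs (not MultiTHUMOS code):

```python
# Per-frame multi-label classification with independent sigmoids per class.
import numpy as np

rng = np.random.default_rng(0)
T, C = 100, 5                                      # frames, action classes
logits = rng.standard_normal((T, C))               # stand-in per-frame scores
labels = (rng.random((T, C)) < 0.2).astype(float)  # dense multi-label targets

probs = 1 / (1 + np.exp(-logits))                  # independent sigmoid per class
bce = -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs)).mean()
print(f"per-frame multi-label BCE: {bce:.3f}")
print("actions active in frame 0:", np.nonzero(labels[0])[0])
```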
We propose a viewpoint invariant model for 3D human pose estimation from a single depth image. To achieve this, our discriminative model embeds local regions into a learned viewpoint invariant feature space. Formulated as a multi-task learning problem, our model is able to selectively predict partial poses in the presence of noise and occlusion. Our…
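One common way to selectively predict partial poses is to mask the loss so that only visible joints are scored, letting occluded regions contribute nothing. The snippet below sketches that masked-error idea under assumed shapes (15 joints, random data); it is not the paper's model.

```python
# Masked per-joint error: score only the joints marked visible.
import numpy as np

rng = np.random.default_rng(0)
J = 15                                 # body joints (illustrative count)
pred = rng.standard_normal((J, 3))     # predicted 3D joint positions
gt = rng.standard_normal((J, 3))       # ground-truth positions
visible = rng.random(J) > 0.3          # occlusion mask (partial pose)

err = np.linalg.norm(pred - gt, axis=1)     # per-joint Euclidean error
partial_loss = err[visible].mean()          # occluded joints contribute nothing
print(f"partial-pose error over {visible.sum()} visible joints: {partial_loss:.3f}")
```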
We consider the problem of grasping novel objects with a robotic arm. A recent successful technique applies machine learning to identify a point in an image corresponding to the most likely location at which to grasp an object. Another approach extends this method to accommodate grasps with multiple contact points. This paper proposes a novel approach that…
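In its simplest form, the grasp-point technique referenced here reduces to scoring image locations for graspability and taking the maximum. A toy version, with a random score map standing in for a learned predictor (a multi-contact extension would output several points instead of one):

```python
# Pick the highest-scoring pixel from a per-pixel grasp-quality map.
import numpy as np

rng = np.random.default_rng(0)
H, W = 48, 64
grasp_scores = rng.random((H, W))      # stand-in for a learned per-pixel score
y, x = np.unravel_index(np.argmax(grasp_scores), grasp_scores.shape)
print(f"most likely grasp point: pixel ({x}, {y}), score {grasp_scores[y, x]:.3f}")
```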
Understanding the simultaneously very diverse and intricately fine-grained set of possible human actions is a critical open problem in computer vision. Manually labeling training videos is feasible for some action classes but doesn’t scale to the full long-tailed distribution of actions. A promising way to address this is to leverage noisy data from web…