Publications
Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors
A unified implementation of the Faster R-CNN, R-FCN, and SSD systems is presented, and the speed/accuracy trade-off curve is traced out by swapping in alternative feature extractors and varying other critical parameters, such as image size, within each of these meta-architectures.
Learning to recognize objects in egocentric activities
The key to this approach is a robust, unsupervised bottom-up segmentation method that exploits the structure of the egocentric domain to partition each frame into hand, object, and background categories, and uses Multiple Instance Learning to match object instances across sequences.
Learning to Recognize Daily Actions Using Gaze
An inference method is presented that predicts the best sequence of gaze locations and the associated action label from an input sequence of images, demonstrating improvements in action recognition rates and gaze prediction accuracy over state-of-the-art methods.
Action recognition by learning mid-level motion features
  • A. Fathi, Greg Mori
  • Computer Science
    IEEE Conference on Computer Vision and Pattern…
  • 23 June 2008
A method for constructing mid-level motion features from low-level optical flow information is developed; the features are tuned to discriminate between different classes of action and are efficient to compute at run-time.
Understanding egocentric activities
This work presents a method to analyze daily activities using video from an egocentric camera, and shows that jointly modeling activities, actions, and objects leads to superior performance compared to modeling them independently.
Social interactions: A first-person perspective
Encouraging results are demonstrated for detecting and recognizing social interactions in first-person videos captured over multiple days of experience in amusement parks.
Tracking Emerges by Colorizing Videos
The natural temporal coherency of color is leveraged to create a model that learns to colorize grayscale videos by copying colors from a reference frame; in doing so, it learns to track well enough to outperform recent methods based on optical flow.
Learning to Predict Gaze in Egocentric Video
A model for gaze prediction in egocentric video is presented that leverages the implicit cues in the camera wearer's behavior and models the dynamics of gaze, in particular fixations, as latent variables to improve gaze prediction.
VideoSET: Video Summary Evaluation through Text
This paper presents VideoSET, a text-based method for Video Summary Evaluation that measures how well a video summary retains the semantic information contained in its original video.
Reasoning about Object Affordances in a Knowledge Base Representation
This work learns a knowledge base (KB) using a Markov Logic Network (MLN) and shows that a diverse set of visual inference tasks can be done in this unified framework without training separate classifiers, including zero-shot affordance prediction and object recognition given human poses.