Learn More
Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still(More)
We have seen great progress in basic perceptual tasks such as object recognition and detection. However, AI models still fail to match humans in high-level vision tasks due to the lack of capacities for deeper reasoning. Recently the new task of visual question answering (QA) has been proposed to evaluate a model's capacity for deep image understanding.(More)
Reasoning about objects and their affordances is a fundamental problem for visual intelligence. Most of the previous work casts this problem as a classification task where separate classifiers are trained to label objects, recognize attributes, or assign affordances. In this work, we consider the problem of object affordance reasoning using a knowledge base(More)
The complexity of the visual world creates significant challenges for comprehensive visual understanding. In spite of recent successes in visual recognition, a lack of common sense knowledge and the insufficiencies of joint reasoning still leave a large gap between computer vision models and their human counterparts for higher-level tasks such as question(More)
Class-level models capture relational statistics over object attributes and their connecting links, answering questions such as “what is the percentage of friendship pairs where both friends are women?” Class-level relationships are important in themselves, and they support applications like policy making, strategic planning, and query optimization. We(More)
Realistic videos of human actions exhibit rich spatiotem-poral structures at multiple levels of granularity: an action can always be decomposed into multiple finer-grained elements in both space and time. To capture this intuition, we propose to represent videos by a hierarchy of mid-level action elements (MAEs), where each MAE corresponds to an(More)
The complexity of the visual world creates significant challenges for comprehensive visual understanding. In spite of recent successes in visual recognition, today's vision systems would still struggle to deal with visual queries that require a deeper reasoning. We propose a knowledge base (KB) framework to handle an assortment of visual queries, without(More)
— Two less addressed issues of deep reinforcement learning are (1) lack of generalization capability to new target goals, and (2) data inefficiency i.e., the model requires several (and often costly) episodes of trial and error to converge, which makes it impractical to be applied to real-world scenarios. In this paper, we address these two issues and apply(More)
This paper presents an application of vision-based monitoring of long-term care facility residents. We develop an algorithm to detect events of interest, particularly falls by elderly residents. The algorithm uses a max-margin latent variable approach with spatio-temporal locations of the person in the video as latent variables. The recently developed(More)