Who Let the Dogs Out? Modeling Dog Behavior from Visual Data

@article{Ehsani2018WhoLT,
  title={Who Let the Dogs Out? Modeling Dog Behavior from Visual Data},
  author={Kiana Ehsani and Hessam Bagherinezhad and Joseph Redmon and Roozbeh Mottaghi and Ali Farhadi},
  journal={2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2018},
  pages={4051-4060}
}
We study the task of directly modelling a visually intelligent agent. […] Using this data we model how the dog acts and how the dog plans her movements. We show, under a variety of metrics, that given just visual input we can successfully model this intelligent agent in many situations. Moreover, the representation learned by our model encodes distinct information compared to representations trained on image classification, and our learned representation can generalize to other domains. In particular…
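
To make the setup above concrete, here is a minimal sketch of an acting model in the spirit of the abstract: a shared CNN encodes a pair of consecutive frames and a recurrent decoder emits a short sequence of discretized joint movements. The backbone choice, joint count, movement vocabulary, and prediction horizon are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
from torchvision import models


class DogActingModel(nn.Module):
    """Illustrative sketch: predict a sequence of discretized joint
    movements from a pair of consecutive video frames. Dimensions,
    joint count, and vocabulary size are assumptions, not the paper's."""

    def __init__(self, num_joints=4, num_classes=8, hidden=512, steps=5):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()            # 512-d frame features
        self.encoder = backbone
        self.rnn = nn.LSTM(input_size=2 * 512, hidden_size=hidden,
                           batch_first=True)
        self.head = nn.Linear(hidden, num_joints * num_classes)
        self.steps = steps
        self.num_joints = num_joints
        self.num_classes = num_classes

    def forward(self, frame_t, frame_t1):
        f = torch.cat([self.encoder(frame_t), self.encoder(frame_t1)], dim=1)
        # Feed the fused observation as input at every prediction step.
        seq = f.unsqueeze(1).repeat(1, self.steps, 1)
        out, _ = self.rnn(seq)
        logits = self.head(out)                # (B, steps, joints * classes)
        return logits.view(-1, self.steps, self.num_joints, self.num_classes)


model = DogActingModel()
frames = torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224)
print(model(*frames).shape)  # torch.Size([2, 5, 4, 8])
```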

What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions

Experiments show that the self-supervised representation that encodes interaction and attention cues outperforms MoCo, a visual-only state-of-the-art method, on a variety of target tasks: scene classification (semantic), action recognition (temporal), depth estimation (geometric), dynamics prediction (physics) and walkable surface estimation (affordance).

Through a Dog’s Eyes: fMRI Decoding of Naturalistic Videos from Dog Cortex

These results demonstrate the first known application of machine learning to decode naturalistic videos from the brain of a carnivore and suggest that the dog’s-eye view of the world may be quite different than the authors' own.

From Recognition to Cognition: Visual Commonsense Reasoning

To move towards cognition-level understanding, a new reasoning engine is presented, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning.

What is My Dog doing Now? Modeling Dog Behavior from Live Stream

This project explores video streams from a dog camera, applying computer vision techniques to better understand dogs' behavior when the owner is away from home, modeling dog postures as well as motions using image processing techniques.

PlaTe: Visually-Grounded Planning With Transformers in Procedural Tasks

This work addresses the problem of leveraging instructional videos to facilitate the understanding of human decision-making processes, focusing on training a model to plan a goal-directed procedure from real-world videos with the Planning Transformer (PlaTe), which has the advantage of circumventing the compounding prediction errors that single-step models accumulate during long model-based rollouts.
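
As a rough illustration of goal-conditioned planning from visual features (not PlaTe's actual architecture), the sketch below predicts the whole action sequence in one shot from start and goal features using learned per-step query tokens, which is one way to sidestep compounding single-step rollout errors. All sizes and the action vocabulary are placeholders.

```python
import torch
import torch.nn as nn


class GoalConditionedPlanner(nn.Module):
    """Generic sketch: plan a discrete action sequence from start and goal
    observation features with a small Transformer. Feature sizes and the
    action vocabulary are placeholders, not PlaTe's actual design."""

    def __init__(self, feat_dim=512, d_model=256, num_actions=48, horizon=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        self.queries = nn.Parameter(torch.randn(horizon, d_model))  # one per step
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_actions)

    def forward(self, start_feat, goal_feat):
        b = start_feat.size(0)
        context = self.proj(torch.stack([start_feat, goal_feat], dim=1))
        queries = self.queries.unsqueeze(0).expand(b, -1, -1)
        tokens = torch.cat([context, queries], dim=1)   # (B, 2 + horizon, d)
        encoded = self.encoder(tokens)
        return self.head(encoded[:, 2:])                # (B, horizon, actions)


planner = GoalConditionedPlanner()
logits = planner(torch.randn(3, 512), torch.randn(3, 512))
print(logits.argmax(-1).shape)  # torch.Size([3, 4]) -- one action id per step
```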

TAB-VCR: Tags and Attributes based Visual Commonsense Reasoning Baselines

A much simpler model, obtained by ablating and pruning the existing intricate baseline for the new visual commonsense reasoning (VCR) task, performs better with half the number of trainable parameters; this baseline is named TAB-VCR.

Who's a Good Boy? Reinforcing Canine Behavior in Real-Time using Machine Learning

This paper outlines the development methodology for an automatic dog treat dispenser that combines machine learning and embedded hardware to identify and reward dog behaviors in real time: positive actions are reinforced by running inference on a Jetson Nano and transmitting a signal to a servo motor that releases rewards from a treat delivery apparatus.
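
A hypothetical sketch of the control loop just described: classify the dog's behavior from camera frames and trigger a treat release when a target behavior is detected with enough confidence. The frame grabber, behavior model, servo routine, and label set are placeholders rather than the paper's code; on a Jetson Nano the servo would typically be driven through a PWM pin (for example via the Jetson.GPIO library).

```python
import time
import torch

# Placeholders: the label set, target behavior, and helper callables are
# illustrative, not the paper's actual implementation.
TARGET_BEHAVIOR = "sit"
LABELS = ["sit", "stand", "lie_down", "other"]


def release_treat():
    # On a Jetson Nano this would pulse a PWM pin to move the servo;
    # stubbed here for illustration.
    print("treat released")


def control_loop(model, grab_frame, threshold=0.9, cooldown_s=30.0):
    """Run inference on frames and reward the target behavior."""
    last_reward = 0.0
    model.eval()
    while True:
        frame = grab_frame()                       # (1, 3, H, W) tensor
        with torch.no_grad():
            probs = torch.softmax(model(frame), dim=1)[0]
        label = LABELS[int(probs.argmax())]
        if (label == TARGET_BEHAVIOR and probs.max() >= threshold
                and time.time() - last_reward > cooldown_s):
            release_treat()
            last_reward = time.time()
        time.sleep(0.1)
```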

On the Role of Event Boundaries in Egocentric Activity Recognition from Photostreams

This work provides insight into how automatically computed event boundaries can impact activity recognition results in the emerging domain of egocentric photostreams, and into the generalization capabilities of several deep-learning-based architectures to unseen users.

Procedure Planning in Instructional Videos

The experiments show that the proposed latent space planning is able to learn plannable semantic representations without explicit supervision, which enables sequential reasoning on real-world videos and leads to stronger generalization compared to existing planning approaches and neural network policies.
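
One way to picture planning in a learned latent space (a simplification, not the paper's exact procedure): train a forward model f(z, a) over latent states, then greedily pick, at each step, the discrete action whose predicted outcome lands closest to the goal embedding.

```python
import torch
import torch.nn as nn

# Minimal sketch of latent-space planning with a forward dynamics model.
# Dimensions, action count, and the greedy search are illustrative.


class LatentForwardModel(nn.Module):
    def __init__(self, z_dim=128, num_actions=16):
        super().__init__()
        self.action_emb = nn.Embedding(num_actions, z_dim)
        self.net = nn.Sequential(nn.Linear(2 * z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, z_dim))
        self.num_actions = num_actions

    def forward(self, z, a):
        return self.net(torch.cat([z, self.action_emb(a)], dim=-1))


def greedy_plan(model, z_start, z_goal, horizon=4):
    plan, z = [], z_start
    for _ in range(horizon):
        candidates = torch.arange(model.num_actions)
        z_next = model(z.expand(model.num_actions, -1), candidates)
        dists = (z_next - z_goal).norm(dim=-1)   # distance to the goal state
        best = int(dists.argmin())
        plan.append(best)
        z = z_next[best:best + 1]
    return plan


model = LatentForwardModel()
print(greedy_plan(model, torch.randn(1, 128), torch.randn(1, 128)))
```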

References

Showing 1-10 of 60 references

The Curious Robot: Learning Visual Representations via Physical Interactions

This work builds one of the first systems on a Baxter platform that pushes, pokes, grasps and observes objects in a tabletop environment, with each datapoint providing supervision to a shared ConvNet architecture allowing us to learn visual representations.
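
The shared-trunk idea can be sketched as one backbone feeding several heads, each supervised by a different physical interaction (grasp angle bin, push outcome, poke/touch response). The head shapes and bin counts below are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn
from torchvision import models


class InteractionNet(nn.Module):
    """Shared ConvNet trunk with one head per interaction signal."""

    def __init__(self, grasp_bins=18, push_dim=4, poke_dim=1):
        super().__init__()
        trunk = models.resnet18(weights=None)
        trunk.fc = nn.Identity()
        self.trunk = trunk                             # shared 512-d features
        self.grasp_head = nn.Linear(512, grasp_bins)   # grasp angle class
        self.push_head = nn.Linear(512, push_dim)      # push displacement
        self.poke_head = nn.Linear(512, poke_dim)      # tactile response

    def forward(self, image):
        f = self.trunk(image)
        return self.grasp_head(f), self.push_head(f), self.poke_head(f)


net = InteractionNet()
grasp, push, poke = net(torch.randn(2, 3, 224, 224))
print(grasp.shape, push.shape, poke.shape)
```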

Unsupervised Visual Representation Learning by Context Prediction

It is demonstrated that the feature representation learned using this within-image context indeed captures visual similarity across images and allows us to perform unsupervised visual discovery of objects like cats, people, and even birds from the Pascal VOC 2011 detection dataset.
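
The within-image context task amounts to classifying the spatial offset between two patches sampled from the same image (eight possible neighbor positions). The small shared patch encoder below is illustrative; the original work used an AlexNet-style network.

```python
import torch
import torch.nn as nn


class PatchEncoder(nn.Module):
    def __init__(self, out_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 5, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(128, out_dim)

    def forward(self, patch):
        return self.fc(self.conv(patch).flatten(1))


class RelativePositionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = PatchEncoder()            # shared between both patches
        self.classifier = nn.Linear(2 * 256, 8)  # 8 neighbor positions

    def forward(self, center_patch, neighbor_patch):
        feats = torch.cat([self.encoder(center_patch),
                           self.encoder(neighbor_patch)], dim=1)
        return self.classifier(feats)


net = RelativePositionNet()
logits = net(torch.randn(4, 3, 96, 96), torch.randn(4, 3, 96, 96))
print(logits.shape)  # torch.Size([4, 8])
```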

Learning to See by Moving

It is found that using the same number of training images, features learnt using egomotion as supervision compare favourably to features learnt with class-label as supervision on the tasks of scene recognition, object recognition, visual odometry and keypoint matching.
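
Egomotion supervision can be sketched as a shared backbone encoding two consecutive frames and a head predicting the camera motion between them. The 6-DoF regression below is a simplification; the original work discretized the motion into bins and used classification.

```python
import torch
import torch.nn as nn
from torchvision import models

# Sketch of egomotion as supervision; sizes are illustrative.
backbone = models.resnet18(weights=None)
backbone.fc = nn.Identity()
motion_head = nn.Linear(2 * 512, 6)           # 6-DoF camera motion

frames_t = torch.randn(8, 3, 224, 224)
frames_t1 = torch.randn(8, 3, 224, 224)
egomotion = torch.randn(8, 6)                 # from odometry / motion logs

pred = motion_head(torch.cat([backbone(frames_t), backbone(frames_t1)], dim=1))
loss = nn.functional.mse_loss(pred, egomotion)
loss.backward()                               # gradients also train the backbone
```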

Patch to the Future: Unsupervised Visual Prediction

This paper presents a conceptually simple but surprisingly powerful method for visual prediction which combines the effectiveness of mid-level visual elements with temporal modeling and shows that it is comparable to supervised methods for event prediction.

Visual Semantic Planning Using Deep Successor Representations

This work addresses the problem of visual semantic planning: the task of predicting a sequence of actions from visual observations that transform a dynamic environment from an initial state to a goal state, and develops a deep predictive model based on successor representations.
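
A tabular sketch of the successor-representation idea the method builds on: psi(s, a) accumulates expected discounted future state features via a TD update, and a task-specific weight vector w converts it into action values as Q(s, a) = psi(s, a) . w. The tiny random MDP below is purely illustrative.

```python
import numpy as np

num_states, num_actions, gamma, alpha = 5, 2, 0.9, 0.1
phi = np.eye(num_states)                    # one-hot state features
psi = np.zeros((num_states, num_actions, num_states))
w = np.random.rand(num_states)              # reward weights: r(s) = phi(s) . w

rng = np.random.default_rng(0)
s = 0
for _ in range(5000):
    a = rng.integers(num_actions)
    s_next = rng.integers(num_states)       # stand-in for real dynamics
    a_next = rng.integers(num_actions)      # next action under the policy
    td_target = phi[s] + gamma * psi[s_next, a_next]
    psi[s, a] += alpha * (td_target - psi[s, a])
    s = s_next

q_values = psi @ w                          # (states, actions) action values
print(q_values)
```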

Anticipating Visual Representations from Unlabeled Video

This work presents a framework that capitalizes on temporal structure in unlabeled video to learn to anticipate human actions and objects, and applies recognition algorithms to the predicted representation to anticipate objects and actions.
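
The anticipation recipe can be sketched as regressing the deep representation of a future frame from the current frame, then running an ordinary classifier on the predicted representation. The one-hidden-layer regressor and feature sizes below are illustrative choices.

```python
import torch
import torch.nn as nn
from torchvision import models

feature_net = models.resnet18(weights=None)
feature_net.fc = nn.Identity()                # 512-d frame features
anticipator = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(),
                            nn.Linear(1024, 512))
action_classifier = nn.Linear(512, 20)        # e.g. 20 future action classes

current = torch.randn(4, 3, 224, 224)
future = torch.randn(4, 3, 224, 224)          # frame several seconds later (training only)

with torch.no_grad():
    target_feat = feature_net(future)         # regression target
pred_feat = anticipator(feature_net(current))
regression_loss = nn.functional.mse_loss(pred_feat, target_feat)

future_action_logits = action_classifier(pred_feat)   # recognition on the forecast
print(regression_loss.item(), future_action_logits.shape)
```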

Visual Forecasting by Imitating Dynamics in Natural Sequences

A general framework for visual forecasting that directly imitates visual sequences without additional supervision by formulating visual forecasting as an inverse reinforcement learning (IRL) problem and imitating the dynamics in natural sequences directly from their raw pixel values.

Learning to Poke by Poking: Experiential Learning of Intuitive Physics

A novel approach based on deep neural networks is proposed for modeling the dynamics of a robot's interactions directly from images, by jointly estimating forward and inverse models of dynamics.
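
Jointly estimating forward and inverse dynamics can be sketched as two small networks over image features: the inverse model predicts the poke from a before/after feature pair, and the forward model predicts the after features from the before features plus that poke. The feature size and poke parameterization are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

feat_dim, poke_dim = 256, 4                 # poke = (x, y, angle, length)

inverse_model = nn.Sequential(nn.Linear(2 * feat_dim, 256), nn.ReLU(),
                              nn.Linear(256, poke_dim))
forward_model = nn.Sequential(nn.Linear(feat_dim + poke_dim, 256), nn.ReLU(),
                              nn.Linear(256, feat_dim))

before = torch.randn(8, feat_dim)           # features of the image before the poke
after = torch.randn(8, feat_dim)            # features of the image after the poke
poke = torch.randn(8, poke_dim)             # executed action, logged by the robot

poke_pred = inverse_model(torch.cat([before, after], dim=1))
after_pred = forward_model(torch.cat([before, poke], dim=1))

loss = (nn.functional.mse_loss(poke_pred, poke)
        + nn.functional.mse_loss(after_pred, after))
loss.backward()
```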

KrishnaCam: Using a longitudinal, single-person, egocentric dataset for scene understanding tasks

The ability to predict the near-future trajectory of the student is demonstrated in a broad set of outdoor situations, including following sidewalks, stopping to wait for a bus, taking a daily path to work, and remaining still while eating food.

Understanding egocentric activities

This work presents a method to analyze daily activities using video from an egocentric camera, and shows that joint modeling of activities, actions, and objects leads to superior performance in comparison to the case where they are considered independently.
...