Corpus ID: 208267647

Forecasting Human Object Interaction: Joint Prediction of Motor Attention and Egocentric Activity

@article{Liu2019ForecastingHO,
  title={Forecasting Human Object Interaction: Joint Prediction of Motor Attention and Egocentric Activity},
  author={Miao Liu and Siyu Tang and Yin Li and James M. Rehg},
  journal={ArXiv},
  year={2019},
  volume={abs/1911.10967}
}
We address the challenging task of anticipating human-object interaction in first person videos. Most existing methods ignore how the camera wearer interacts with the objects, or simply consider body motion as a separate modality. In contrast, we observe that intentional hand movement reveals critical information about the future activity. Motivated by this, we adopt intentional hand movement as a future representation and propose a novel deep network that jointly models and predicts the…
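
The abstract is truncated above, but the core idea of a shared video encoder feeding both a motor-attention head and an action-anticipation head can be illustrated with a minimal PyTorch sketch. The module below is an assumption-laden stand-in, not the authors' actual network: the name JointAnticipationNet, the tiny 3D-conv encoder, the 14x14 attention grid, and the training notes in the comments are all hypothetical, chosen only to show how a spatial motor-attention map can be predicted jointly with the anticipated action and used to pool features for classification.

import torch
import torch.nn as nn

class JointAnticipationNet(nn.Module):
    """Hypothetical sketch: a shared clip encoder with two heads, one predicting
    a motor-attention heatmap (future hand location) and one predicting the
    anticipated action via attention-weighted pooling."""

    def __init__(self, feat_dim=256, num_actions=106):
        super().__init__()
        # Stand-in 3D-conv encoder; the paper would use a stronger video backbone.
        self.encoder = nn.Sequential(
            nn.Conv3d(3, feat_dim, kernel_size=(3, 7, 7),
                      stride=(1, 2, 2), padding=(1, 3, 3)),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((1, 14, 14)),   # collapse time, keep a 14x14 grid
        )
        self.attn_head = nn.Conv2d(feat_dim, 1, kernel_size=1)  # motor-attention logits
        self.action_head = nn.Linear(feat_dim, num_actions)     # anticipated-action logits

    def forward(self, clip):
        # clip: (B, 3, T, H, W) observed frames preceding the interaction
        feat = self.encoder(clip).squeeze(2)               # (B, C, 14, 14)
        attn_logits = self.attn_head(feat)                 # (B, 1, 14, 14)
        attn = torch.softmax(attn_logits.flatten(2), dim=-1).view_as(attn_logits)
        pooled = (feat * attn).sum(dim=(2, 3))             # attention-weighted pooling
        return attn, self.action_head(pooled)

# Usage on a dummy 8-frame clip; in training, attn could be supervised with a
# future hand-position heatmap and the action logits with cross-entropy.
model = JointAnticipationNet()
attn_map, action_logits = model(torch.randn(2, 3, 8, 112, 112))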


Forecasting Action through Contact Representations from First Person Video

Human visual understanding of action is reliant on anticipation of contact, as is demonstrated by pioneering work in cognitive science. Taking inspiration from this, we introduce representations and…

Predicting the Future from First Person (Egocentric) Vision: A Survey

In the Eye of the Beholder: Gaze and Actions in First Person Video

A novel deep model is proposed for joint gaze estimation and action recognition in FPV that can be applied to the larger-scale FPV dataset EPIC-Kitchens even without using gaze, offering new state-of-the-art results on FPV action recognition.

Untrimmed Action Anticipation

It is argued that, despite recent advances in the field, trimmed action anticipation has limited applicability in real-world scenarios, where it is important to deal with “untrimmed” video inputs and the exact moment at which an action will begin cannot be assumed to be known at test time; an untrimmed action anticipation task is therefore proposed.

Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction

This work proposes TransFusion, a multimodal transformer-based architecture that effectively makes use of the representational power of language by summarizing past actions concisely; it leverages pre-trained image captioning models and summarizes their captions.

Interactive Prototype Learning for Egocentric Action Recognition

An end-to-end Interactive Prototype Learning (IPL) framework is proposed to learn better active object representations by leveraging motion cues from the actor, and a set of verb prototypes is introduced to disentangle active object features from distracting object features.

Learning State-Aware Visual Representations from Audible Interactions

A novel self-supervised objective that learns from audible state changes caused by interactions is proposed, and improvements on several downstream tasks, including action recognition, long-term action anticipation, and object state change classification are shown.

Egocentric Object Manipulation Graphs

We introduce Egocentric Object Manipulation Graphs (Ego-OMG) - a novel representation for activity modeling and anticipation of near future actions integrating three components: 1) semantic temporal…

Review of Video Predictive Understanding: Early Action Recognition and Future Action Prediction

The major sub-areas of the broad field of video predictive understanding, which have recently received intensive attention and proven to have practical value, are introduced, and a thorough review of various early action recognition and future action prediction algorithms is provided with suitably organized divisions.

Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos

The "Ego-Exo" framework can be seamlessly integrated into standard video models; it outperforms all baselines when fine-tuned for egocentric activity recognition, achieving state-of-the-art results on Charades-Ego and EPIC-Kitchens-100.

References

Showing 1-10 of 71 references

Next-active-object prediction from egocentric videos

Predicting Gaze in Egocentric Video by Learning Task-dependent Attention Transition

A hybrid model based on deep neural networks, which integrates task-dependent attention transition with bottom-up saliency prediction, is proposed; it significantly outperforms state-of-the-art gaze prediction methods and is able to learn meaningful transitions of human attention.

What Would You Expect? Anticipating Egocentric Actions With Rolling-Unrolling LSTMs and Modality Attention

This work tackles the problem by proposing an architecture able to anticipate actions at multiple temporal scales, using two LSTMs to summarize the past and formulate predictions about the future, together with a novel Modality ATTention mechanism which learns to weigh modalities in an adaptive fashion.
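
The summary above compresses the rolling-unrolling idea into one sentence; a rough PyTorch sketch of that idea follows. It is an illustrative reduction under several assumptions (pre-extracted per-frame features, a single LSTM layer, the placeholder class name RollingUnrollingSketch, and a simplified MATT-style gate over hidden-state summaries), not the published RULSTM implementation.

import torch
import torch.nn as nn

class RollingUnrollingSketch(nn.Module):
    """Sketch: a 'rolling' LSTM summarizes observed frames, an 'unrolling' LSTM
    is initialized from that state and unrolled into the future, and a
    modality-attention gate fuses per-modality predictions."""

    def __init__(self, feat_dim=1024, hidden=512, num_actions=2513, num_modalities=2):
        super().__init__()
        self.rolling = nn.LSTM(feat_dim, hidden, batch_first=True)    # summarizes the past
        self.unrolling = nn.LSTM(feat_dim, hidden, batch_first=True)  # anticipates the future
        self.classifier = nn.Linear(hidden, num_actions)
        # MATT-style gate: maps concatenated modality summaries to fusion weights
        self.matt = nn.Linear(num_modalities * hidden, num_modalities)

    def forward(self, feats_per_modality, unroll_steps=4):
        # feats_per_modality: list of (B, T, feat_dim) tensors, one per modality
        logits, summaries = [], []
        for feats in feats_per_modality:
            _, (h, c) = self.rolling(feats)                  # encode observed frames
            summaries.append(h[-1])                          # (B, hidden)
            # unroll by re-feeding the last observed feature for a few steps
            last = feats[:, -1:, :].repeat(1, unroll_steps, 1)
            out, _ = self.unrolling(last, (h, c))
            logits.append(self.classifier(out[:, -1]))       # (B, num_actions)
        weights = torch.softmax(self.matt(torch.cat(summaries, dim=-1)), dim=-1)
        fused = sum(w.unsqueeze(-1) * l for w, l in zip(weights.unbind(-1), logits))
        return fused

# Usage with dummy RGB and optical-flow features (hypothetical shapes)
feats = [torch.randn(2, 16, 1024), torch.randn(2, 16, 1024)]
fused_logits = RollingUnrollingSketch()(feats)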

In the Eye of the Beholder: Gaze and Actions in First Person Video

A novel deep model is proposed for joint gaze estimation and action recognition in FPV that can be applied to the larger-scale FPV dataset EPIC-Kitchens even without using gaze, offering new state-of-the-art results on FPV action recognition.

Delving into egocentric actions

A novel set of egocentric features is presented, and it is shown how they can be combined with motion and object features, uncovering a significant performance boost over all previous state-of-the-art methods.

In the Eye of Beholder: Joint Learning of Gaze and Actions in First Person Video

A novel deep model is proposed for joint gaze estimation and action recognition in First Person Vision that describes the participant’s gaze as a probabilistic variable and models its distribution using stochastic units in a deep network to generate an attention map.
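
Modeling gaze as a probabilistic variable with stochastic units is commonly realized with a Gumbel-softmax relaxation over spatial locations; the snippet below is a hedged illustration of that general trick, not the paper's exact formulation. The function name stochastic_gaze_attention and the 7x7 feature grid are placeholders introduced only for this example.

import torch
import torch.nn.functional as F

def stochastic_gaze_attention(gaze_logits, tau=1.0, hard=False):
    """Sketch: treat gaze as a categorical variable over spatial locations and
    draw a differentiable sample with the Gumbel-softmax trick."""
    b, _, h, w = gaze_logits.shape                        # (B, 1, H, W) predicted gaze logits
    flat = gaze_logits.flatten(1)                         # (B, H*W)
    attn = F.gumbel_softmax(flat, tau=tau, hard=hard)     # one (soft) sample per clip
    return attn.view(b, 1, h, w)

# Usage: weight video features by a sampled gaze map before action classification
feat = torch.randn(2, 256, 7, 7)
attn = stochastic_gaze_attention(torch.randn(2, 1, 7, 7))
pooled = (feat * attn).sum(dim=(2, 3))                    # (B, 256)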

Egocentric Activity Prediction via Event Modulated Attention

This work explicitly addresses issues in state-of-the-art egocentric activity understanding techniques by proposing an asynchronous gaze-event driven attentive activity prediction network, built on a gaze-event extraction module inspired by the fact that gaze moving in or out of a certain object most probably indicates the occurrence or ending of a certain activity.

Cascaded Interactional Targeting Network for Egocentric Video Analysis

A novel EM-like learning framework is proposed to train the pixel-level deep convolutional neural network (DCNN) by seamlessly integrating weakly supervised data with a small set of strongly supervised data to achieve state-of-the-art hand segmentation performance.

Going Deeper into First-Person Activity Recognition

By learning to recognize objects, actions and activities jointly, the performance of the individual recognition tasks also increases, by 30% (actions) and 14% (objects); the results of extensive ablative analysis are included to highlight the importance of network design decisions.

Understanding egocentric activities

This work presents a method to analyze daily activities using video from an egocentric camera, and shows that joint modeling of activities, actions, and objects leads to superior performance in comparison to the case where they are considered independently.
...