Integrating Human Gaze into Attention for Egocentric Activity Recognition

  • Kyle Min, Jason J. Corso
  • Published 8 November 2020
  • Computer Science
  • 2021 IEEE Winter Conference on Applications of Computer Vision (WACV)
It is well known that human gaze carries significant information about visual attention. However, there are three main difficulties in incorporating the gaze data in an attention mechanism of deep neural networks: (i) the gaze fixation points are likely to have measurement errors due to blinking and rapid eye movements; (ii) it is unclear when and how much the gaze data is correlated with visual attention; and (iii) gaze data is not always available in many real-world situations. In this work…

With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition
A transformer-based multimodal model is proposed that ingests video and audio as input modalities, with an explicit language model providing action sequence context to enhance the predictions.
Accessing Passersby Proxemic Signals through a Head-Worn Camera: Opportunities and Limitations for the Blind
Analysis of data collected in a study with blind and sighted participants provides insights into dyadic behaviors for assistive pedestrian detection and leads to implications for the design of future head-worn cameras and interactions.
Leveraging Human Selective Attention for Medical Image Analysis with Limited Training Data
  • Yifei Huang, Xiaoxiao Li, +6 authors Yoichi Sato
  • Computer Science, Engineering
  • 2021
Human gaze is cost-efficient physiological data that reveals underlying human attentional patterns. The selective attention mechanism helps the cognition system focus on task-relevant visual clues…
Stacked Temporal Attention: Improving First-person Action Recognition by Emphasizing Discriminative Clips
  • Lijin Yang, Yifei Huang, Yusuke Sugano, Yoichi Sato
  • Computer Science
  • 2021
First-person action recognition is a challenging task in video understanding. Because of strong ego-motion and a limited field of view, many backgrounds or noisy frames in a first-person video can…


Learning to Recognize Daily Actions Using Gaze
An inference method is presented that can predict the best sequence of gaze locations and the associated action label from an input sequence of images and demonstrates improvements in action recognition rates and gaze prediction accuracy relative to state-of-the-art methods.
Mutual Context Network for Jointly Estimating Egocentric Gaze and Action
A novel mutual context network (MCN) is proposed that jointly learns action-dependent gaze prediction and gaze-guided action recognition in an end-to-end manner and achieves state-of-the-art performance on both gaze prediction and action recognition.
Learning Spatiotemporal Attention for Egocentric Action Recognition
The experimental results demonstrate that the proposed spatiotemporal attention module is able to outperform the state-of-the-art methods by a large margin on the standard EGTEA Gaze+ dataset and produce attention maps that are consistent with human gaze.
In the Eye of Beholder: Joint Learning of Gaze and Actions in First Person Video
A novel deep model is proposed for joint gaze estimation and action recognition in First Person Vision that describes the participant’s gaze as a probabilistic variable and models its distribution using stochastic units in a deep network to generate an attention map.
Attention is All We Need: Nailing Down Object-centric Attention for Egocentric Activity Recognition
An end-to-end trainable deep neural network model for egocentric activity recognition is proposed that surpasses the currently best-performing method, which leverages strong supervision from hand segmentation and object location for training, by up to 6 percentage points in recognition accuracy.
Multitask Learning to Improve Egocentric Action Recognition
This work tackles action recognition in egocentric videos by learning the verbs and nouns that make up action labels and by predicting coordinates that capture the hand locations and the gaze-based visual saliency for all frames of the input video segments.
Egocentric Activity Prediction via Event Modulated Attention
This work explicitly addresses issues in state-of-the-art egocentric activity understanding techniques by proposing an asynchronous gaze-event-driven attentive activity prediction network, built on a gaze-event extraction module inspired by the fact that gaze moving in or out of a certain object most probably indicates the start or end of a certain activity.
Gaze cueing of attention: visual attention, social cognition, and individual differences.
This review aims to provide a comprehensive overview of past and current research into the perception of gaze behavior and its effect on the observer, including the gaze-cueing paradigm that has been used to investigate the mechanisms of joint attention.
Delving into egocentric actions
A novel set of egocentric features is presented, together with a demonstration of how they can be combined with motion and object features, yielding a significant performance boost over all previous state-of-the-art methods.
Going Deeper into First-Person Activity Recognition
By learning to recognize objects, actions, and activities jointly, the performance of the individual recognition tasks also increases by 30% (actions) and 14% (objects), and the results of extensive ablative analysis are included to highlight the importance of network design decisions.