Detecting Events and Key Actors in Multi-person Videos

Vignesh Ramanathan, Jonathan Huang, Sami Abu-El-Haija, Alexander N. Gorban, Kevin P. Murphy, Li Fei-Fei. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Multi-person event recognition is a challenging task, often with many people active in the scene but only a small subset contributing to an actual event. In this paper, we propose a model which learns to detect events in such videos while automatically "attending" to the people responsible for the event. Our model does not use explicit annotations regarding who or where those people are during training and testing. In particular, we track people in videos and use a recurrent neural network (RNN… 
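The attention idea sketched in the abstract — pooling per-person features with learned weights so that the people responsible for the event dominate the representation — can be illustrated with a minimal soft-attention step. This is a hedged sketch, not the paper's exact formulation (the paper learns attention jointly with recurrent networks over person tracks); the function names and the dot-product scoring here are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_to_players(player_feats, query):
    """Soft attention over per-player features.

    player_feats: (P, D) array, one feature vector per tracked person
    query: (D,) event-level state (e.g., an RNN hidden state)
    Returns the attention-weighted pooled feature and the weights.
    """
    scores = player_feats @ query      # (P,) relevance score per player
    weights = softmax(scores)          # normalize scores to a distribution
    pooled = weights @ player_feats    # (D,) weighted sum of player features
    return pooled, weights

# Toy example: 5 tracked players with 8-dimensional features.
rng = np.random.default_rng(0)
feats = rng.standard_normal((5, 8))
query = rng.standard_normal(8)
pooled, w = attend_to_players(feats, query)
```

Because the weights form a distribution over people, they can be inspected directly to see which player the model "attends" to for a given event, which is how the qualitative analyses in the citing papers below visualize attention.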


Interaction Classification with Key Actor Detection in Multi-Person Sports Videos

A model that tackles interaction recognition in multi-person videos using a recurrent neural network equipped with a time-varying attention mechanism, including a qualitative analysis of the mechanism by visualizing the attention weights.

Global Motion Pattern Based Event Recognition in Multi-person Videos

Experimental analysis demonstrates that the proposed GMP-based event recognition algorithm can exploit the spatial and temporal characteristics of GMPs for effective event recognition.

Discovering Multi-Label Actor-Action Association in a Weakly Supervised Setting

This work proposes a baseline based on multi-instance and multi-label learning, and a novel approach that represents actions as sets rather than modeling individual action classes.

Unsupervised Temporal Feature Aggregation for Event Detection in Unstructured Sports Videos

This paper identifies and solves two major problems: unsupervised identification of players in an unstructured setting, and generalization of the trained models to pose variations caused by arbitrary shooting angles. It proposes a temporal feature aggregation algorithm that uses person re-identification features to obtain high player-retrieval precision.

stagNet: An Attentive Semantic RNN for Group Activity and Individual Action Recognition

A novel attentive semantic recurrent neural network (RNN), namely, stagNet, is presented for understanding group activities and individual actions in videos, by combining the spatio-temporal attention mechanism and semantic graph modeling.

A Survey on Video Action Recognition in Sports: Datasets, Methods and Applications

To understand human behaviors, action recognition based on videos is a common approach. Compared with image-based action recognition, videos provide much more information. Reducing the ambiguity of…

Spatiotemporal Multi-Task Network for Human Activity Understanding

A spatiotemporal, multi-task, 3D deep convolutional neural network is proposed to detect actions in untrimmed videos, and a novel video representation, interlaced images, is introduced as an additional network input stream to better exploit the rich motion information in videos.

Soccer Video Event Detection Based on Deep Learning

A model is proposed that detects events in long soccer games in a single pass through the video; combined with replay detection, it generates story clips that carry more complete temporal context, meeting audiences’ needs.

stagNet: An Attentive Semantic RNN for Group Activity Recognition

A novel attentive semantic recurrent neural network (RNN) for understanding group activities in videos, dubbed stagNet, is proposed. It is based on spatio-temporal attention and a semantic graph, and attends to key persons/frames for improved performance.

Group event recognition in ice hockey

A possible solution for event detection in a more general setting is provided, and two models are proposed that combine the features of all players in a scene through an attention mechanism, producing promising results.

A large-scale benchmark dataset for event recognition in surveillance video

We introduce a new large-scale video dataset designed to assess the performance of diverse visual event recognition algorithms with a focus on continuous visual event recognition (CVER) in outdoor

Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities

  • M. Ryoo, J. Aggarwal
  • 2009 IEEE 12th International Conference on Computer Vision, 2009
A novel matching scheme, the spatio-temporal relationship match, is designed to measure structural similarity between sets of features extracted from two videos, thereby enabling detection and localization of complex non-periodic activities.

Action and Event Recognition with Fisher Vectors on a Compact Feature Set

This work finds that MBH features alone are enough for state-of-the-art performance on basic action recognition and localization, while for complex events SIFT and MFCC features provide complementary cues.

Learning realistic human actions from movies

A new method for video classification is presented that builds upon and extends several recent ideas, including local space-time features, space-time pyramids, and multi-channel non-linear SVMs, and is shown to improve state-of-the-art results on the standard KTH action dataset.

Explicit Modeling of Human-Object Interactions in Realistic Videos

This work introduces an approach for learning human actions as interactions between persons and objects in realistic videos. It explicitly localizes in space and tracks over time both the object and the person, and represents an action as the trajectory of the object w.r.t. the human.

C3D: Generic Features for Video Analysis

Convolutional 3D (C3D) features are proposed: generic spatio-temporal features obtained by training a deep 3-dimensional convolutional network on a large annotated video dataset comprising objects, scenes, actions, and other frequently occurring concepts. They encapsulate appearance and motion cues and perform well on different video classification tasks.

Video Action Detection with Relational Dynamic-Poselets

A relational model for action detection that first decomposes human action into temporal “key poses” and then further into spatial “action parts”. It not only localizes the action in a video stream, but also enables a detailed pose estimation of the actor.

Evaluation of Local Spatio-temporal Features for Action Recognition

It is demonstrated that regular sampling of space-time features consistently outperforms all tested space-time interest point detectors for human actions in realistic settings, and that the ranking of most methods is consistent across different datasets.

Discovering discriminative action parts from mid-level video representations

This work describes a mid-level approach for action recognition that forms clusters of trajectories serving as candidates for the parts of an action, and illustrates its potential for fine-grained analysis that not only labels a video but also identifies and localizes its constituent parts.

Trajectory-Based Modeling of Human Actions with Motion Reference Points

This paper proposes a simple representation aimed specifically at modeling human actions in videos. It operates on top of visual codewords derived from local patch trajectories, and therefore does not require accurate foreground-background separation, a step typically necessary for modeling object relationships.