Video Action Transformer Network
@inproceedings{Girdhar2018VideoAT,
  title     = {Video Action Transformer Network},
  author    = {Rohit Girdhar and Jo{\~a}o Carreira and Carl Doersch and Andrew Zisserman},
  booktitle = {2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2019},
  pages     = {244--253}
}
We introduce the Action Transformer model for recognizing and localizing human actions in video clips. We show that by using high-resolution, person-specific, class-agnostic queries, the model spontaneously learns to track individual people and to pick up on semantic context from the actions of others. Additionally, its attention mechanism learns to emphasize hands and faces, which are often crucial to discriminate an action, all without explicit supervision other than boxes and class labels…
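To make the mechanism the abstract describes concrete, here is a minimal single-head sketch of a person-specific query attending over the clip's spatio-temporal features. This is an illustrative PyTorch reconstruction, not the paper's implementation: the class name, dimensions, and the omission of multi-head attention, LayerNorm, and dropout are all assumptions.

```python
import torch
import torch.nn as nn

class ActionTransformerUnit(nn.Module):
    """Illustrative sketch: one attention unit where a person's RoI
    feature is the query and the clip's spatio-temporal features are
    the keys/values. Single-head and unnormalized for brevity."""
    def __init__(self, dim=128):
        super().__init__()
        self.q = nn.Linear(dim, dim)        # project the person (RoI) feature
        self.kv = nn.Linear(dim, 2 * dim)   # project the clip context features
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.scale = dim ** -0.5

    def forward(self, person_feat, context):
        # person_feat: (B, dim), RoI-pooled feature for one tracked person
        # context:     (B, T*H*W, dim), flattened spatio-temporal trunk features
        q = self.q(person_feat).unsqueeze(1)            # (B, 1, dim)
        k, v = self.kv(context).chunk(2, dim=-1)        # (B, T*H*W, dim) each
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        out = (attn @ v).squeeze(1)                     # (B, dim) attended context
        return person_feat + self.ffn(out)              # residual update of the query
```

Because the query is class-agnostic and tied to one person's box, whatever the attention weights latch onto (other actors, hands, faces) emerges from the classification loss alone, which is the behavior the abstract reports.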
463 Citations
Knowledge Fusion Transformers for Video Action Recognition
- Computer Science, ArXiv
- 2020
A self-attention-based feature enhancer is presented that fuses action knowledge into the 3D-Inception-based spatio-temporal context of the video clip to be classified.
ActionFormer: Localizing Moments of Actions with Transformers
- Computer Science, ECCV
- 2022
ActionFormer combines a multiscale feature representation with local self-attention and uses a lightweight decoder to classify every moment in time and estimate the corresponding action boundaries, yielding major improvements over prior work.
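As a rough illustration of the per-moment design described above, the following hypothetical sketch (names, channel sizes, and the single pyramid level are assumptions, not ActionFormer's actual code) classifies every time step and regresses distances to the action's start and end:

```python
import torch
import torch.nn as nn

class MomentHead(nn.Module):
    """Illustrative per-moment head: at each time step, predict class
    logits plus non-negative distances to the action start and end."""
    def __init__(self, dim=256, num_classes=20):
        super().__init__()
        self.cls = nn.Conv1d(dim, num_classes, kernel_size=3, padding=1)
        self.reg = nn.Conv1d(dim, 2, kernel_size=3, padding=1)  # (to_start, to_end)

    def forward(self, feats):                  # feats: (B, dim, T), one pyramid level
        logits = self.cls(feats)               # (B, num_classes, T)
        dists = torch.relu(self.reg(feats))    # (B, 2, T), clamped non-negative
        return logits, dists
```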
An Efficient Human Instance-Guided Framework for Video Action Recognition
- Computer Science, Sensors
- 2021
A new instance-level video action recognition framework is proposed that represents instance-level features using human boxes and keypoints; the resulting action region features are fed to the temporal action head network, which makes the framework more discriminative.
Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos
- Computer Science, ECCV
- 2020
This paper presents a spatio-temporal action recognition model trained with only video-level labels, which are significantly easier to annotate, and reports the first weakly-supervised results on the AVA dataset as well as state-of-the-art results among weakly-supervised methods on UCF101-24.
TxVAD: Improved Video Action Detection by Transformers
- Computer Science, ACM Multimedia
- 2022
A conceptually simple Transformer-based paradigm for video action detection that effectively removes the need for specialized components and achieves superior performance without using pre-trained person/object detectors, RPNs, or memory banks.
Actor-Transformers for Group Activity Recognition
- Computer Science, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
This paper proposes an actor-transformer model able to learn and selectively extract information relevant for group activity recognition, achieving state-of-the-art results on two publicly available benchmarks.
Learning Context for Weakly-Supervised Action Detection Using Graph Convolutional Networks
- Computer Science
- 2020
An architecture based on self-attention and Graph Convolutional Networks is introduced to model contextual cues, such as human-human and human-object interactions, and thereby improve the classification of human actions in video for the task of action detection.
Reformulating Zero-shot Action Recognition for Multi-label Actions (Supplementary Material)
- Computer Science
- 2022
Since the AVA dataset consists of multiple actors within one video and ZSAR focuses only on the classification task, we extract clips centered on the ground-truth bounding boxes for each actor in the…
Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation
- Computer Science, HCMA@MM
- 2022
This work enhances the vanilla transformer with a snippet actionness loss and a front block, dubbed the augmented transformer; this improves its ability to capture long-range dependencies and learn robust features for noisy action instances, and it outperforms state-of-the-art TAPG methods.
References
Showing 1-10 of 54 references
Action Recognition using Visual Attention
- Computer Science, NIPS
- 2015
A soft-attention-based model for action recognition in videos is proposed, built on multi-layered Recurrent Neural Networks with Long Short-Term Memory units that are deep both spatially and temporally.
Attentional Pooling for Action Recognition
- Computer Science, NIPS
- 2017
This work introduces a simple yet surprisingly powerful model to incorporate attention in action recognition and human-object interaction tasks, with a novel derivation of bottom-up and top-down attention as low-rank approximations of bilinear pooling methods (typically used for fine-grained classification).
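The low-rank claim admits a short worked form (the notation here is assumed for illustration, not quoted from the paper): with location features $X \in \mathbb{R}^{n \times f}$, second-order (bilinear) pooling scores class $k$ as $s_k = \langle W_k, X^\top X \rangle$, and constraining $W_k$ to rank one factors the score into a product of two per-location attention maps:

$$
s_k \;=\; \langle a_k b^\top,\; X^\top X \rangle \;=\; (X a_k)^\top (X b) \;=\; \sum_{i=1}^{n} (x_i^\top a_k)\,(x_i^\top b),
$$

where $x_i^\top a_k$ plays the role of class-specific (top-down) attention and $x_i^\top b$ that of a shared (bottom-up) saliency map.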
ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification
- Computer Science, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
A new video representation for action classification is proposed that aggregates local convolutional features across the entire spatio-temporal extent of the video; it outperforms baselines with comparable base architectures on the HMDB51, UCF101, and Charades video classification benchmarks.
Human Action Recognition: Pose-Based Attention Draws Focus to Hands
- Computer Science, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW)
- 2017
An extensive ablation study demonstrates the strengths of the approach and the conditioning aspect of its attention mechanism; the method is evaluated on the largest currently available human action recognition dataset, NTU-RGB+D, with state-of-the-art results.
Asynchronous Temporal Fields for Action Recognition
- Computer Science, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
This work proposes a fully-connected temporal CRF model for reasoning over various aspects of activities, including objects, actions, and intentions, where the potentials are predicted by a deep network.
VideoCapsuleNet: A Simplified Network for Action Detection
- Computer Science, NeurIPS
- 2018
A 3D capsule network for videos, called VideoCapsuleNet, is presented: a unified network for action detection that jointly performs pixel-wise action segmentation and action classification, and that introduces capsule-pooling in the convolutional capsule layer to make the voting algorithm tractable.
Human Activity Recognition with Pose-driven Attention to RGB
- Computer Science, BMVC
- 2018
It is argued that attention should shift to different hands at different time steps depending on the activity itself; state-of-the-art results are achieved on the largest dataset for human activity recognition, NTU-RGB+D.
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
- Computer Science, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
A new Two-Stream Inflated 3D ConvNet (I3D) based on 2D ConvNet inflation is introduced; I3D models considerably improve upon the state of the art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101 after pre-training on Kinetics.
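The "2D ConvNet inflation" mentioned in this summary is easy to state in code. A minimal sketch, assuming standard PyTorch kernel layouts (the function name is ours): repeat a pretrained 2D kernel along a new temporal axis and rescale so a temporally static video reproduces the 2D network's activations.

```python
import torch

def inflate_conv2d_weight(w2d: torch.Tensor, t: int) -> torch.Tensor:
    """Inflate a 2D conv kernel (out_ch, in_ch, kH, kW) into a 3D one
    (out_ch, in_ch, t, kH, kW) by repeating it t times along time and
    dividing by t, so a temporally constant input yields the same
    response as the original 2D filter."""
    return w2d.unsqueeze(2).repeat(1, 1, t, 1, 1) / t
```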
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
- Computer Science, ArXiv
- 2012
This work introduces UCF101, currently the largest dataset of human actions, and provides baseline action recognition results on this new dataset using a standard bag-of-words approach, with an overall performance of 44.5%.
The Kinetics Human Action Video Dataset
- Computer Science, ArXiv
- 2017
The dataset, its statistics, and how it was collected are described, and baseline performance figures are given for neural network architectures trained and tested for human action classification on it.