Video Action Transformer Network

Rohit Girdhar, João Carreira, Carl Doersch, Andrew Zisserman. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
We introduce the Action Transformer model for recognizing and localizing human actions in video clips. We show that by using high-resolution, person-specific, class-agnostic queries, the model spontaneously learns to track individual people and to pick up on semantic context from the actions of others. Additionally, its attention mechanism learns to emphasize hands and faces, which are often crucial for discriminating an action, all without explicit supervision other than boxes and class labels…
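The core mechanism the abstract describes is a person-specific query attending over the clip's spatio-temporal context features. A minimal sketch of one such cross-attention step is below; the projection matrices are random stand-ins for the paper's learned weights, and the function name and shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def person_query_attention(person_feat, context_feats, d_k=64, rng=None):
    """One cross-attention step: a query derived from a person box
    attends over spatio-temporal context features.
    person_feat: (d_in,) feature pooled from the person's box.
    context_feats: (N, d_in) features from the rest of the clip.
    Weights are random placeholders for learned projections."""
    rng = rng if rng is not None else np.random.default_rng(0)
    d_in = person_feat.shape[-1]
    # Hypothetical projection matrices standing in for trained weights.
    Wq = rng.standard_normal((d_in, d_k)) / np.sqrt(d_in)
    Wk = rng.standard_normal((d_in, d_k)) / np.sqrt(d_in)
    Wv = rng.standard_normal((d_in, d_k)) / np.sqrt(d_in)
    q = person_feat @ Wq                     # (d_k,) person-specific query
    K = context_feats @ Wk                   # (N, d_k) keys from the clip
    V = context_feats @ Wv                   # (N, d_k) values
    attn = softmax(q @ K.T / np.sqrt(d_k))   # (N,) weights over context
    return attn @ V, attn                    # attended feature + weights
```

Visualizing `attn` over the frame is what reveals the emergent focus on hands, faces, and other people that the abstract mentions.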


Knowledge Fusion Transformers for Video Action Recognition

A self-attention-based feature enhancer is presented that fuses action knowledge into the 3D-Inception-based spatio-temporal context of the video clip to be classified.

ActionFormer: Localizing Moments of Actions with Transformers

ActionFormer combines a multiscale feature representation with local self-attention and uses a lightweight decoder to classify every moment in time and estimate the corresponding action boundaries, yielding major improvements over prior work.

An Efficient Human Instance-Guided Framework for Video Action Recognition

A new human instance-level video action recognition framework is proposed that represents instance-level features using human boxes and keypoints; the resulting action-region features are used as inputs to the temporal action head network, making the framework more discriminative.

Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos

This paper presents a spatio-temporal action recognition model trained with only video-level labels, which are significantly easier to annotate; it reports the first weakly-supervised results on the AVA dataset and state-of-the-art results among weakly-supervised methods on UCF101-24.

TxVAD: Improved Video Action Detection by Transformers

A conceptually simple Transformer-based paradigm for video action detection, which removes the need for specialized components and achieves superior performance without using pre-trained person/object detectors, an RPN, or a memory bank.

Actor-Transformers for Group Activity Recognition

This paper proposes an actor-transformer model able to learn and selectively extract information relevant for group activity recognition, and achieves state-of-the-art results on two publicly available benchmarks for group activity recognition.

Learning Context for Weakly-Supervised Action Detection Using Graph Convolutional Networks

An architecture based on self-attention and Graph Convolutional Networks is introduced in order to model contextual cues, such as human-human and human-object interactions, so as to improve the classification of human actions in video for the task of action detection.

Reformulating Zero-shot Action Recognition for Multi-label Actions (Supplementary Material)

Since the AVA dataset consists of multiple actors within one video and ZSAR focuses only on the classification task, we extract clips centered on the ground-truth bounding boxes for each actor in the

Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation

This work enhances the vanilla transformer with a snippet-actionness loss and a front block, dubbed the augmented transformer; this improves its ability to capture long-range dependencies and learn robust features for noisy action instances, and it outperforms state-of-the-art TAPG methods.

Action Recognition using Visual Attention

A soft-attention-based model for action recognition in videos, using multi-layered Recurrent Neural Networks with Long Short-Term Memory units that are deep both spatially and temporally.

Attentional Pooling for Action Recognition

This work introduces a simple yet surprisingly powerful model to incorporate attention in action recognition and human object interaction tasks, and introduces a novel derivation of bottom-up and top-down attention as low-rank approximations of bilinear pooling methods (typically used for fine-grained classification).
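The low-rank derivation this summary refers to can be made concrete: approximating the per-class bilinear matrix as a rank-1 outer product, W_c ≈ a_c bᵀ, turns the second-order score Tr(X W_c Xᵀ) into a product of a class-specific (top-down) and a class-agnostic (bottom-up) attention map. A minimal sketch under that assumption (names and shapes are illustrative, not the paper's code):

```python
import numpy as np

def attentional_pool_scores(X, A, b):
    """Rank-1 attentional pooling.
    X: (N, d) spatial features; A: (d, C) per-class vectors a_c;
    b: (d,) shared class-agnostic vector.
    Returns (C,) class scores equal to Tr(X a_c b^T X^T)."""
    top_down = X @ A            # (N, C) class-specific attention maps
    bottom_up = X @ b           # (N,) shared bottom-up saliency map
    return (top_down * bottom_up[:, None]).sum(axis=0)
```

The payoff is that the quadratic bilinear form never has to be materialized: two linear maps and an elementwise product give the same score.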

ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification

A new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video and outperforms other baselines with comparable base architectures on HMDB51, UCF101, and Charades video classification benchmarks.
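The aggregation ActionVLAD performs is a VLAD-style soft assignment of every spatio-temporal feature to a set of learnable anchors, accumulating residuals per anchor. A minimal numpy sketch of that aggregation step (the function name, `alpha` sharpness parameter, and normalization details follow the common NetVLAD formulation and are assumptions, not the paper's exact code):

```python
import numpy as np

def actionvlad(features, centers, alpha=10.0):
    """VLAD-style aggregation over all spatio-temporal features.
    features: (N, d) local descriptors from the whole clip;
    centers: (K, d) learnable anchor points; alpha: assignment sharpness."""
    # Soft-assign each feature to the anchors.
    logits = alpha * features @ centers.T             # (N, K)
    logits -= logits.max(axis=1, keepdims=True)
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)                 # (N, K) assignments
    # Accumulate residuals (feature - center), weighted by assignment.
    resid = features[:, None, :] - centers[None]      # (N, K, d)
    V = (a[:, :, None] * resid).sum(axis=0)           # (K, d)
    # Intra-normalize per anchor, then flatten and L2-normalize.
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12
    v = V.ravel()
    return v / (np.linalg.norm(v) + 1e-12)            # (K*d,) clip descriptor
```

Because the sum runs over the entire clip, the descriptor pools evidence across all frames and spatial positions rather than a single crop.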

Human Action Recognition: Pose-Based Attention Draws Focus to Hands

An extensive ablation study shows the strengths of this approach and the conditioning aspect of the attention mechanism; the method is evaluated on the largest currently available human action recognition dataset, NTU-RGB+D, and reports state-of-the-art results.

Asynchronous Temporal Fields for Action Recognition

This work proposes a fully-connected temporal CRF model for reasoning over various aspects of activities that includes objects, actions, and intentions, where the potentials are predicted by a deep network.

VideoCapsuleNet: A Simplified Network for Action Detection

A 3D capsule network for videos, VideoCapsuleNet: a unified network for action detection that jointly performs pixel-wise action segmentation and action classification, introducing capsule-pooling in the convolutional capsule layer to make the voting algorithm tractable.

Human Activity Recognition with Pose-driven Attention to RGB

It is of high interest to shift attention to different hands at different time steps depending on the activity itself; state-of-the-art results are achieved on the largest dataset for human activity recognition, NTU-RGB+D.

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

A new Two-Stream Inflated 3D ConvNet (I3D), based on 2D ConvNet inflation, is introduced; I3D models considerably improve upon the state of the art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101 after pre-training on Kinetics.
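The inflation idea behind I3D is simple to sketch: a pretrained 2D convolution kernel is repeated along a new temporal axis and rescaled, so the 3D network initially reproduces the 2D network on a temporally constant video. A minimal version (function name and layout are illustrative assumptions):

```python
import numpy as np

def inflate_2d_kernel(w2d, time_len):
    """Inflate a 2D conv kernel (out_ch, in_ch, kh, kw) into a 3D kernel
    (out_ch, in_ch, T, kh, kw) by repeating it T times along a new
    temporal axis and dividing by T, so that convolving a video whose
    frames are all identical gives the same response as the 2D net."""
    w3d = np.repeat(w2d[:, :, None, :, :], time_len, axis=2)
    return w3d / time_len
```

Summing the inflated kernel over its temporal axis recovers the original 2D kernel exactly, which is what preserves the pretrained ImageNet initialization.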

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

This work introduces UCF101, currently the largest dataset of human actions, and provides baseline action recognition results on the new dataset using a standard bag-of-words approach, with an overall performance of 44.5%.

The Kinetics Human Action Video Dataset

The dataset, its statistics, and how it was collected are described, and baseline performance figures are given for neural network architectures trained and tested for human action classification on this dataset.