Corpus ID: 152283010

VideoGraph: Recognizing Minutes-Long Human Activities in Videos

@article{Hussein2019VideoGraphRM,
  title={VideoGraph: Recognizing Minutes-Long Human Activities in Videos},
  author={Noureldien Hussein and Efstratios Gavves and Arnold W. M. Smeulders},
  journal={ArXiv},
  year={2019},
  volume={abs/1905.05143}
}
Many human activities take minutes to unfold. [...] Key Method: VideoGraph learns a graph-based representation for human activities. The graph, its nodes, and its edges are learned entirely from video datasets, making VideoGraph applicable to problems without node-level annotation. The result is improvements over related work on the Epic-Kitchens and Breakfast benchmarks. We also demonstrate that VideoGraph is able to learn the temporal structure of human activities in minutes-long videos.
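As a rough illustration of the central idea, the PyTorch sketch below soft-assigns per-timestep features to a bank of learned latent node embeddings, the kind of node representation that needs no node-level annotation. The layer sizes, similarity function, and readout here are assumptions for illustration, not the authors' implementation.

    import torch
    import torch.nn as nn

    class LatentNodeAttention(nn.Module):
        # Soft-assign per-timestep segment features to a bank of learned latent nodes.
        def __init__(self, feat_dim=1024, num_nodes=128):
            super().__init__()
            # Node embeddings are learned end-to-end, so no node-level labels are needed.
            self.nodes = nn.Parameter(torch.randn(num_nodes, feat_dim) * 0.01)

        def forward(self, x):
            # x: (batch, timesteps, feat_dim) features of uniformly sampled video segments.
            sim = torch.einsum('btd,nd->btn', x, self.nodes)   # similarity to each node
            attn = sim.softmax(dim=-1)                         # soft assignment per timestep
            # Node activations over time, which a temporal layer can then read off.
            return attn.unsqueeze(-1) * self.nodes             # (batch, timesteps, nodes, feat_dim)

    features = torch.randn(2, 64, 1024)           # 2 videos, 64 segments each (illustrative sizes)
    print(LatentNodeAttention()(features).shape)  # torch.Size([2, 64, 128, 1024])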
Citations

Long-term Behaviour Recognition in Videos with Actor-focused Region Attention
TLDR: The Multi-Regional fine-tuned 3D-CNN, topped with Actor Focus and Region Attention, largely improves the performance of baseline 3D architectures, achieving state-of-the-art results on Breakfast, a well-known long-term activity recognition benchmark.
Temporal Relational Modeling with Self-Supervision for Action Segmentation
TLDR: This paper introduces an effective GCN module, the Dilated Temporal Graph Reasoning Module (DTGRM), designed to model temporal relations and dependencies between video frames at various time spans; it outperforms state-of-the-art action segmentation models on three challenging datasets.
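For intuition, the sketch below builds a dilated temporal adjacency, linking each frame to neighbours one dilation step away, and applies a single graph-convolution step over frame features. The dilation value, feature size, and single linear layer are illustrative assumptions rather than the DTGRM design.

    import torch

    def dilated_temporal_adjacency(num_frames, dilation):
        # Connect each frame to the frames `dilation` steps before and after it, plus self-loops.
        idx = torch.arange(num_frames)
        adj = torch.eye(num_frames)
        for offset in (-dilation, dilation):
            j = idx + offset
            valid = (j >= 0) & (j < num_frames)
            adj[idx[valid], j[valid]] = 1.0
        # Row-normalise so each frame averages over its temporal neighbourhood.
        return adj / adj.sum(dim=1, keepdim=True)

    frames = torch.randn(32, 256)                      # 32 frames with 256-d features
    adj = dilated_temporal_adjacency(32, dilation=4)
    project = torch.nn.Linear(256, 256)
    out = adj @ project(frames)                        # one graph-convolution step
    print(out.shape)                                   # torch.Size([32, 256])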
Activity Graph Transformer for Temporal Action Localization
We introduce Activity Graph Transformer, an end-to-end learnable model for temporal action localization that receives a video as input and directly predicts the set of action instances that appear in it.
Temporal localization of actions in untrimmed videos
Action recognition is the process of identifying actions performed by one or more actors in a given context based on some observations. Actions come in all shapes and sizes, be it a simple action ...
No frame left behind: Full Video Action Recognition
TLDR: This work proposes full-video action recognition that considers all video frames, relying on temporally localized clustering in combination with fast Hamming distances in feature space to keep this computationally tractable.
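To make the clustering idea concrete, the sketch below binarises per-frame features by sign and groups consecutive frames whose Hamming distance to the current cluster's anchor stays small. The sign hashing, the threshold, and the anchor rule are assumptions for illustration, not the paper's exact procedure.

    import numpy as np

    def hamming_cluster(features, threshold=0.25):
        # Group consecutive frames whose binarised features stay within a Hamming radius.
        codes = (features > 0).astype(np.uint8)    # sign binarisation (an assumed hashing scheme)
        labels = np.empty(len(codes), dtype=int)
        labels[0], anchor, cluster = 0, codes[0], 0
        for t in range(1, len(codes)):
            # Cheap normalised Hamming distance to the current cluster's anchor frame.
            if np.mean(codes[t] != anchor) > threshold:
                cluster += 1                       # content drifted: open a new temporal cluster
                anchor = codes[t]
            labels[t] = cluster
        return labels

    frame_feats = np.random.randn(1000, 512)       # e.g. per-frame CNN features
    print(hamming_cluster(frame_feats)[:20])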
PGT: A Progressive Method for Training Models on Long Videos
TLDR: This work proposes to treat videos as serial fragments satisfying the Markov property and to train on them as a whole by progressively propagating information through the temporal dimension in multiple steps, which makes it possible to train on long videos end-to-end with limited resources while ensuring the effective transmission of information.
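The sketch below illustrates that training regime in a minimal form: a long video is split into serial fragments, and the recurrent state is carried forward between fragments but detached so each backward pass covers only one fragment. The GRU backbone, fragment length, and optimizer are assumptions, not the PGT architecture.

    import torch
    import torch.nn as nn

    encoder = nn.GRU(input_size=512, hidden_size=256, batch_first=True)
    classifier = nn.Linear(256, 10)
    optimizer = torch.optim.SGD(list(encoder.parameters()) + list(classifier.parameters()), lr=0.01)

    video = torch.randn(1, 4000, 512)              # one long video of per-frame features
    target = torch.tensor([3])                     # its activity label
    hidden = None
    for fragment in video.split(500, dim=1):       # serial fragments, processed in order
        out, hidden = encoder(fragment, hidden)
        loss = nn.functional.cross_entropy(classifier(out[:, -1]), target)
        loss.backward()                            # gradients stay within this fragment
        optimizer.step()
        optimizer.zero_grad()
        hidden = hidden.detach()                   # Markov assumption: carry state, not gradients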
Coarse Temporal Attention Network (CTA-Net) for Driver’s Activity Recognition
TLDR: This work proposes a novel framework that exploits spatiotemporal attention to model the subtle changes in a driver's activities, outperforming the state of the art by a considerable margin with only RGB video as input.
TimeGate: Conditional Gating of Segments in Long-range Activities
TLDR: TimeGate reduces the computation of existing CNNs on three benchmarks for long-range activities (Charades, Breakfast, and MultiThumos), cutting the computation of I3D by 50% while maintaining classification accuracy.
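A minimal sketch of that kind of conditional gating is given below: a lightweight module scores each segment from cheap features, and only the top-scoring segments are forwarded to the expensive backbone. The module sizes, keep ratio, and hard top-k selection are assumptions, not TimeGate's gating mechanism.

    import torch
    import torch.nn as nn

    cheap_encoder = nn.Linear(512, 64)               # stand-in for a lightweight per-segment encoder
    gate = nn.Linear(64, 1)

    def select_segments(segment_feats, keep_ratio=0.5):
        # segment_feats: (num_segments, 512) cheap features, one row per video segment.
        scores = gate(torch.relu(cheap_encoder(segment_feats))).squeeze(-1)
        k = max(1, int(keep_ratio * len(segment_feats)))
        keep = scores.topk(k).indices.sort().values  # keep temporal order of selected segments
        return segment_feats[keep], keep             # only these go through the heavy 3D CNN

    feats = torch.randn(16, 512)
    selected, idx = select_segments(feats)
    print(selected.shape, idx.tolist())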
RhyRNN: Rhythmic RNN for Recognizing Events in Long and Complex Videos
TLDR: This work proposes the Rhythmic RNN (RhyRNN), which is capable of handling long video sequences and capturing rhythms at different scales, together with two novel modules, diversity-driven pooling (DivPool) and bilinear reweighting (BR), which consistently and hierarchically abstract higher-level information.
NAS-TC: Neural Architecture Search on Temporal Convolutions for Complex Action Recognition
TLDR: This work proposes a new processing framework called Neural Architecture Search-Temporal Convolutional (NAS-TC), divided into two phases, which yields more reasonable parameter assignments and can handle minute-level videos.

References

Showing 10 of 63 references.
VideoLSTM convolves, attends and flows for action recognition
TLDR: This work presents a new architecture for end-to-end sequence learning of actions in video, called VideoLSTM, and introduces motion-based attention, which can also be used for action localization relying on just the action class label.
Videos as Space-Time Region Graphs
TLDR: The proposed graph representation achieves state-of-the-art results on the Charades and Something-Something datasets and obtains a huge gain when the model is applied in complex environments.
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.
Video Action Transformer Network
TLDR: The Action Transformer model for recognizing and localizing human actions in video clips is introduced; it is shown that, by using high-resolution, person-specific, class-agnostic queries, the model spontaneously learns to track individual people and to pick up on semantic context from the actions of others.
ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification
TLDR: A new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video and outperforms other baselines with comparable base architectures on the HMDB51, UCF101, and Charades video classification benchmarks.
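For context, the sketch below shows a generic VLAD-style pooling of local features into a single fixed-size video descriptor via soft assignment to learned cluster centres; the assignment layer, cluster count, and normalisation are assumptions, and this is not ActionVLAD's exact layer.

    import torch
    import torch.nn as nn

    class SoftVLAD(nn.Module):
        # Aggregate many local features into one descriptor via learned cluster centres.
        def __init__(self, dim=512, clusters=32):
            super().__init__()
            self.centres = nn.Parameter(torch.randn(clusters, dim) * 0.1)
            self.assign = nn.Linear(dim, clusters)

        def forward(self, x):
            # x: (num_local_features, dim), gathered over all frames and spatial positions.
            a = self.assign(x).softmax(dim=-1)               # (N, K) soft cluster assignments
            residuals = x.unsqueeze(1) - self.centres        # (N, K, dim) residuals to each centre
            vlad = (a.unsqueeze(-1) * residuals).sum(dim=0)  # (K, dim) weighted residual sums
            return nn.functional.normalize(vlad.flatten(), dim=0)  # one video-level descriptor

    local_feats = torch.randn(4096, 512)    # e.g. conv features from many frames
    print(SoftVLAD()(local_feats).shape)    # torch.Size([16384])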
Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos
TLDR: A novel variant of long short-term memory deep networks is defined for modeling temporal relations via multiple input and output connections, and it is shown that this model improves action labeling accuracy and further enables deeper understanding tasks ranging from structured retrieval to action prediction.
Long-Term Feature Banks for Detailed Video Understanding
TLDR: This paper proposes a long-term feature bank, supportive information extracted over the entire span of a video, to augment state-of-the-art video models that would otherwise only view short clips of 2-5 seconds.
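The sketch below shows the general pattern in a minimal form: features of the current short clip attend over a bank of features precomputed across the whole video and are concatenated with the resulting long-term context. The dot-product attention and dimensions are illustrative assumptions, not the paper's feature bank operator.

    import torch

    def attend_to_feature_bank(clip_feat, bank, dim=512):
        # clip_feat: (1, dim) features of the current 2-5 second clip.
        # bank:      (num_entries, dim) features extracted beforehand over the whole video.
        attn = (clip_feat @ bank.t() / dim ** 0.5).softmax(dim=-1)  # relevance of each bank entry
        context = attn @ bank                                       # long-term context vector
        return torch.cat([clip_feat, context], dim=-1)              # augment the short-term view

    clip = torch.randn(1, 512)
    feature_bank = torch.randn(120, 512)        # e.g. one entry per second of a 2-minute video
    print(attend_to_feature_bank(clip, feature_bank).shape)   # torch.Size([1, 1024])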
AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions
  • C. Gu, Chen Sun, +8 authors J. Malik
  • Computer Science
  • 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
  • 2018
TLDR: The AVA dataset densely annotates 80 atomic visual actions in 437 15-minute video clips, where actions are localized in space and time, resulting in 1.59M action labels, with multiple labels per person occurring frequently.
Moments in Time Dataset: One Million Videos for Event Understanding
TLDR: The Moments in Time dataset, a large-scale human-annotated collection of one million short videos corresponding to dynamic events unfolding within three seconds, can serve as a new challenge to develop models that scale to the level of complexity and abstract reasoning that a human processes on a daily basis.
The Kinetics Human Action Video Dataset
TLDR: The dataset and its statistics are described, along with how it was collected, and baseline performance figures are given for neural network architectures trained and tested for human action classification on this dataset.