Long-Term Feature Banks for Detailed Video Understanding

@article{Wu2019LongTermFB,
  title={Long-Term Feature Banks for Detailed Video Understanding},
  author={Chao-Yuan Wu and Christoph Feichtenhofer and Haoqi Fan and Kaiming He and Philipp Kr{\"a}henb{\"u}hl and Ross B. Girshick},
  journal={2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2019},
  pages={284-293}
}
To understand the world, we humans constantly need to relate the present to the past, and put events in context. [...] Our experiments demonstrate that augmenting 3D convolutional networks with a long-term feature bank yields state-of-the-art results on three challenging video datasets: AVA, EPIC-Kitchens, and Charades. Code is available online.
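
The mechanism the abstract describes, augmenting a short-clip 3D CNN with supportive features computed over the entire video, can be pictured as an attention block in which the current clip's features query a precomputed bank. The PyTorch sketch below is a minimal illustration under that reading, not the authors' implementation; the module name, feature dimensions, and the use of stock multi-head attention are assumptions.

import torch
import torch.nn as nn

class FeatureBankAttention(nn.Module):
    # Sketch of a feature-bank operator: short-term clip features attend
    # over a long-term bank precomputed across the whole video.
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, short_term, bank):
        # short_term: (B, Ns, dim) features from the current 2-5 second clip
        # bank:       (B, Nl, dim) features spanning the entire video
        ctx, _ = self.attn(query=short_term, key=bank, value=bank)
        # Residual: augment, rather than replace, the clip features.
        return self.norm(short_term + ctx)

fbo = FeatureBankAttention(dim=512)
clip = torch.randn(2, 16, 512)   # illustrative clip features
bank = torch.randn(2, 100, 512)  # illustrative long-term bank
out = fbo(clip, bank)            # (2, 16, 512) context-augmented features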

Citations

Towards Long-Form Video Understanding
TLDR
A framework for modeling long-form videos and evaluation protocols on large-scale datasets are introduced, and it is shown that existing state-of-the-art short-term models are limited for long-form tasks.
ViViT: A Video Vision Transformer
TLDR
This work presents pure-transformer based models for video classification, drawing upon the recent success of such models in image classification, and shows how to effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
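
As a rough rendering of the pure-transformer design, a video can be tokenized with a 3D-convolutional tubelet embedding and passed through a standard transformer encoder. The sketch below uses stock PyTorch modules; the tubelet size, model width, and four-layer encoder are illustrative assumptions, not the paper's configuration.

import torch
import torch.nn as nn

class TubeletEmbed(nn.Module):
    # A 3D convolution with stride equal to its kernel turns a video into
    # a sequence of spatio-temporal patch tokens.
    def __init__(self, dim=192, patch=(2, 16, 16)):
        super().__init__()
        self.proj = nn.Conv3d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, video):                # (B, 3, T, H, W)
        x = self.proj(video)                 # (B, dim, T/2, H/16, W/16)
        return x.flatten(2).transpose(1, 2)  # (B, num_tokens, dim)

tokens = TubeletEmbed()(torch.randn(1, 3, 16, 224, 224))
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=192, nhead=3, batch_first=True),
    num_layers=4)
out = encoder(tokens)  # (1, 8*14*14, 192) contextualized video tokens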
Long-term Behaviour Recognition in Videos with Actor-focused Region Attention
TLDR
The Multi-Regional fine-tuned 3D-CNN, topped with Actor Focus and Region Attention, largely improves the performance of baseline 3D architectures, achieving state-of-the-art results on Breakfast, a well-known long-term activity recognition benchmark.
Focused Attention for Action Recognition
TLDR
This work relies on network-specific saliency outputs to drive an attention model that provides tighter crops around relevant video regions and shows how this strategy improves performance for the action recognition task.
VideoGraph: Recognizing Minutes-Long Human Activities in Videos
TLDR
The graph, its nodes, and its edges are learned entirely from video datasets, making VideoGraph applicable to problems without node-level annotation; it is demonstrated that VideoGraph is able to learn the temporal structure of human activities in minutes-long videos.
Action recognition on continuous video
TLDR
A novel framework, named long-term video action recognition (LVAR), performs generic action classification in continuous video by introducing a partial recurrence connection that propagates information within every layer of a spatial-temporal network, such as the well-known C3D.
A Comprehensive Study of Deep Video Action Recognition
TLDR
A comprehensive survey of over 200 existing papers on deep learning for video action recognition is provided, starting with early attempts at adapting deep learning, then to the two-stream networks, followed by the adoption of 3D convolutional kernels, and finally to the recent compute-efficient models.
Listen to Look: Action Recognition by Previewing Audio
TLDR
A framework for efficient action recognition in untrimmed video that uses audio as a preview mechanism to eliminate both short-term and long-term visual redundancies is proposed, and an ImgAud2Vid framework is devised that hallucinates clip-level features by distilling from lighter modalities, reducing short-term temporal redundancy for efficient video-level recognition.
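
The previewing idea reduces to: score every clip with a cheap audio network, then spend the expensive visual computation only on the top-scoring clips. In the sketch below both networks are hypothetical stand-ins; only the select-then-recognize control flow reflects the summary above.

import torch
import torch.nn as nn

audio_scorer = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
visual_model = nn.Linear(2048, 400)  # stand-in for a heavy 3D CNN head

audio_feats = torch.randn(30, 128)    # one cheap audio embedding per clip
visual_feats = torch.randn(30, 2048)  # in practice computed only on demand

scores = audio_scorer(audio_feats).squeeze(-1)     # (30,) clip salience
topk = scores.topk(k=5).indices                    # preview selects 5 clips
logits = visual_model(visual_feats[topk]).mean(0)  # video-level prediction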
Gate-Shift Networks for Video Action Recognition
TLDR
An extensive evaluation of the proposed Gate-Shift Module is performed to study its effectiveness in video action recognition, achieving state-of-the-art results on the Something-Something-V1 and Diving48 datasets, and obtaining competitive results on EPIC-Kitchens with far less model complexity.
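
Gate-Shift builds on exchanging feature channels between neighbouring frames under a learned gate. The function below sketches a generic gated temporal shift to convey that idea; the actual module's gating and placement differ, so treat this as an assumption-laden illustration rather than the paper's design.

import torch

def gated_temporal_shift(x, gate):
    # x: (B, T, C, H, W) clip features; gate: (C,) values in [0, 1].
    shifted = torch.zeros_like(x)
    c = x.size(2) // 2
    shifted[:, 1:, :c] = x[:, :-1, :c]   # first half of channels: shift forward
    shifted[:, :-1, c:] = x[:, 1:, c:]   # second half: shift backward
    g = gate.view(1, 1, -1, 1, 1)
    return g * shifted + (1 - g) * x     # gate mixes shifted and original

x = torch.randn(2, 8, 16, 14, 14)
y = gated_temporal_shift(x, torch.sigmoid(torch.randn(16)))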
PGT: A Progressive Method for Training Models on Long Videos
TLDR
This work proposes to treat videos as serial fragments satisfying the Markov property and to train on them as a whole by progressively propagating information through the temporal dimension in multiple steps, which makes it possible to train on long videos end-to-end with limited resources while ensuring the effective transmission of information.
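
The progressive scheme can be caricatured as: split a long video into serial fragments, optimize on each fragment in turn, and carry state forward detached, so memory stays bounded while information still flows through time. The recurrent stand-in below is an assumption chosen for brevity; PGT itself applies the idea to spatio-temporal video networks.

import torch
import torch.nn as nn

rnn = nn.GRU(input_size=512, hidden_size=512, batch_first=True)
head = nn.Linear(512, 10)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

video = torch.randn(1, 200, 512)     # 200 clip features from one long video
target = torch.tensor([3])
state = None
for frag in video.split(50, dim=1):  # four serial fragments of 50 steps
    out, state = rnn(frag, state)
    loss = nn.functional.cross_entropy(head(out[:, -1]), target)
    opt.zero_grad(); loss.backward(); opt.step()
    state = state.detach()  # Markov-style carry: no backprop across fragments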

References

SHOWING 1-10 OF 65 REFERENCES
Long-Term Temporal Convolutions for Action Recognition
TLDR
It is demonstrated that LTC-CNN models with increased temporal extents improve the accuracy of action recognition; the work also studies the impact of different low-level representations, such as raw video pixel values and optical flow vector fields, and the importance of high-quality optical flow estimation for learning accurate action models.
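
The point of long-term temporal convolutions is that the same 3D convolution covers a far longer temporal extent when applied to longer, lower-resolution clips. A minimal sketch, with sizes roughly in the spirit of the paper but illustrative:

import torch
import torch.nn as nn

conv = nn.Conv3d(3, 64, kernel_size=3, padding=1)
short_clip = torch.randn(1, 3, 16, 112, 112)  # typical short clip
long_clip = torch.randn(1, 3, 60, 58, 58)     # long, spatially reduced clip
print(conv(short_clip).shape)  # (1, 64, 16, 112, 112)
print(conv(long_clip).shape)   # (1, 64, 60, 58, 58): 60-frame extent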
Learnable pooling with Context Gating for video classification
TLDR
This work explores combinations of learnable pooling techniques such as Soft Bag-of-words, Fisher Vectors, NetVLAD, GRU and LSTM to aggregate video features over time, and introduces a learnable non-linear network unit, named Context Gating, aiming at modeling interdependencies between features.
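
Context Gating is compact enough to state directly: the features are reweighted elementwise by a sigmoid gate computed from those same features, CG(x) = sigmoid(Wx + b) * x. A minimal PyTorch rendering; the surrounding pooling pipeline is assumed, not shown.

import torch
import torch.nn as nn

class ContextGating(nn.Module):
    # CG(x) = sigmoid(W x + b) * x, an elementwise self-gating unit.
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.sigmoid(self.fc(x)) * x

pooled = torch.randn(4, 1024)        # e.g., a pooled video descriptor
gated = ContextGating(1024)(pooled)  # same shape, gated features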
Two-Stream Convolutional Networks for Action Recognition in Videos
TLDR
This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
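
The two-stream recipe amounts to late fusion of a spatial network on RGB frames and a temporal network on stacked optical flow. The backbones below are trivial stand-ins; the 10-frame flow stack (20 channels) and the averaging of class scores follow the common two-stream setup.

import torch
import torch.nn as nn

spatial_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 112 * 112, 101))
temporal_net = nn.Sequential(nn.Flatten(), nn.Linear(20 * 112 * 112, 101))

rgb = torch.randn(1, 3, 112, 112)    # a single RGB frame
flow = torch.randn(1, 20, 112, 112)  # 10 stacked (dx, dy) flow fields
scores = (spatial_net(rgb).softmax(-1) + temporal_net(flow).softmax(-1)) / 2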
VideoLSTM convolves, attends and flows for action recognition
TLDR
This work presents a new architecture for end-to-end sequence learning of actions in video, called VideoLSTM, and introduces motion-based attention, which can be used for action localization by relying on just the action class label.
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.
Compressed Video Action Recognition
TLDR
This work proposes to train a deep network directly on compressed video (e.g., H.264 or HEVC), which has a higher information density, and finds that training becomes easier as a result.
Beyond short snippets: Deep networks for video classification
TLDR
This work proposes and evaluates several deep neural network architectures to combine image information across a video over longer time periods than previously attempted, and proposes two methods capable of handling full-length videos.
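
One of the two full-video strategies the summary mentions, temporal pooling of per-frame CNN features, is straightforward to sketch; the LSTM variant would replace the pooling with a recurrent pass. The feature extractor and the class count below are assumptions.

import torch
import torch.nn as nn

frame_feats = torch.randn(1, 120, 2048)     # 120 frames of CNN features
video_feat = frame_feats.max(dim=1).values  # temporal max pooling
logits = nn.Linear(2048, 487)(video_feat)   # e.g., 487 sports classes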
Long-term recurrent convolutional networks for visual recognition and description
TLDR
A novel recurrent convolutional architecture suitable for large-scale visual learning, which is end-to-end trainable, is proposed; it is shown that such models have distinct advantages over state-of-the-art models for recognition or generation that are separately defined and/or optimized.
Temporal Modeling Approaches for Large-scale Youtube-8M Video Understanding
TLDR
This paper describes the solution that ranked 3rd in the video recognition task of the Google Cloud and YouTube-8M Video Understanding Challenge, and demonstrates that the proposed temporal modeling approaches can significantly improve on existing approaches in large-scale video recognition tasks.
Large-Scale Video Classification with Convolutional Neural Networks
TLDR
This work studies multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information and suggests a multiresolution, foveated architecture as a promising way of speeding up the training.