Corpus ID: 235358207

Transformed ROIs for Capturing Visual Transformations in Videos

Abhinav Rai, Fadime Sener, Angela Yao
Modeling the visual changes that an action brings to a scene is critical for video understanding. Currently, CNNs process one local neighbourhood at a time, so contextual relationships over longer ranges, while still learnable, are indirect. We present TROI, a plug-and-play module for CNNs to reason between mid-level feature representations that are otherwise separated in space and time. The module relates localized visual entities such as hands and interacting objects and transforms their… 
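The abstract does not specify how TROI relates the localized entities; one common way to let a fixed set of mid-level features reason about each other is scaled dot-product self-attention over their feature vectors. The sketch below is a minimal, hypothetical NumPy version (all names, shapes, and the residual connection are assumptions, not the authors' implementation):

```python
import numpy as np

def roi_self_attention(rois, w_q, w_k, w_v):
    """Relate N ROI feature vectors (N x d) to one another via scaled
    dot-product self-attention, returning transformed ROI features."""
    q, k, v = rois @ w_q, rois @ w_k, rois @ w_v
    scores = q @ k.T / np.sqrt(k.shape[1])            # (N, N) pairwise affinities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)     # softmax over ROIs
    return rois + weights @ v                         # residual update of each ROI

rng = np.random.default_rng(0)
n, d = 4, 8                        # e.g. 4 ROIs (hands, objects) with 8-dim features
rois = rng.standard_normal((n, d))
w = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]
out = roi_self_attention(rois, *w)
print(out.shape)                   # (4, 8)
```

Each ROI is updated by a weighted sum over all other ROIs, so entities that are far apart in space and time can still exchange information in a single step.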


A Closer Look at Spatiotemporal Convolutions for Action Recognition
A new spatiotemporal convolutional block "R(2+1)D" is designed which produces CNNs that achieve results comparable or superior to the state-of-the-art on Sports-1M, Kinetics, UCF101, and HMDB51.
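The R(2+1)D block factorizes a t×k×k 3D convolution into a 1×k×k spatial convolution followed by a t×1×1 temporal one, choosing the intermediate channel count so the parameter budget matches the full 3D kernel. A back-of-the-envelope check with assumed example sizes (not figures from the paper):

```python
# Parameter count of a full 3D conv vs its (2+1)D factorization
# (bias terms ignored; sizes below are illustrative assumptions).
c_in, c_out, t, k = 64, 64, 3, 3

full_3d = c_in * c_out * t * k * k           # one t x k x k kernel per output channel

# (2+1)D: a 1 x k x k spatial conv into m channels, then a t x 1 x 1 temporal conv.
# m is chosen so the factorized block matches the 3D parameter budget.
m = (t * k * k * c_in * c_out) // (k * k * c_in + t * c_out)
factorized = c_in * m * k * k + m * t * c_out

print(full_3d, factorized, m)                # equal budgets, m = 144 here
```

At equal parameter count, the factorization doubles the number of nonlinearities per block, which is part of the paper's explanation for its gains.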
Videos as Space-Time Region Graphs
The proposed graph representation achieves state-of-the-art results on the Charades and Something-Something datasets, with a substantial gain when the model is applied in complex environments.
Object Level Visual Reasoning in Videos
A model capable of learning to reason about semantically meaningful spatio-temporal interactions in videos is proposed, allowing detailed spatial interactions to be learned at a semantic, object-interaction level.
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.
A2-Nets: Double Attention Networks
This work proposes the "double attention block", a novel component that aggregates and propagates informative global features from the entire spatio-temporal space of input images/videos, enabling subsequent convolution layers to access features from the entire space efficiently.
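The double attention idea can be sketched in two matrix products: a first attention gathers the whole feature map into a small set of global descriptors, and a second attention distributes them back to every position. A hypothetical NumPy version (names and shapes assumed, not the A2-Nets code):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def double_attention(x, w_a, w_b, w_v):
    """x: (n, d) flattened spatio-temporal positions.
    Step 1 gathers k global descriptors; step 2 distributes them."""
    a = softmax(x @ w_a, axis=0)       # (n, k) gathering weights per descriptor
    v = x @ w_v                        # (n, d) value features
    g = a.T @ v                        # (k, d) global descriptors
    b = softmax(x @ w_b, axis=1)       # (n, k) distribution weights per position
    return b @ g                       # (n, d) globally informed features

rng = np.random.default_rng(1)
n, d, k = 6, 8, 3
x = rng.standard_normal((n, d))
out = double_attention(x,
                       rng.standard_normal((d, k)),
                       rng.standard_normal((d, k)),
                       rng.standard_normal((d, d)))
print(out.shape)                       # (6, 8)
```

Because k is small and fixed, the cost is linear in the number of positions n rather than quadratic as in full pairwise attention.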
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification
It is shown that many of the 3D convolutions can be replaced by low-cost 2D convolutions, suggesting that temporal representation learning on high-level "semantic" features is more useful.
LatentGNN: Learning Efficient Non-local Relations for Visual Recognition
This work proposes an efficient yet flexible non-local relation representation based on a novel class of graph neural networks that allows a low-rank representation of the graph affinity matrix and achieves linear computational complexity.
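The linear complexity comes from never materializing the n×n affinity matrix: if the affinity factors through k latent nodes (k ≪ n), aggregation can be done as two thin matrix products. A small illustration under that low-rank assumption (a generic sketch, not the LatentGNN architecture):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 1000, 16, 8              # n positions, d channels, k latent nodes (k << n)
v = rng.standard_normal((n, d))
p = rng.standard_normal((n, k))    # position-to-latent weights (would be learned)

# Full non-local aggregation: build the n x n affinity explicitly -- O(n^2 d).
full = (p @ p.T) @ v

# Low-rank trick: pool into k latent nodes, then propagate back -- O(n k d).
latent = p.T @ v                   # (k, d) latent-node features
low_rank = p @ latent              # (n, d) identical result, linear in n

print(np.allclose(full, low_rank)) # True
```

Associativity of matrix multiplication makes the two computations exactly equal; only the cost differs.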
The “Something Something” Video Database for Learning and Evaluating Visual Common Sense
This work describes the ongoing collection of the “something-something” database of video prediction tasks whose solutions require a common sense understanding of the depicted situation, and describes the challenges in crowd-sourcing this data at scale.
RubiksNet: Learnable 3D-Shift for Efficient Video Action Recognition
This work introduces RubiksNet, a new efficient architecture for video action recognition which is based on a proposed learnable 3D spatiotemporal shift operation instead of a channel-wise shift-based primitive, and analyzes the suitability of the new primitive and explores several novel variations of the approach to enable stronger representational flexibility while maintaining an efficient design.
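RubiksNet's learnable 3D shift is not detailed in this blurb, but the channel-wise shift primitive it generalizes is easy to illustrate: move a fraction of channels one step forward in time, another fraction one step backward, and leave the rest in place. A hypothetical NumPy sketch (not the authors' code; `fold_div` is an assumed name):

```python
import numpy as np

def temporal_shift(x, fold_div=4):
    """Channel-wise temporal shift on x of shape (T, C): the first C/fold_div
    channels shift forward in time, the next C/fold_div shift backward,
    and the rest stay put (vacated frames are zero-padded)."""
    t, c = x.shape
    fold = c // fold_div
    out = np.zeros_like(x)
    out[1:, :fold] = x[:-1, :fold]                   # shift forward in time
    out[:-1, fold:2 * fold] = x[1:, fold:2 * fold]   # shift backward in time
    out[:, 2 * fold:] = x[:, 2 * fold:]              # untouched channels
    return out

x = np.arange(12, dtype=float).reshape(3, 4)  # T=3 frames, C=4 channels
shifted = temporal_shift(x)
print(shifted)
```

The shift itself has zero parameters and zero FLOPs; RubiksNet's contribution is making the shift amounts learnable and extending them to all three spatiotemporal axes.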
More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation
A lightweight, memory-friendly architecture for action recognition that performs on par with or better than current architectures while using only a fraction of their resources; a temporal aggregation module is proposed to model temporal dependencies in a video at very small additional computational cost.