Temporal Dynamic Graph LSTM for Action-Driven Video Object Detection
@article{Yuan2017TemporalDG,
  title   = {Temporal Dynamic Graph LSTM for Action-Driven Video Object Detection},
  author  = {Yuan Yuan and Xiaodan Liang and X. Wang and Dit-Yan Yeung and Abhinav Kumar Gupta},
  journal = {2017 IEEE International Conference on Computer Vision (ICCV)},
  year    = {2017},
  pages   = {1819-1828}
}
In this paper, we investigate a weakly-supervised object detection framework. Most existing frameworks focus on using static images to learn object detectors. However, these detectors often fail to generalize to videos because of the domain shift between images and videos. Therefore, we investigate learning these detectors directly from boring videos of daily activities. Instead of using bounding boxes, we explore the use of action descriptions as supervision, since they are relatively easy to gather. A common…
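The paper builds on the Graph LSTM idea (see the Liang et al. reference below), which replaces the single predecessor of a chain LSTM with an averaged set of graph neighbors and uses one forget gate per neighbor. A minimal NumPy sketch of such a node update, with illustrative weight names and not the paper's exact formulation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def graph_lstm_node_update(x, h_prev, c_prev, neighbors, params):
    """One Graph LSTM-style update for a single node.

    x         : (d_in,)  input feature for this node (e.g. a region descriptor)
    h_prev    : (n, d_h) previous hidden states of all nodes
    c_prev    : (n, d_h) previous cell states of all nodes
    neighbors : list of neighbor node indices for this node
    params    : dict of weight matrices/biases (illustrative names)
    """
    # Average neighbor hidden states: the Graph LSTM device for handling
    # variable-degree nodes, replacing a chain LSTM's single predecessor.
    h_bar = np.mean(h_prev[neighbors], axis=0)

    i = sigmoid(params["Wi"] @ x + params["Ui"] @ h_bar + params["bi"])
    o = sigmoid(params["Wo"] @ x + params["Uo"] @ h_bar + params["bo"])
    g = np.tanh(params["Wg"] @ x + params["Ug"] @ h_bar + params["bg"])

    # One forget gate per neighbor, so each neighbor's memory is weighted
    # individually before being merged into this node's cell state.
    c = i * g
    for k in neighbors:
        f_k = sigmoid(params["Wf"] @ x + params["Uf"] @ h_prev[k] + params["bf"])
        c += f_k * c_prev[k] / len(neighbors)

    h = o * np.tanh(c)
    return h, c

# Toy usage with random weights.
rng = np.random.default_rng(0)
d_in, d_h, n = 4, 3, 5
params = {k: rng.standard_normal((d_h, d_in)) for k in ("Wi", "Wo", "Wg", "Wf")}
params.update({k: rng.standard_normal((d_h, d_h)) for k in ("Ui", "Uo", "Ug", "Uf")})
params.update({k: np.zeros(d_h) for k in ("bi", "bo", "bg", "bf")})
h, c = graph_lstm_node_update(rng.standard_normal(d_in),
                              rng.standard_normal((n, d_h)),
                              rng.standard_normal((n, d_h)),
                              neighbors=[1, 3, 4], params=params)
```

The "temporal dynamic" part of the paper additionally changes which nodes are connected over time; the per-node update above is the graph-structured recurrence it rests on.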
69 Citations
Neural Message Passing on Hybrid Spatio-Temporal Visual and Symbolic Graphs for Video Understanding
- Computer Science, ArXiv
- 2019
The generality of the approach is demonstrated on a variety of tasks, such as temporal subactivity classification and object affordance classification on the CAD-120 dataset and multilabel temporal action localization on the large-scale Charades dataset, where it outperforms existing deep learning approaches using only raw RGB frames.
Activity Driven Weakly Supervised Object Detection
- Computer Science, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2019
This work shows that the action depicted in an image or video can provide strong cues about the location of the associated object; it learns an action-dependent spatial prior for the object and incorporates this prior to jointly train an object detection and action classification model.
Videos as Space-Time Region Graphs
- Computer Science, ECCV
- 2018
The proposed graph representation achieves state-of-the-art results on the Charades and Something-Something datasets and obtains a large gain when the model is applied in complex environments.
Graph-Based Multi-Interaction Network for Video Question Answering
- Computer Science, IEEE Transactions on Image Processing
- 2021
A graph-based relation-aware neural network is proposed to explore a more fine-grained visual representation, which could explore the relationships and dependencies between objects spatially and temporally in videos.
Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions
- Computer Science, 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2021
A contrastive weakly supervised training loss is introduced that jointly associates spatiotemporal regions in a video with an action and object vocabulary, and encourages temporal continuity of the visual appearance of moving objects as a form of self-supervision.
Action Graphs: Weakly-supervised Action Localization with Graph Convolution Networks
- Computer Science, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV)
- 2020
This work presents a novel method that uses graph convolutions to explicitly model similarity between video moments, pushing the state of the art on THUMOS'14, ActivityNet 1.2, and Charades for weakly-supervised action localization.
Weakly Supervised Visual Semantic Parsing
- Computer Science, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
A generalized formulation of scene graph generation (SGG) is proposed, namely Visual Semantic Parsing, which disentangles entity and predicate recognition and enables sub-quadratic performance, along with the first graph-based weakly supervised learning framework, based on a novel graph alignment algorithm, which enables training without bounding box annotations.
ASCNet: Action Semantic Consistent Learning of Arbitrary Progress Levels for Early Action Prediction
- Computer Science
- 2022
A novel Action Semantic Consistent learning network (ASCNet) under the teacher-student framework is proposed for early action prediction, which has achieved state-of-the-art performance on two benchmarks.
Spatial Parsing and Dynamic Temporal Pooling networks for Human-Object Interaction detection
- Computer Science, ArXiv
- 2022
The Spatial Parsing and Dynamic Temporal Pooling (SPDTP) network is introduced, which takes the entire video as a spatio-temporal graph with human and object nodes as input and predicts the difference between interactive and non-interactive pairs through explicit spatial parsing.
Representation Learning on Visual-Symbolic Graphs for Video Understanding
- Computer Science, ECCV
- 2020
A graph neural network is proposed for refining the representations of actors, objects, and their interactions on a hybrid visual-symbolic graph; it goes beyond current approaches, which assume nodes and edges of the same type, operate on graphs with fixed edge weights, and do not use a symbolic graph.
References
Showing 1-10 of 53 references
Video Object Discovery and Co-Segmentation with Extremely Weak Supervision
- Computer Science, IEEE Trans. Pattern Anal. Mach. Intell.
- 2017
The proposed spatio-temporal energy minimization formulation for simultaneous video object discovery and co-segmentation across multiple videos containing irrelevant frames compares favorably with the state of the art across all experiments.
Track and Transfer: Watching Videos to Simulate Strong Human Supervision for Weakly-Supervised Object Detection
- Computer Science, Environmental Science, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
This framework transfers tracked object boxes from weakly-labeled videos to weakly-labeled images to automatically generate pseudo ground-truth boxes, which replace manually annotated bounding boxes; it designs a Hough transform algorithm to vote for the best box to serve as the pseudo ground truth for each image, and uses these boxes to train an object detector.
Semantic Object Parsing with Graph LSTM
- Computer Science, ECCV
- 2016
The Graph Long Short-Term Memory (Graph LSTM) network is proposed, which generalizes LSTM from sequential or multi-dimensional data to general graph-structured data.
Video Summarization with Long Short-Term Memory
- Computer Science, ECCV
- 2016
Long Short-Term Memory (LSTM), a special type of recurrent neural network, is used to model the variable-range dependencies entailed in video summarization; summarization is further improved by reducing the discrepancies in statistical properties across datasets.
Learning object class detectors from weakly annotated video
- Computer Science, 2012 IEEE Conference on Computer Vision and Pattern Recognition
- 2012
It is shown that training from a combination of weakly annotated videos and fully annotated still images using domain adaptation improves the performance of a detector trained from still images alone.
Unsupervised Object Discovery and Segmentation in Videos
- Computer Science, BMVC
- 2013
This work shows how to integrate motion information, in parallel with appearance cues, into a common conditional random field formulation to automatically discover object categories from videos, tremendously easing the task of finding recurring objects compared with searching an unsorted set of images.
You Only Look Once: Unified, Real-Time Object Detection
- Computer Science, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background, and outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.
Weakly Supervised Object Localization with Multi-Fold Multiple Instance Learning
- Computer Science, IEEE Transactions on Pattern Analysis and Machine Intelligence
- 2017
This work follows a multiple-instance learning approach that iteratively trains the detector and infers the object locations in the positive training images and proposes a window refinement method, which improves the localization accuracy by incorporating an objectness prior.
Object Detection from Video Tubelets with Convolutional Neural Networks
- Computer Science, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
This work introduces a complete framework for the VID task based on still-image object detection and general object tracking, and proposes a temporal convolution network that incorporates temporal information to regularize the detection results, showing its effectiveness for the task.
Revealing Event Saliency in Unconstrained Video Collection
- Computer Science, IEEE Transactions on Image Processing
- 2017
This paper proposes an unsupervised event saliency revealing framework that first extracts features from multiple modalities to represent each shot in the given video collection, and systematically compares the method to a number of baseline methods on the TRECVID benchmarks.