Temporal Dynamic Graph LSTM for Action-Driven Video Object Detection

@article{Yuan2017TemporalDG,
  title={Temporal Dynamic Graph LSTM for Action-Driven Video Object Detection},
  author={Yuan Yuan and Xiaodan Liang and Xiaolong Wang and Dit-Yan Yeung and Abhinav Gupta},
  journal={2017 IEEE International Conference on Computer Vision (ICCV)},
  year={2017},
  pages={1819-1828}
}
In this paper, we investigate a weakly-supervised object detection framework. Most existing frameworks focus on using static images to learn object detectors. However, these detectors often fail to generalize to videos because of the existing domain shift. Therefore, we investigate learning these detectors directly from boring videos of daily activities. Instead of using bounding boxes, we explore the use of action descriptions as supervision since they are relatively easy to gather. A common… 
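The supervision setup described in the abstract amounts to multiple-instance learning over region proposals: only a video-level action label is available, so per-region evidence must be pooled into a video-level prediction that the weak label can supervise. The following minimal Python sketch illustrates that pooling idea only; it is not the paper's TD-Graph LSTM model, and the max-over-regions/mean-over-frames pooling and all names (WeakActionSupervision, region_feats) are illustrative assumptions.

import torch
import torch.nn as nn

class WeakActionSupervision(nn.Module):
    # Hypothetical sketch: score candidate object regions per frame, pool
    # them into one video-level action prediction, and train on the weak
    # video-level action label (no bounding boxes involved).
    def __init__(self, feat_dim, num_actions):
        super().__init__()
        self.region_scorer = nn.Linear(feat_dim, num_actions)

    def forward(self, region_feats):
        # region_feats: (num_frames, num_regions, feat_dim)
        scores = self.region_scorer(region_feats)  # per-region action scores
        frame_scores, _ = scores.max(dim=1)        # most action-relevant region per frame
        return frame_scores.mean(dim=0)            # pool over time -> (num_actions,)

# The only supervision is the video-level action label (here, action id 3).
model = WeakActionSupervision(feat_dim=512, num_actions=20)
feats = torch.randn(8, 30, 512)                    # 8 frames, 30 region proposals each
loss = nn.functional.cross_entropy(model(feats).unsqueeze(0), torch.tensor([3]))
loss.backward()

Only the regions that explain the action label receive gradient, which is how weak action supervision can localize the interacting object.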
Citations

Neural Message Passing on Hybrid Spatio-Temporal Visual and Symbolic Graphs for Video Understanding
TLDR
The generality of the approach is demonstrated on a variety of tasks, such as temporal subactivity classification and object affordance classification on the CAD-120 dataset and multilabel temporal action localization on the large-scale Charades dataset, where it outperforms existing deep learning approaches using only raw RGB frames.
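For orientation, one round of neural message passing over such a graph can be sketched as below; the sum-then-normalize aggregation and the GRU-based node update are generic assumptions for illustration, not the specific design of this paper.

import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    # Generic sketch of one message-passing round: each node aggregates
    # transformed neighbor states, then updates its own representation.
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.upd = nn.GRUCell(dim, dim)

    def forward(self, h, adj):
        # h: (num_nodes, dim) node states; adj: (num_nodes, num_nodes) 0/1 adjacency
        messages = adj @ self.msg(h)                    # sum messages over neighbors
        degree = adj.sum(dim=1, keepdim=True).clamp(min=1)
        return self.upd(messages / degree, h)           # gated node update

layer = MessagePassingLayer(dim=64)
h = torch.randn(6, 64)                                  # e.g. actor and object nodes
adj = (torch.rand(6, 6) > 0.5).float()
h = layer(h, adj)                                       # one refinement round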
Activity Driven Weakly Supervised Object Detection
TLDR
This work shows that the action depicted in the image/video can provide strong cues about the location of the associated object and learns a spatial prior for the object dependent on the action, and incorporates this prior to simultaneously train a joint object detection and action classification model.
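The action-dependent spatial prior can be pictured with a small sketch: per-region object scores are re-weighted by how close each box center falls to where the action suggests the object should be. The Gaussian form and all names below are illustrative assumptions, not the paper's exact formulation.

import torch

def action_conditioned_prior(box_scores, box_centers, prior_mean, prior_std):
    # Sketch: down-weight regions far from the action-dependent prior location.
    # box_centers: (num_boxes, 2) normalized (x, y); prior_mean/prior_std: (2,)
    sq_dist = ((box_centers - prior_mean) ** 2 / (2 * prior_std ** 2)).sum(dim=1)
    return box_scores * torch.exp(-sq_dist)

# For "drinking", the prior might concentrate near the upper-center of the frame.
scores = torch.rand(5)
centers = torch.rand(5, 2)
print(action_conditioned_prior(scores, centers,
                               torch.tensor([0.5, 0.3]), torch.tensor([0.2, 0.2])))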
Videos as Space-Time Region Graphs
TLDR
The proposed graph representation achieves state-of-the-art results on the Charades and Something-Something datasets and obtains a huge gain when the model is applied in complex environments.
Graph-Based Multi-Interaction Network for Video Question Answering
TLDR
A graph-based relation-aware neural network is proposed to explore a more fine-grained visual representation, which could explore the relationships and dependencies between objects spatially and temporally in videos.
Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions
TLDR
A contrastive weakly supervised training loss is introduced that jointly associates spatiotemporal regions in a video with an action and object vocabulary and encourages temporal continuity of the visual appearance of moving objects as a form of self-supervision.
Action Graphs: Weakly-supervised Action Localization with Graph Convolution Networks
TLDR
This work presents a novel method that uses graph convolutions to explicitly model similarity between video moments and pushes the state of the art on THUMOS’14, ActivityNet 1.2, and Charades for weakly-supervised action localization.
Weakly Supervised Visual Semantic Parsing
TLDR
A generalized formulation of scene graph generation (SGG), namely Visual Semantic Parsing, is proposed, which disentangles entity and predicate recognition and enables sub-quadratic performance; also proposed is the first graph-based weakly supervised learning framework, based on a novel graph alignment algorithm, which enables training without bounding box annotations.
ASCNet: Action Semantic Consistent Learning of Arbitrary Progress Levels for Early Action Prediction
TLDR
A novel Action Semantic Consistent learning network (ASCNet) under the teacher-student framework is proposed for early action prediction, achieving state-of-the-art performance on two benchmarks.
Spatial Parsing and Dynamic Temporal Pooling networks for Human-Object Interaction detection
TLDR
The Spatial Parsing and Dynamic Temporal Pooling (SPDTP) network is introduced, which takes the entire video as a spatio-temporal graph with human and object nodes as input and predicts the difference between interactive and non-interactive pairs through explicit spatial parsing.
Representation Learning on Visual-Symbolic Graphs for Video Understanding
TLDR
A graph neural network for refining the representations of actors, objects and their interactions on the resulting hybrid graph goes beyond current approaches that assume nodes and edges are of the same type, operate on graphs with fixed edge weights and do not use a symbolic graph.
...

References

Showing 1-10 of 53 references
Video Object Discovery and Co-Segmentation with Extremely Weak Supervision
TLDR
The proposed spatio-temporal energy minimization formulation for simultaneous video object discovery and co-segmentation across multiple videos containing irrelevant frames compares favorably with the state of the art.
Track and Transfer: Watching Videos to Simulate Strong Human Supervision for Weakly-Supervised Object Detection
TLDR
This framework transfers tracked object boxes from weakly-labeled videos to weakly-labeled images to automatically generate pseudo ground-truth boxes that replace manually annotated bounding boxes; a Hough transform algorithm votes for the best box to serve as the pseudo ground truth for each image, and these boxes are then used to train an object detector.
Semantic Object Parsing with Graph LSTM
TLDR
The Graph Long Short-Term Memory (Graph LSTM) network is proposed, which generalizes LSTMs from sequential and multi-dimensional data to general graph-structured data.
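A rough sketch of a Graph LSTM-style node update is shown below, assuming averaged neighbor hidden states and a single shared forget gate; the published cell is richer (e.g., confidence-driven node update ordering and adaptive per-neighbor forget gates), so treat this only as a simplified illustration.

import torch
import torch.nn as nn

class GraphLSTMCell(nn.Module):
    # Simplified Graph LSTM update for one node: gates are computed from the
    # node input and the average of its neighbors' hidden states.
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.gates = nn.Linear(in_dim + hid_dim, 4 * hid_dim)

    def forward(self, x, h_neighbors, c_prev):
        # x: (in_dim,) node input; h_neighbors: (k, hid_dim); c_prev: (hid_dim,)
        h_avg = h_neighbors.mean(dim=0)                 # aggregate the neighborhood
        i, f, o, g = self.gates(torch.cat([x, h_avg])).chunk(4)
        c = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

cell = GraphLSTMCell(in_dim=16, hid_dim=32)
h, c = cell(torch.randn(16), torch.randn(5, 32), torch.zeros(32))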
Video Summarization with Long Short-Term Memory
TLDR
Long Short-Term Memory (LSTM), a special type of recurrent neural network, is used to model the variable-range dependencies entailed in video summarization, improving summaries by reducing discrepancies in statistical properties across datasets.
Learning object class detectors from weakly annotated video
TLDR
It is shown that training from a combination of weakly annotated videos and fully annotated still images using domain adaptation improves the performance of a detector trained from still images alone.
Unsupervised Object Discovery and Segmentation in Videos
TLDR
This work shows how to integrate motion information in parallel with appearance cues into a common conditional random field formulation to automatically discover object categories from videos, tremendously easing the task of finding recurring objects over an unsorted set of images.
You Only Look Once: Unified, Real-Time Object Detection
TLDR
Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background, and outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.
Weakly Supervised Object Localization with Multi-Fold Multiple Instance Learning
TLDR
This work follows a multiple-instance learning approach that iteratively trains the detector and infers the object locations in the positive training images and proposes a window refinement method, which improves the localization accuracy by incorporating an objectness prior.
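The multi-fold trick is the essential part: windows in each fold are re-localized by a model trained only on the other folds, so the detector cannot trivially re-select the windows it was trained on. The sketch below substitutes a placeholder mean-feature scorer for a real detector, so the training step is purely illustrative.

import numpy as np

def multifold_mil(window_feats, num_folds=3, num_iters=5, seed=0):
    # window_feats: list of (num_windows, feat_dim) arrays, one per positive image.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(window_feats)), num_folds)
    selected = np.zeros(len(window_feats), dtype=int)  # current window per image
    for _ in range(num_iters):
        for k, held_out in enumerate(folds):
            train = np.concatenate([f for j, f in enumerate(folds) if j != k])
            # Placeholder "detector": mean feature of the currently selected windows.
            w = np.mean([window_feats[i][selected[i]] for i in train], axis=0)
            for i in held_out:                         # re-localize held-out images
                selected[i] = int(np.argmax(window_feats[i] @ w))
    return selected

# Toy usage: 12 positive images, each with 20 candidate windows of dimension 64.
feats = [np.random.randn(20, 64) for _ in range(12)]
print(multifold_mil(feats))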
Object Detection from Video Tubelets with Convolutional Neural Networks
TLDR
This work introduces a complete framework for the VID task based on still-image object detection and general object tracking, and proposes a temporal convolution network to incorporate temporal information to regularize the detection results and shows its effectiveness for the task.
Revealing Event Saliency in Unconstrained Video Collection
TLDR
This paper proposes an unsupervised event saliency revealing framework that first extracts features from multiple modalities to represent each shot in the given video collection, and systematically compares the method to a number of baseline methods on the TRECVID benchmarks.
...