Flow-Guided Feature Aggregation for Video Object Detection

@inproceedings{Zhu2017FlowGuidedFA,
  title={Flow-Guided Feature Aggregation for Video Object Detection},
  author={Xizhou Zhu and Yujie Wang and Jifeng Dai and Lu Yuan and Yichen Wei},
  booktitle={2017 IEEE International Conference on Computer Vision (ICCV)},
  year={2017},
  pages={408--417}
}
Extending state-of-the-art object detectors from image to video is challenging. Detection accuracy suffers from deteriorated object appearances in videos, e.g., motion blur, video defocus, rare poses, etc. Existing work attempts to exploit temporal information at the box level, but such methods are not trained end-to-end. We present flow-guided feature aggregation, an accurate and end-to-end learning framework for video object detection. It leverages temporal coherence at the feature level… 
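The feature-level aggregation described in the abstract can be sketched in a few lines. This is a toy illustration, not the paper's implementation: plain Python lists stand in for flow-warped per-frame feature maps, and a softmax over cosine similarity to the reference frame stands in for FGFA's learned embedding-based adaptive weights.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def aggregate(ref_feat, warped_feats):
    """Aggregate flow-warped nearby-frame features onto the reference frame.

    Adaptive weights: softmax over cosine similarity to the reference
    feature (a stand-in for FGFA's learned embedding similarity).
    """
    sims = [cosine(ref_feat, f) for f in warped_feats]
    exps = [math.exp(s) for s in sims]
    z = sum(exps)
    weights = [e / z for e in exps]
    agg = [sum(w * f[i] for w, f in zip(weights, warped_feats))
           for i in range(len(ref_feat))]
    return agg, weights
```

The effect: a degraded frame (motion blur, defocus) whose warped features diverge from the reference gets a lower weight, so sharper nearby frames dominate the aggregated feature.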

Citations

Learning Where to Focus for Efficient Video Object Detection
TLDR
A novel module called Learnable Spatio-Temporal Sampling (LSTS) is proposed to accurately learn semantic-level correspondences among adjacent-frame features, achieving state-of-the-art performance on the ImageNet VID dataset with lower computational complexity and real-time speed.
Temporal Meta-Adaptor for Video Object Detection
TLDR
This paper proposes to summarise the temporal feature into a fixed-size representation, which is then used to make the backbone adaptively generate discriminative features for low- and high-quality frames.
Video Object Detection via Object-Level Temporal Aggregation
TLDR
A detection model is applied on sparse keyframes to handle new objects, occlusions, and rapid motions, and is then used to exploit temporal cues and track the detected objects in the remaining frames, which enhances efficiency and temporal coherence.
Temporal Context Enhanced Feature Aggregation for Video Object Detection
TLDR
This paper proposes a temporal context enhanced network (TCENet) to exploit temporal context information by temporal aggregation for video object detection, and a temporal stride predictor is proposed to adaptively select video frames for aggregation, which facilitates exploiting variable temporal information.
Adaptive Feature Aggregation for Video Object Detection
TLDR
This paper proposes an adaptive feature aggregation method for video object detection that introduces an adaptive quality-similarity weight, together with a sparse-and-dense temporal aggregation policy, into the model, and consistently demonstrates better performance.
Object Detection in Video with Spatial-temporal Context Aggregation
TLDR
This work proposes a simple but effective feature aggregation framework that operates at the object-proposal level and learns to enhance each proposal's feature by modeling semantic and spatio-temporal relationships among object proposals, both within a frame and across adjacent frames.
Aggregating Motion and Attention for Video Object Detection
TLDR
This paper proposes an Attention-Based Temporal Context module (ABTC) for more accurate frame alignment and shows that the proposed framework performs favorably against state-of-the-art methods.
Fully Motion-Aware Network for Video Object Detection
TLDR
An end-to-end model called fully motion-aware network (MANet) is proposed, which jointly calibrates object features at both the pixel level and the instance level in a unified framework and achieves leading performance on the large-scale ImageNet VID dataset.
CompFeat: Comprehensive Feature Aggregation for Video Instance Segmentation
TLDR
This work proposes a novel comprehensive feature aggregation approach (CompFeat) to refine features at both frame-level and object-level with temporal and spatial context information to eliminate ambiguities introduced by only using single-frame features.

References

Showing 1-10 of 58 references
Seq-NMS for Video Object Detection
TLDR
It is shown that the proposed modification of the post-processing phase that uses high-scoring object detections from nearby frames to boost scores of weaker detections within the same clip obtains superior results to state-of-the-art single image object detection techniques.
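The score-boosting idea behind Seq-NMS can be sketched as a forward pass in which each detection inherits the best boosted score of any overlapping detection in the previous frame. This is a deliberate simplification (the paper selects the highest-scoring linked sequence by dynamic programming and rescores the whole sequence), and the box/score tuples below are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def boost_scores(frames, thr=0.5):
    """frames: per-frame lists of (box, score) detections within one clip.

    Each detection's score is raised to the best boosted score of any
    detection in the previous frame that overlaps it by at least `thr`,
    so a confident detection props up weaker ones along its track.
    """
    boosted = [[score for _, score in frame] for frame in frames]
    for t in range(1, len(frames)):
        for i, (box, score) in enumerate(frames[t]):
            best_prev = max(
                (boosted[t - 1][j]
                 for j, (pbox, _) in enumerate(frames[t - 1])
                 if iou(box, pbox) >= thr),
                default=0.0,
            )
            boosted[t][i] = max(score, best_prev)
    return boosted
```

A single high-confidence detection early in the clip thus lifts the scores of overlapping low-confidence detections in later frames, which is exactly the post-processing-level use of temporal information that FGFA contrasts with its feature-level approach.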
Object Detection from Video Tubelets with Convolutional Neural Networks
TLDR
This work introduces a complete framework for the VID task based on still-image object detection and general object tracking, and proposes a temporal convolution network to incorporate temporal information to regularize the detection results and shows its effectiveness for the task.
T-CNN: Tubelets With Convolutional Neural Networks for Object Detection From Videos
TLDR
A deep learning framework called T-CNN is proposed that incorporates temporal and contextual information from tubelets obtained in videos, dramatically improving the baseline performance of existing still-image detection frameworks when they are applied to videos.
Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation
TLDR
This paper proposes a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%.
Deep Feature Flow for Video Recognition
TLDR
Deep feature flow is presented, a fast and accurate framework for video recognition that runs the expensive convolutional sub-network only on sparse key frames and propagates their deep feature maps to other frames via a flow field, achieving significant speedup since flow computation is relatively fast.
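The keyframe-plus-propagation scheme can be sketched with a 1-D toy: an "expensive" feature extractor runs only every few frames, and intermediate frames reuse the keyframe's features warped by an integer flow shift. All names and the doubling "network" are illustrative stand-ins; the paper warps 2-D feature maps bilinearly with a dense flow field estimated by a flow network.

```python
def expensive_features(frame):
    """Stand-in for the costly convolutional sub-network (here: doubling)."""
    return [2 * x for x in frame]

def warp(feat, shift):
    """Toy 1-D warp by an integer flow shift with edge clamping."""
    n = len(feat)
    return [feat[min(max(i - shift, 0), n - 1)] for i in range(n)]

def deep_feature_flow(frames, key_every=3, shift=1):
    """Run the expensive path only every `key_every` frames; frames in
    between reuse the last keyframe's features via the cheap flow warp."""
    feats, key = [], None
    for t, frame in enumerate(frames):
        if t % key_every == 0:
            key = expensive_features(frame)  # sparse, expensive path
            feats.append(key)
        else:
            feats.append(warp(key, shift))   # dense, cheap path
    return feats
```

The speedup comes from the ratio of expensive to cheap frames: with `key_every=10`, the heavy network runs on a tenth of the frames while every frame still receives a feature map.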
R-FCN: Object Detection via Region-based Fully Convolutional Networks
TLDR
This work presents region-based, fully convolutional networks for accurate and efficient object detection, and proposes position-sensitive score maps to address a dilemma between translation-invariance in image classification and translation-variance in object detection.
Visual Tracking with Fully Convolutional Networks
TLDR
An in-depth study of the properties of CNN features pre-trained offline on massive image data for the ImageNet classification task shows that the proposed tracker significantly outperforms the state of the art.
SSD: Single Shot MultiBox Detector
TLDR
The approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location, which makes SSD easy to train and straightforward to integrate into systems that require a detection component.
AdaScan: Adaptive Scan Pooling in Deep Convolutional Neural Networks for Human Action Recognition in Videos
TLDR
The proposed pooling method is shown to consistently improve on baseline pooling methods, with both RGB- and optical-flow-based convolutional networks, and in combination with complementary video representations.
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
TLDR
This work introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals, and further merges RPN and Fast R-CNN into a single network by sharing their convolutional features.