Flow-Guided Feature Aggregation for Video Object Detection
@article{Zhu2017FlowGuidedFA,
  title   = {Flow-Guided Feature Aggregation for Video Object Detection},
  author  = {Xizhou Zhu and Yujie Wang and Jifeng Dai and Lu Yuan and Yichen Wei},
  journal = {2017 IEEE International Conference on Computer Vision (ICCV)},
  year    = {2017},
  pages   = {408-417}
}
Extending state-of-the-art object detectors from image to video is challenging. The accuracy of detection suffers from degenerated object appearances in videos, e.g., motion blur, video defocus, rare poses, etc. Existing work attempts to exploit temporal information on box level, but such methods are not trained end-to-end. We present flow-guided feature aggregation, an accurate and end-to-end learning framework for video object detection. It leverages temporal coherence on feature level…
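The abstract describes aggregating flow-warped features from nearby frames into the reference frame with adaptive, per-position weights. A minimal NumPy sketch of that idea, under simplifying assumptions: nearest-neighbour warping stands in for the bilinear warping of the paper, and raw cosine similarity stands in for the learned embedding network that produces the adaptive weights.

```python
import numpy as np

def warp(feature, flow):
    """Nearest-neighbour backward warp of a (C, H, W) feature map by a (2, H, W) flow field."""
    C, H, W = feature.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Sample source positions displaced by the flow, clipped to the map borders.
    src_y = np.clip(np.round(ys + flow[1]).astype(int), 0, H - 1)
    src_x = np.clip(np.round(xs + flow[0]).astype(int), 0, W - 1)
    return feature[:, src_y, src_x]

def aggregate(ref_feat, nearby_feats, flows, eps=1e-8):
    """Average flow-warped nearby features, weighted per position by
    cosine similarity to the reference-frame features."""
    warped = [warp(f, fl) for f, fl in zip(nearby_feats, flows)]
    weights = []
    for w in warped:
        cos = (w * ref_feat).sum(0) / (
            np.linalg.norm(w, axis=0) * np.linalg.norm(ref_feat, axis=0) + eps)
        weights.append(cos)
    weights = np.stack(weights)                          # (T, H, W)
    weights = np.exp(weights) / np.exp(weights).sum(0)   # softmax over frames, per position
    return (weights[:, None] * np.stack(warped)).sum(0)  # (C, H, W)
```

Positions where a warped frame agrees with the reference (sharp, well-aligned appearance) thus contribute more than positions degraded by blur or misalignment.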
374 Citations
Learning Where to Focus for Efficient Video Object Detection
- Computer Science, ECCV
- 2020
A novel module called Learnable Spatio-Temporal Sampling (LSTS) has been proposed to learn semantic-level correspondences among adjacent frame features accurately and achieves state-of-the-art performance on the ImageNet VID dataset with less computational complexity and real-time speed.
Temporal Meta-Adaptor for Video Object Detection
- Computer Science, BMVC
- 2021
This paper proposes to summarise the temporal feature into a fixed size representation, which is then used to make the backbone generate adaptively discriminative features for low and high quality frames.
Video Object Detection via Object-Level Temporal Aggregation
- Computer Science, ECCV
- 2020
A detection model is applied on sparse keyframes to handle new objects, occlusions, and rapid motions and then used to exploit temporal cues and track the detected objects in the remaining frames, which enhances efficiency and temporal coherence.
Temporal Context Enhanced Feature Aggregation for Video Object Detection
- Computer Science, AAAI
- 2020
This paper proposes a temporal context enhanced network (TCENet) to exploit temporal context information by temporal aggregation for video object detection, and a temporal stride predictor is proposed to adaptively select video frames for aggregation, which facilitates exploiting variable temporal information.
Adaptive Feature Aggregation for Video Object Detection
- Computer Science, 2020 IEEE Winter Applications of Computer Vision Workshops (WACVW)
- 2020
This paper proposes an adaptive feature aggregation method for video object detection that introduces an adaptive quality-similarity weight, with a sparse and dense temporal aggregation policy, into the model, and consistently demonstrates better performance.
Video object detection for autonomous driving: Motion-aid feature calibration
- Computer Science, Neurocomputing
- 2020
Object Detection in Video with Spatial-temporal Context Aggregation
- Computer Science, ArXiv
- 2019
This work proposes a simple but effective feature aggregation framework which operates on the object proposal-level which learns to enhance each proposal's feature via modeling semantic and spatio-temporal relationships among object proposals from both within a frame and across adjacent frames.
Aggregating Motion and Attention for Video Object Detection
- Computer Science, ACPR
- 2019
This paper proposes an Attention-Based Temporal Context module (ABTC) for more accurate frame alignments and shows that the proposed framework performs favorable against the state-of-the-art methods.
Fully Motion-Aware Network for Video Object Detection
- Computer Science, ECCV
- 2018
An end-to-end model called fully motion-aware network (MANet), which jointly calibrates the features of objects on both pixel-level and instance-level in a unified framework, which achieves leading performance on the large-scale ImageNet VID dataset.
CompFeat: Comprehensive Feature Aggregation for Video Instance Segmentation
- Computer Science, AAAI
- 2021
This work proposes a novel comprehensive feature aggregation approach (CompFeat) to refine features at both frame-level and object-level with temporal and spatial context information to eliminate ambiguities introduced by only using single-frame features.
References
SHOWING 1-10 OF 58 REFERENCES
Seq-NMS for Video Object Detection
- Computer Science, ArXiv
- 2016
The proposed modification of the post-processing phase, which uses high-scoring object detections from nearby frames to boost the scores of weaker detections within the same clip, is shown to obtain superior results to state-of-the-art single-image object detection techniques.
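The rescoring idea above — borrow confidence from strong detections in nearby frames — can be sketched as follows. This is a deliberately simplified, hypothetical stand-in: the real Seq-NMS links detections into maximum-score paths across the whole clip via dynamic programming, whereas here each detection only borrows the best overlapping score from the previous frame.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def seq_rescore(frames, iou_thr=0.5):
    """frames: list (per frame) of (box, score) detections.
    Greedily boost each detection's score with the best score of an
    overlapping detection in the previous (already rescored) frame."""
    out = [list(frames[0])]
    for dets in frames[1:]:
        prev = out[-1]
        rescored = []
        for box, score in dets:
            linked = [ps for pb, ps in prev if iou(box, pb) >= iou_thr]
            rescored.append((box, max([score] + linked)))
        out.append(rescored)
    return out
```

A weak detection of a momentarily blurred object thus inherits the confidence of the same object detected sharply one frame earlier.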
Object Detection from Video Tubelets with Convolutional Neural Networks
- Computer Science, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
This work introduces a complete framework for the VID task based on still-image object detection and general object tracking, and proposes a temporal convolution network to incorporate temporal information to regularize the detection results and shows its effectiveness for the task.
T-CNN: Tubelets With Convolutional Neural Networks for Object Detection From Videos
- Computer Science, IEEE Transactions on Circuits and Systems for Video Technology
- 2018
A deep learning framework, called T-CNN, is proposed that incorporates temporal and contextual information from tubelets obtained in videos, dramatically improving the baseline performance of existing still-image detection frameworks when they are applied to videos.
Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation
- Computer Science, 2014 IEEE Conference on Computer Vision and Pattern Recognition
- 2014
This paper proposes a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%.
Deep Feature Flow for Video Recognition
- Computer Science, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
Deep feature flow is presented, a fast and accurate framework for video recognition that runs the expensive convolutional sub-network only on sparse key frames and propagates their deep feature maps to other frames via a flow field and achieves significant speedup as flow computation is relatively fast.
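The scheme above — run the expensive convolutional sub-network only on sparse key frames and propagate its features to the frames in between via flow — can be sketched as follows. `backbone` and `flow_fn` are placeholder callables (any feature extractor and any flow estimator), and the nearest-neighbour warp is a simplification of the bilinear warping used in practice.

```python
import numpy as np

def warp_nn(feature, flow):
    """Nearest-neighbour warp of a (C, H, W) feature map by a (2, H, W) flow."""
    C, H, W = feature.shape
    ys, xs = np.mgrid[0:H, 0:W]
    sy = np.clip(np.round(ys + flow[1]).astype(int), 0, H - 1)
    sx = np.clip(np.round(xs + flow[0]).astype(int), 0, W - 1)
    return feature[:, sy, sx]

def deep_feature_flow(frames, backbone, flow_fn, key_interval=10):
    """Run the costly backbone only on key frames; propagate its features
    to intermediate frames by warping with the estimated optical flow."""
    feats, key_frame, key_feat = [], None, None
    for i, frame in enumerate(frames):
        if i % key_interval == 0:
            key_frame, key_feat = frame, backbone(frame)  # expensive, sparse
            feats.append(key_feat)
        else:
            flow = flow_fn(key_frame, frame)              # cheap, dense
            feats.append(warp_nn(key_feat, flow))
    return feats
```

Since flow estimation is much cheaper than the full backbone, the per-frame cost drops roughly in proportion to `key_interval`.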
R-FCN: Object Detection via Region-based Fully Convolutional Networks
- Computer Science, NIPS
- 2016
This work presents region-based, fully convolutional networks for accurate and efficient object detection, and proposes position-sensitive score maps to address a dilemma between translation-invariance in image classification and translation-variance in object detection.
Visual Tracking with Fully Convolutional Networks
- Computer Science, 2015 IEEE International Conference on Computer Vision (ICCV)
- 2015
An in-depth study on the properties of CNN features pre-trained offline on massive image data and the classification task on ImageNet shows that the proposed tracker outperforms the state-of-the-art significantly.
SSD: Single Shot MultiBox Detector
- Computer Science, ECCV
- 2016
The approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location, which makes SSD easy to train and straightforward to integrate into systems that require a detection component.
AdaScan: Adaptive Scan Pooling in Deep Convolutional Neural Networks for Human Action Recognition in Videos
- Computer Science, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
The proposed pooling method is shown to consistently improve on baseline pooling methods, with both RGB- and optical-flow-based convolutional networks, and in combination with complementary video representations.
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
- Computer Science, IEEE Transactions on Pattern Analysis and Machine Intelligence
- 2015
This work introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals, and further merges RPN and Fast R-CNN into a single network by sharing their convolutional features.