Detect to Track and Track to Detect

@article{Feichtenhofer2017DetectTT,
  title={Detect to Track and Track to Detect},
  author={Christoph Feichtenhofer and Axel Pinz and Andrew Zisserman},
  journal={2017 IEEE International Conference on Computer Vision (ICCV)},
  year={2017},
  pages={3057--3065}
}
Recent approaches for high-accuracy detection and tracking of object categories in video consist of complex multistage solutions that become more cumbersome each year. […] Our contributions are threefold: (i) we set up a ConvNet architecture for simultaneous detection and tracking, using a multi-task objective for frame-based object detection and across-frame track regression; (ii) we introduce correlation features that represent object co-occurrences across time to aid the ConvNet during tracking…
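The correlation features in contribution (ii) can be illustrated with a small sketch. It assumes that correlation means a dot product between the feature vector at one location in frame t and the feature vectors at nearby displaced locations in a later frame; the function name `local_correlation`, the `max_disp` parameter, and the zero padding are illustrative choices, not details taken from the paper.

```python
import numpy as np

def local_correlation(feat_t, feat_next, max_disp=2):
    """Illustrative local correlation between two feature maps (hypothetical helper).

    feat_t, feat_next: arrays of shape (C, H, W) from two frames.
    Returns an array of shape (K, K, H, W) with K = 2 * max_disp + 1, where
    entry [dy, dx, y, x] is the dot product of feat_t[:, y, x] with
    feat_next[:, y + dy - max_disp, x + dx - max_disp].
    """
    c, h, w = feat_t.shape
    k = 2 * max_disp + 1
    # Zero-pad the later frame so displaced lookups near the border stay in range.
    padded = np.pad(feat_next, ((0, 0), (max_disp, max_disp), (max_disp, max_disp)))
    out = np.zeros((k, k, h, w), dtype=feat_t.dtype)
    for dy in range(k):
        for dx in range(k):
            # Window of the later frame shifted by (dy - max_disp, dx - max_disp).
            shifted = padded[:, dy:dy + h, dx:dx + w]
            # Channel-wise dot product at every spatial location.
            out[dy, dx] = (feat_t * shifted).sum(axis=0)
    return out
```

Under these assumptions, a strong response at offset (dy, dx) suggests that the content at (y, x) moved by that displacement between the two frames, which is the kind of signal a track regressor could consume.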

Learning to Track Object Position through Occlusion
TLDR
This work proposes a ‘tracking-by-detection’ approach that builds upon the success of region-based video object detectors and uses a novel recurrent computational unit at its core that enables long-term propagation of object features even under occlusion.
TrackNet: Simultaneous Object Detection and Tracking and Its Application in Traffic Video Analysis
TLDR
A novel network structure named TrackNet is proposed that extends the Faster R-CNN framework to directly detect a 3D tube enclosing a moving object in a video segment, and is applicable for detecting and tracking any object.
Joint Detection and Online Multi-object Tracking
TLDR
This work proposes a multiple object tracking method that jointly performs detection and tracking in a single neural network architecture, adapting the Single Shot MultiBox Detector to feed single-frame detections to a recurrent neural network (RNN), which combines detections into tracks.
DEFT: Detection Embeddings for Tracking
TLDR
This paper proposes an efficient joint detection and tracking model named DEFT, or “Detection Embeddings for Tracking”, which relies on an appearance-based object matching network jointly learned with an underlying object detection network.
Simultaneous Detection and Tracking with Motion Modelling for Multiple Object Tracking
TLDR
This work introduces the Deep Motion Modeling Network (DMM-Net), which can estimate multiple objects' motion parameters to perform joint detection and association in an end-to-end manner, demonstrates the suitability of Omni-MOT for deep learning with DMM-Net, and makes the source code of the network public.
Detect or Track: Towards Cost-Effective Video Object Detection/Tracking
TLDR
A scheduler network that decides whether to detect or track at a given frame is proposed as a generalization of Siamese trackers; it is more effective than the frame-skipping baselines and flow-based approaches in video object detection/tracking.
DeTracker: A Joint Detection and Tracking Framework
TLDR
DeTracker is introduced, a truly joint detection and tracking network that enforces intra-batch temporal consistency of features by applying a triplet loss over tracklets, so that the features of tracklets with different identities form separate clusters in the feature space.
Real-Time Online Multi-Object Tracking: A Joint Detection and Tracking Framework
TLDR
A joint detection and tracking framework is proposed with a unified confidence scoring function that evaluates track confidence and complements low-confidence detections with high-confidence tracks, so that detections and tracks can be combined organically and complement each other.
TDIOT: Target-Driven Inference for Deep Video Object Tracking
TLDR
The proposed single-object tracker, TDIOT, applies appearance similarity-based temporal matching for data association and incorporates a local search and matching module into the inference head layer that exploits SiamFC for short-term tracking.

References

You Only Look Once: Unified, Real-Time Object Detection
TLDR
Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background, and outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.
Visual Tracking with Fully Convolutional Networks
TLDR
An in-depth study of the properties of CNN features pre-trained offline on massive image data and the ImageNet classification task shows that the proposed tracker significantly outperforms the state of the art.
Hierarchical Convolutional Features for Visual Tracking
TLDR
This paper adaptively learns correlation filters on each convolutional layer to encode the target appearance and hierarchically infers the maximum response of each layer to locate targets.
Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation
TLDR
This paper proposes a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%.
Object Detection from Video Tubelets with Convolutional Neural Networks
TLDR
This work introduces a complete framework for the VID task based on still-image object detection and general object tracking, and proposes a temporal convolution network to incorporate temporal information to regularize the detection results and shows its effectiveness for the task.
Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos
TLDR
A huge leap forward in action detection performance is achieved, with gains of 20% and 11% in mAP reported on the UCF-101 and J-HMDB-21 datasets respectively when compared to the state of the art.
R-FCN: Object Detection via Region-based Fully Convolutional Networks
TLDR
This work presents region-based, fully convolutional networks for accurate and efficient object detection, and proposes position-sensitive score maps to address a dilemma between translation-invariance in image classification and translation-variance in object detection.
Object Detection in Videos with Tubelet Proposal Networks
TLDR
A framework for object detection in videos is proposed, which consists of a novel tubelet proposal network to efficiently generate spatiotemporal proposals, and a Long Short-Term Memory network that incorporates temporal information from tubelet proposals to achieve high object detection accuracy in videos.
SSD: Single Shot MultiBox Detector
TLDR
The approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location, which makes SSD easy to train and straightforward to integrate into systems that require a detection component.
Unsupervised Object Discovery and Tracking in Video Collections
TLDR
This paper addresses the problem of automatically localizing dominant objects as spatio-temporal tubes in a noisy collection of videos with minimal or even no supervision as a combination of two complementary processes: discovery and tracking.