• Corpus ID: 210064496

An Analysis of Object Representations in Deep Visual Trackers

Ross Goroshin, Jonathan Tompson, Debidatta Dwibedi
Fully convolutional deep correlation networks are integral components of state-of-the-art approaches to single object visual tracking. It is commonly assumed that these networks perform tracking by detection by matching features of the object instance with features of the entire frame. Strong architectural priors and conditioning on the object representation are thought to encourage this tracking strategy. Despite these strong priors, we show that deep trackers often default to tracking by… 
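
The matching step that the abstract attributes to correlation networks, comparing features of the object instance against features of the entire frame, can be sketched roughly as below. This is only an illustration under stated assumptions, not the architecture from the paper: the feature maps are plain NumPy arrays standing in for the output of a convolutional backbone, and the names `cross_correlate` and `locate` are hypothetical.

```python
import numpy as np

def cross_correlate(template, search):
    """Slide a (C, h, w) template over a (C, H, W) search feature map.

    Returns an (H - h + 1, W - w + 1) response map whose entries are the
    inner product of the template with each same-sized window of `search`.
    Both inputs are assumed to be features from some backbone network.
    """
    _, h, w = template.shape
    _, H, W = search.shape
    scores = np.empty((H - h + 1, W - w + 1))
    for i in range(H - h + 1):
        for j in range(W - w + 1):
            scores[i, j] = np.sum(template * search[:, i:i + h, j:j + w])
    return scores

def locate(template, search):
    """Return the (row, col) of the window that best matches the template."""
    scores = cross_correlate(template, search)
    return np.unravel_index(np.argmax(scores), scores.shape)
```

In this sketch the predicted object position is simply the peak of the response map; trackers built on this idea typically compute the same correlation with a convolution layer instead of explicit loops.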


Learning Target-aware Representation for Visual Tracking via Informative Interactions

A novel backbone architecture (InBN) with multiple branch-wise interactions inside Siamese-like backbone networks, which injects target information into different stages of the backbone and improves target perception in the candidate feature representation at negligible computational cost.



Fully-Convolutional Siamese Networks for Object Tracking

A basic tracking algorithm is equipped with a novel fully-convolutional Siamese network trained end-to-end on the ILSVRC15 dataset for object detection in video and achieves state-of-the-art performance in multiple benchmarks.

Learning Multi-domain Convolutional Neural Networks for Visual Tracking

A novel visual tracking algorithm based on the representations from a discriminatively trained Convolutional Neural Network using a large set of videos with tracking ground-truths to obtain a generic target representation.

Incremental Learning for Visual Tracking

This paper presents an efficient and effective online algorithm that incrementally learns and adapts a low dimensional eigenspace representation to reflect appearance changes of the target, thereby facilitating the tracking task.

Learning to Track at 100 FPS with Deep Regression Networks

This work proposes a method for offline training of neural networks that can track novel objects at test time at 100 fps, significantly faster than previous neural-network trackers, which are typically too slow for real-time applications.

Detect to Track and Track to Detect

This paper sets up a ConvNet architecture for simultaneous detection and tracking, using a multi-task objective for frame-based object detection and across-frame track regression, and introduces correlation features that represent object co-occurrences across time to aid the ConvNet during tracking.

Online Object Tracking: A Benchmark

Large scale experiments are carried out with various evaluation criteria to identify effective approaches for robust tracking and provide potential future research directions in this field.

On the connections between saliency and tracking

This work identifies three main predictions that must hold if the saliency hypothesis for tracking were true, and shows that the third prediction holds by constructing a common neurophysiologically plausible architecture that can computationally solve both saliency and tracking.

You Only Look Once: Unified, Real-Time Object Detection

Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background, and outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks

This integrated framework for using Convolutional Networks for classification, localization and detection is the winner of the localization task of the ImageNet Large Scale Visual Recognition Challenge 2013 and obtained very competitive results for the detection and classification tasks.

SiamRPN++: Evolution of Siamese Visual Tracking With Very Deep Networks

This work proves that the core reason Siamese trackers still have an accuracy gap is the lack of strict translation invariance, and proposes a new model architecture that performs depth-wise and layer-wise aggregations, which not only improves accuracy but also reduces model size.