Correlation-Aware Deep Tracking

  • Fei Xie, Chunyu Wang, Guangting Wang, Yue Cao, Wankou Yang, Wenjun Zeng
  • Published 3 March 2022
  • Computer Science
  • 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Robustness and discrimination power are two fundamental requirements in visual object tracking. In most tracking paradigms, we find that the features extracted by the popular Siamese-like networks cannot fully discriminatively model the tracked targets and distractor objects, hindering them from simultaneously meeting these two requirements. While most methods focus on designing robust correlation operations, we propose a novel target-dependent feature network inspired by the self-/cross… 

EgoTracks: A Long-term Egocentric Visual Object Tracking Dataset

EgoTracks is a new dataset for long-term egocentric visual object tracking, sourced from the Ego4D dataset, and presents a challenge to recent state-of-the-art single-object tracking models, which score poorly on traditional tracking metrics for this new dataset, compared to popular benchmarks.

Beyond SOT: It's Time to Track Multiple Generic Objects at Once

This work introduces a new large-scale GOT benchmark, LaGOT, containing multiple annotated target objects per sequence, and proposes a Transformer-based GOT tracker TaMOS capable of joint processing of multiple objects through shared computation.

Object Tracking Based on a Time-Varying Spatio-Temporal Regularized Correlation Filter With Aberrance Repression

An object tracking model based on a time-varying spatio-temporal regularized correlation filter with aberrance repression is proposed, which outperforms many state-of-the-art trackers based on DCF and deep-learning frameworks in terms of tracking accuracy, tracking success rate, and A-R rank.

MLPT: Multilayer Perceptron based Tracking

This paper presents a simple yet effective Multilayer Perceptron-based Tracking (MLPT) method with a global receptive field, which is the first MLP-based baseline architecture for object tracking.

SRRT: Search Region Regulation Tracking

A novel tracking paradigm is proposed, called Search Region Regulation Tracking (SRRT), which applies a proposed search region regulator to estimate an optimal search region dynamically for every frame to adapt to the object’s appearance variation during tracking.

A robust spatial-temporal correlation filter tracker for efficient UAV visual tracking

In this work, a robust spatial-temporal correlation filter, i.e., the temporal regularized background-aware correlation filter (TRBCF), is proposed, which improves the discriminability between the target and background in the spatial domain, and achieves continuous tracking in temporal sequences.

Learning Localization-aware Target Confidence for Siamese Visual Tracking

The proposed SiamLA tracking paradigm achieves state-of-the-art performance in terms of both accuracy and efficiency, and is relatively stable, implying the paradigm has potential for real-world applications.

MixFormer: End-to-End Tracking with Iterative Mixed Attention

This paper proposes a compact tracking framework, termed MixFormer, built upon transformers to exploit the flexibility of attention operations, and proposes a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration.

SwinTrack: A Simple and Strong Baseline for Transformer Tracking

This paper proposes a simple yet efficient fully-attentional tracker, dubbed SwinTrack, within the classic Siamese framework that leverages the Transformer architecture, enabling better feature interactions for tracking than pure CNN or hybrid CNN-Transformer frameworks.



Transformer Tracking

This work presents a novel attention-based feature fusion network, which effectively combines the template and search region features solely using attention. It also presents a Transformer tracking method based on the Siamese-like feature extraction backbone, the designed attention-based fusion mechanism, and the classification and regression head.

ResT: An Efficient Transformer for Visual Recognition

Experimental results show that the proposed ResT can outperform recent state-of-the-art backbones by a large margin, demonstrating the potential of ResT as a strong backbone.

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

The Pyramid Vision Transformer (PVT) is introduced, which overcomes the difficulties of porting Transformer to various dense prediction tasks and is validated through extensive experiments, showing that it boosts the performance of many downstream tasks, including object detection, instance and semantic segmentation.

SiamFC++: Towards Robust and Accurate Visual Tracking with Target Estimation Guidelines

This work proposes a set of practical guidelines of target state estimation for high-performance generic object tracker design and designs the Fully Convolutional Siamese tracker++ (SiamFC++), which achieves state-of-the-art performance on five challenging benchmarks, which proves both the tracking and generalization ability of the tracker.

Video Object Segmentation Using Space-Time Memory Networks

This work proposes a novel solution for semi-supervised video object segmentation by leveraging memory networks and learning to read relevant information from all available sources to better handle challenges such as appearance changes and occlusions.

Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression

This paper introduces a generalized version of IoU (GIoU) as a loss into state-of-the-art object detection frameworks, and shows a consistent improvement in their performance using both the standard IoU-based and the new GIoU-based performance measures on popular object detection benchmarks.
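
For readers unfamiliar with the metric, the following is a minimal sketch of GIoU for two boxes, not the paper's reference code. It assumes axis-aligned boxes given as (x1, y1, x2, y2) with positive area; the function names `giou` and `giou_loss` are illustrative. GIoU subtracts from IoU the fraction of the smallest enclosing box that is not covered by the union, so it stays informative even when the boxes do not overlap.

```python
def giou(box_a, box_b):
    """Generalized IoU of two axis-aligned boxes (x1, y1, x2, y2).

    Assumes both boxes have positive area (x2 > x1 and y2 > y1).
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = area_a + area_b - inter
    iou = inter / union

    # Smallest axis-aligned box enclosing both inputs.
    enclose_w = max(ax2, bx2) - min(ax1, bx1)
    enclose_h = max(ay2, by2) - min(ay1, by1)
    enclose = enclose_w * enclose_h

    # Penalize the part of the enclosing box not covered by the union.
    return iou - (enclose - union) / enclose


def giou_loss(box_a, box_b):
    """Loss form: 0 for identical boxes, approaching 2 for distant boxes."""
    return 1.0 - giou(box_a, box_b)
```

Identical boxes give a GIoU of 1, while two disjoint boxes give a negative value that decreases as they move apart, which is what lets the loss provide a signal where plain IoU is flat at zero.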

ATOM: Accurate Tracking by Overlap Maximization

This work proposes a novel tracking architecture, consisting of dedicated target estimation and classification components, and introduces a classification component that is trained online to guarantee high discriminative power in the presence of distractors.

GOT-10k: A Large High-Diversity Benchmark for Generic Object Tracking in the Wild

GOT-10k is a large tracking database that offers an unprecedentedly wide coverage of common moving objects in the wild. It is the first video trajectory dataset that uses the semantic hierarchy of WordNet to guide class population, which ensures a comprehensive and relatively unbiased coverage of diverse moving objects.

A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation

This work presents a new benchmark dataset and evaluation methodology for the area of video object segmentation, named DAVIS (Densely Annotated VIdeo Segmentation), and provides a comprehensive analysis of several state-of-the-art segmentation approaches using three complementary metrics.

Deep Residual Learning for Image Recognition

This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.