Corpus ID: 143423340

Self-supervised Learning for Video Correspondence Flow

@article{Lai2019SelfsupervisedLF,
  title={Self-supervised Learning for Video Correspondence Flow},
  author={Zihang Lai and Weidi Xie},
  journal={ArXiv},
  year={2019},
  volume={abs/1905.00875}
}
The objective of this paper is self-supervised learning of feature embeddings that are suitable for matching correspondences along the videos, which we term correspondence flow. By leveraging the natural spatial-temporal coherence in videos, we propose to train a "pointer" that reconstructs a target frame by copying pixels from a reference frame. We make the following contributions: First, we introduce a simple information bottleneck that forces the model to learn robust features for…
Contrastive Transformation for Self-supervised Correspondence Learning
TLDR
This method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation, and obtains discriminative representations for instance-level separation to facilitate contrastive transformation across different videos.
Self-supervised Video Object Segmentation
TLDR
This paper proposes to improve the existing self-supervised approach, with a simple, yet more effective memory mechanism for long-term correspondence matching, which resolves the challenge caused by the disappearance and reappearance of objects.
Joint-task Self-supervised Learning for Temporal Correspondence
TLDR
This method outperforms the state-of-the-art self-supervised methods on a variety of visual correspondence tasks, including video-object and part-segmentation propagation, keypoint tracking, and object tracking.
MAST: A Memory-Augmented Self-Supervised Tracker
TLDR
A dense tracking model trained on videos without any annotations is proposed that surpasses previous self-supervised methods on existing benchmarks by a significant margin, and achieves performance comparable to supervised methods.
Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting
TLDR
A novel method which predicts consistent prototype assignments from both RGB and optical flow views, operating on sets of samples, and obtains state-of-the-art results on nearest-neighbour video retrieval and action recognition.
Self-Supervised Video Object Segmentation by Motion-Aware Mask Propagation
TLDR
Evaluations on the DAVIS-2017 and YouTube-VOS datasets show that MAMP achieves state-of-the-art performance with stronger generalization ability than existing self-supervised methods, i.e. 4.9% higher mean J&F on DAVIS-2017 and 4.85% higher mean J&F on the unseen categories of YouTube-VOS than the nearest competitor.
Self-supervised Object Tracking with Cycle-consistent Siamese Networks
TLDR
This work proposes to integrate a Siamese region proposal and mask regression network in a cycle-consistent self-supervised framework for object tracking, so that a fast and more accurate tracker can be learned without per-frame annotations.
Correspondence Networks With Adaptive Neighbourhood Consensus
TLDR
This paper proposes a convolutional neural network architecture, called adaptive neighbourhood consensus network (ANC-Net), that can be trained end-to-end with sparse key-point annotations, to handle the task of establishing dense visual correspondences between images containing objects of the same category.
Self-supervised Video Object Segmentation by Motion Grouping
TLDR
A simple variant of the Transformer is introduced to segment optical flow frames into primary objects and the background; it can be trained in a self-supervised manner, i.e. without using any manual annotations, and achieves superior results compared to previous state-of-the-art self-supervised methods on public benchmarks.
Learning Video Correspondence using Appearance Module for Target Tracking
TLDR
A new method for self-supervised video correspondence matching that effectively tracks targets in battlefield situations: it preserves the detailed shape of an object by handling high-resolution information, while the high computational cost of the correlation filter is alleviated by a small search window.

References

SHOWING 1-10 OF 63 REFERENCES
Geometry Guided Convolutional Neural Networks for Self-Supervised Video Representation Learning
TLDR
Geometry is explored as a new type of auxiliary supervision for the self-supervised learning of video representations, and it is found that convolutional neural networks pre-trained with geometry cues can be effectively adapted to semantic video understanding tasks.
Self-supervised Spatiotemporal Feature Learning by Video Geometric Transformations
TLDR
A novel 3D-ConvNet-based, fully self-supervised framework to learn spatiotemporal video features without using any human-labeled annotations; it outperforms the state of the art among fully self-supervised methods on both the UCF101 and HMDB51 datasets, achieving 62.9% and 33.7% accuracy respectively.
AnchorNet: A Weakly Supervised Network to Learn Geometry-Sensitive Features for Semantic Matching
TLDR
This work proposes a deep network, termed AnchorNet, that produces image representations well-suited for semantic matching; trained only with weak image-level labels, it improves the results of state-of-the-art semantic matching methods such as Deformable Spatial Pyramid and Proposal Flow.
Self-Supervised Spatiotemporal Feature Learning via Video Rotation Prediction.
TLDR
With the self-supervised 3DRotNet pre-trained on large datasets, recognition accuracy is boosted by 20.4% on UCF101 and 16.7% on HMDB51, respectively, compared to models trained from scratch.
End-to-End Weakly-Supervised Semantic Alignment
TLDR
A convolutional neural network architecture for semantic alignment that is trainable end-to-end from weak image-level supervision in the form of matching image pairs; it scores the quality of an alignment using only geometrically consistent correspondences, thereby reducing the effect of background clutter.
UnFlow: Unsupervised Learning of Optical Flow with a Bidirectional Census Loss
TLDR
This work designs an unsupervised loss based on occlusion-aware bidirectional flow estimation and the robust census transform to circumvent the need for ground truth flow, enabling generic pre-training of supervised networks for datasets with limited amounts of ground truth.
Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles
TLDR
This paper introduces a new self-supervised task called Space-Time Cubic Puzzles, which requires a network to arrange permuted 3D spatio-temporal crops; in solving it, the network learns both the spatial appearance and the temporal relations of video frames, which is the final goal.
Learning Correspondence From the Cycle-Consistency of Time
TLDR
A self-supervised method to use cycle-consistency in time as free supervisory signal for learning visual representations from scratch and demonstrates the generalizability of the representation -- without finetuning -- across a range of visual correspondence tasks, including video object segmentation, keypoint tracking, and optical flow.
Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification
TLDR
This paper presents an approach for learning a visual representation from raw spatiotemporal signals in videos using a convolutional neural network, and shows that the method captures information that is temporally varying, such as human pose.
Thin-Slicing Network: A Deep Structured Model for Pose Estimation in Videos
TLDR
The proposed end-to-end architecture is evaluated on two widely used benchmarks for video-based pose estimation (Penn Action and JHMDB datasets) and outperforms several state-of-the-art methods.