The Temporal Opportunist: Self-Supervised Multi-Frame Monocular Depth

@inproceedings{Watson2021TheTO,
  title={The Temporal Opportunist: Self-Supervised Multi-Frame Monocular Depth},
  author={Jamie Watson and Oisin Mac Aodha and Victor Adrian Prisacariu and Gabriel J. Brostow and Michael Firman},
  booktitle={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021},
  pages={1164--1174}
}
Self-supervised monocular depth estimation networks are trained to predict scene depth using nearby frames as a supervision signal during training. However, for many applications, sequence information in the form of video frames is also available at test time. The vast majority of monocular networks do not make use of this extra signal, thus ignoring valuable information that could be used to improve the predicted depth. Those that do either use computationally expensive test-time refinement…
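The supervision signal the abstract describes is typically a photometric reprojection loss: predicted depth and relative camera pose are used to warp a nearby frame into the current view, and the pixel-wise difference is minimized. The following is a minimal NumPy sketch of that idea under simplifying assumptions (pinhole camera, nearest-neighbour sampling, no occlusion handling); function names such as photometric_loss are illustrative, not the paper's implementation.

```python
import numpy as np

def backproject(depth, K_inv):
    """Lift each pixel to a 3D point in the camera frame using predicted depth."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1)  # homogeneous pixels
    return K_inv @ pix * depth.reshape(1, -1)  # 3 x (h*w) points

def reproject(points, K, R, t):
    """Project 3D points into the source camera given relative pose (R, t)."""
    cam = K @ (R @ points + t.reshape(3, 1))
    return cam[:2] / np.clip(cam[2:], 1e-6, None)  # 2 x (h*w) pixel coordinates

def photometric_loss(target, source, depth, K, R, t):
    """Mean L1 error between the target frame and the source frame sampled at
    reprojected coordinates (nearest-neighbour for brevity; real training
    pipelines use differentiable bilinear sampling)."""
    h, w = target.shape
    pts = backproject(depth, np.linalg.inv(K))
    uv = reproject(pts, K, R, t)
    u = np.clip(np.round(uv[0]).astype(int), 0, w - 1)
    v = np.clip(np.round(uv[1]).astype(int), 0, h - 1)
    warped = source[v, u].reshape(h, w)
    return np.abs(target - warped).mean()
```

With an identity rotation and zero translation the warp reduces to the identity mapping, so evaluating a frame against itself gives zero loss; during training, the loss is minimized over the depth (and pose) network outputs for genuinely different nearby frames.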
Related Papers

Self-supervised recurrent depth estimation with attention mechanisms
This work proposes a set of modifications that utilize temporal information from previous frames, providing new neural network architectures for self-supervised monocular depth estimation, and shows that the proposed modifications are an effective tool for exploiting temporal information in a depth prediction pipeline.
Multi-Frame Self-Supervised Depth with Transformers
Experiments show that the DepthFormer architecture establishes a new state of the art in self-supervised monocular depth estimation and is even competitive with highly specialized supervised single-frame architectures; the learned cross-attention network also yields representations that transfer across datasets, increasing the effectiveness of pre-training strategies.
Self-Supervised Monocular Depth Estimation with Internal Feature Fusion
This work proposes a novel depth estimation network, DIFFNet, which exploits semantic information in its downsampling and upsampling procedures by applying feature fusion and an attention mechanism, and outperforms state-of-the-art monocular depth estimation methods on the KITTI benchmark.
Instance-aware multi-object self-supervision for monocular depth prediction
The proposed self-supervised monocular image-to-depth prediction framework is shown to largely outperform existing methods on standard benchmarks, and the impact of dynamic object motion on these benchmarks is exposed.
Forecasting of depth and ego-motion with transformers and self-supervision
The architecture combines convolution and transformer modules, leveraging the benefits of both: the inductive bias of CNNs and the multi-head attention of transformers, enabling a rich spatio-temporal representation for accurate depth forecasting.
Disentangling Object Motion and Occlusion for Unsupervised Multi-frame Monocular Depth
The method, called DynamicDepth, is a new framework trained via a self-supervised cycle-consistent learning scheme to solve the mismatch problem, and it significantly outperforms state-of-the-art monocular depth prediction methods, especially in regions containing dynamic objects.
SUB-Depth: Self-distillation and Uncertainty Boosting Self-supervised Monocular Depth Estimation
This work proposes SUB-Depth, a universal multi-task training framework for self-supervised monocular depth estimation (SDE), with homoscedastic uncertainty formulations for each task to penalise areas likely to be affected by teacher-network noise or to violate SDE assumptions.
Learning Optical Flow, Depth, and Scene Flow Without Real-World Labels
This letter proposes DRAFT, a new method capable of jointly learning depth, optical flow, and scene flow by combining synthetic data with geometric self-supervision; the method builds upon the RAFT architecture.
MonoRec: Semi-Supervised Dense Reconstruction in Dynamic Environments from a Single Moving Camera
A semi-supervised monocular dense reconstruction architecture that predicts depth maps from a single moving camera in dynamic environments and introduces a MaskModule that predicts moving object masks by leveraging the photometric inconsistencies encoded in the cost volumes.

References

Showing 1–10 of 105 references
Self-Supervised Joint Learning Framework of Depth Estimation via Implicit Cues
This work proposes a novel self-supervised joint learning framework for depth estimation from consecutive frames of monocular and stereo videos, using an implicit depth cue extractor that leverages dynamic and static cues to generate useful depth proposals.
Self-Supervised Monocular Trained Depth Estimation Using Self-Attention and Discrete Disparity Volume
Extending the state-of-the-art self-supervised monocular trained depth estimator Monodepth2 with these two ideas yields a model that produces the best results in the field on KITTI 2015 and Make3D, closing the gap with self-supervised stereo training and fully supervised approaches.
Don’t Forget The Past: Recurrent Depth Estimation from Monocular Video
This work is the first to successfully exploit recurrent networks for real-time self-supervised monocular depth estimation and completion, and it outperforms previous depth estimation methods from the three popular groups.
CoMoDA: Continuous Monocular Depth Adaptation Using Past Experiences
This paper proposes a novel self-supervised Continuous Monocular Depth Adaptation method (CoMoDA), which adapts the pretrained model to a test video on the fly, achieving state-of-the-art depth estimation performance and surpassing all existing methods that use standard architectures.
Monocular Depth Estimation with Self-supervised Instance Adaptation
This work proposes a new approach that extends any off-the-shelf self-supervised monocular depth reconstruction system to use more than one image at test time, coming very close in accuracy to fully supervised state-of-the-art methods.
Unsupervised Scale-consistent Depth and Ego-motion Learning from Monocular Video
This paper proposes a geometry consistency loss for scale-consistent predictions and an induced self-discovered mask for handling moving objects and occlusions, and it is the first work to show that deep networks trained on unlabelled monocular videos can predict globally scale-consistent camera trajectories over a long video sequence.
Self-Supervised Monocular Depth Hints
This work studies the problem of ambiguous reprojections in depth-prediction from stereo-based self-supervision, and introduces Depth Hints to alleviate their effects, and produces state-of-the-art depth predictions on the KITTI benchmark.
Digging Into Self-Supervised Monocular Depth Estimation
It is shown that a surprisingly simple model, and associated design choices, lead to superior predictions, and together result in both quantitatively and qualitatively improved depth maps compared to competing self-supervised methods.
DepthNet: A Recurrent Neural Network Architecture for Monocular Depth Prediction
This paper proposes a novel convolutional LSTM (ConvLSTM)-based network architecture for depth prediction from a monocular video sequence, harnessing the ability of long short-term memory (LSTM)-based RNNs to reason sequentially and predict the depth map for an image frame as a function of the appearances of scene objects in that frame.
Recurrent Neural Network for (Un-)Supervised Learning of Monocular Video Visual Odometry and Depth
A learning-based multi-view dense depth map and odometry estimation method that uses Recurrent Neural Networks (RNNs) and trains with multi-view image reprojection and forward-backward flow-consistency losses, producing superior results on the KITTI driving dataset.