Consistent depth of moving objects in video

@article{Zhang2021ConsistentDO,
  title={Consistent depth of moving objects in video},
  author={Zhoutong Zhang and Forrester Cole and Richard Tucker and William T. Freeman and Tali Dekel},
  journal={ACM Transactions on Graphics (TOG)},
  year={2021},
  volume={40},
  pages={1--12}
}
We present a method to estimate depth of a dynamic scene, containing arbitrary moving objects, from an ordinary video captured with a moving camera. We seek a geometrically and temporally consistent solution to this under-constrained problem: the depth predictions of corresponding points across frames should induce plausible, smooth motion in 3D. We formulate this objective in a new test-time training framework where a depth-prediction CNN is trained in tandem with an auxiliary scene-flow…
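The consistency objective described in the abstract can be sketched in simplified form: depth predictions at corresponding pixels of two frames are lifted to 3D, and the displacement between them is compared against a predicted 3D scene flow. This is a minimal illustration only, ignoring camera pose; the helper names `backproject` and `scene_flow_consistency` are hypothetical and not from the paper's code.

```python
import numpy as np

def backproject(depth, pix, K_inv):
    """Lift 2D pixels (N, 2) with per-pixel depth (N,) to 3D camera-space
    points (N, 3) using the inverse intrinsics matrix K_inv."""
    homo = np.concatenate([pix, np.ones((pix.shape[0], 1))], axis=1)  # (N, 3)
    return depth[:, None] * (homo @ K_inv.T)

def scene_flow_consistency(depth_a, depth_b, pix_a, pix_b, K_inv, flow_3d):
    """Mean squared residual between a predicted 3D scene flow and the 3D
    displacement implied by the depth predictions at corresponding pixels
    of two frames (camera motion omitted for simplicity)."""
    pts_a = backproject(depth_a, pix_a, K_inv)
    pts_b = backproject(depth_b, pix_b, K_inv)
    return np.mean(np.sum((pts_b - (pts_a + flow_3d)) ** 2, axis=1))
```

When the scene flow exactly explains the depth-induced displacement, the residual is zero; at test time, minimizing it jointly over the depth and flow predictions encourages the plausible, smooth 3D motion the abstract describes.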

References

Showing 1–10 of 98 references
Web Stereo Video Supervision for Depth Prediction from Dynamic Scenes
Presents a fully data-driven method to compute depth from diverse monocular video sequences containing large numbers of non-rigid objects (e.g., people), with a loss function that allows depth prediction even when camera intrinsics and stereo baselines in the dataset are unknown.
Video Pop-up: Monocular 3D Reconstruction of Dynamic Scenes
An unsupervised approach that simultaneously segments the scene into its constituent objects and reconstructs a 3D model of the scene; its motion-segmentation component is evaluated on the Berkeley Motion Segmentation Dataset.
Learning the Depths of Moving People by Watching Frozen People
  • Z. Li, Tali Dekel, +4 authors W. Freeman
  • 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2019
This paper takes a data-driven approach and learns human depth priors from a new source of data: thousands of Internet videos of people imitating mannequins, i.e., freezing in diverse, natural poses, while a hand-held camera tours the scene.
Every Pixel Counts: Unsupervised Geometry Learning with Holistic 3D Motion Understanding
Experiments on the KITTI 2015 dataset show that the estimated geometry, 3D motion, and moving-object masks are not only constrained to be consistent but also significantly outperform other state-of-the-art algorithms, demonstrating the benefits of the approach.
Fast Multi-frame Stereo Scene Flow with Motion Segmentation
A new multi-frame method for efficiently computing scene flow and camera ego-motion for a dynamic scene observed from a moving stereo camera rig; it consistently outperforms OSF, currently ranked second on the KITTI benchmark.
Novel View Synthesis of Dynamic Scenes With Globally Coherent Depths From a Monocular Camera
Presents a new method to synthesize an image from arbitrary views and times given a collection of images of a dynamic scene; depth estimation and view synthesis are evaluated on diverse real-world dynamic scenes, showing strong performance over existing methods.
Unsupervised Learning of Depth and Ego-Motion from Video
Empirical evaluation demonstrates the effectiveness of the unsupervised learning framework: monocular depth estimation performs comparably with supervised methods that use either ground-truth pose or depth for training, and pose estimation performs favorably compared to established SLAM systems under comparable input settings.
MannequinChallenge: Learning the Depths of Moving People by Watching Frozen People.
  • Z. Li, Tali Dekel, +4 authors W. Freeman
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 2020
Dense Monocular Depth Estimation in Complex Dynamic Scenes
Provides a novel motion segmentation algorithm that segments the optical-flow field into a set of motion models, each with its own epipolar geometry, and shows that the scene can be reconstructed from these motion models by optimizing a convex program.
Layered neural rendering for retiming people in video
A key property of this model is that it not only disentangles the direct motions of each person in the input video, but also automatically correlates each person with the scene changes they generate, e.g., shadows, reflections, and motion of loose clothing.