Spatio-temporal Self-Supervised Representation Learning for 3D Point Clouds

@inproceedings{Huang2021SpatiotemporalSR,
  title={Spatio-temporal Self-Supervised Representation Learning for 3D Point Clouds},
  author={Siyuan Huang and Yichen Xie and Song-Chun Zhu and Yixin Zhu},
  booktitle={2021 IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2021},
  pages={6515-6525}
}
To date, various 3D scene understanding tasks still lack practical and generalizable pre-trained models, primarily due to the intricate nature of 3D scene understanding tasks and their immense variations introduced by camera views, lighting, occlusions, etc. In this paper, we tackle this challenge by introducing a spatio-temporal representation learning (STRL) framework, capable of learning from unlabeled 3D point clouds in a self-supervised fashion. Inspired by how infants learn from visual… 
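The abstract describes a framework that learns by associating two spatio-temporally augmented views of the same unlabeled point cloud. A minimal PyTorch sketch of such a BYOL-style online/target objective follows; `online`, `target`, and `predictor` are placeholder encoder/MLP modules, and none of this is the authors' released code.

```python
import torch
import torch.nn.functional as F

def byol_loss(p, z):
    """Negative-cosine loss between online prediction p and target embedding z."""
    p = F.normalize(p, dim=-1)
    z = F.normalize(z.detach(), dim=-1)  # stop-gradient: the target branch gives no gradients
    return 2 - 2 * (p * z).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(target, online, m=0.996):
    # Target weights trail the online weights as an exponential moving average.
    for pt, po in zip(target.parameters(), online.parameters()):
        pt.data.mul_(m).add_(po.data, alpha=1 - m)

def strl_step(online, target, predictor, view1, view2, optimizer):
    # Symmetrized loss over the two augmented views of the same point cloud.
    loss = byol_loss(predictor(online(view1)), target(view2)) \
         + byol_loss(predictor(online(view2)), target(view1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(target, online)
    return loss.item()
```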

Citations

Self-Supervised Point Cloud Representation Learning with Occlusion Auto-Encoder
TLDR
A novel self-supervised point cloud representation learning framework, named 3D Occlusion Auto-Encoder (3D-OAE), which removes a large proportion of patches and predicts them from only a small number of visible patches, enabling it to significantly accelerate training and yield nontrivial self-supervisory performance.
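
The core operation this TLDR describes, hiding most point patches and keeping only a few visible ones, can be sketched as follows; the grouping into `(B, G, K, 3)` patches (e.g., by farthest-point sampling plus kNN) is assumed to have happened upstream, and all names are illustrative.

```python
import torch

def mask_patches(patches, mask_ratio=0.75):
    """Randomly hide a large fraction of point patches, as in masked
    point-cloud autoencoding. patches: (B, G, K, 3) groups of K points."""
    B, G = patches.shape[:2]
    n_keep = max(1, int(G * (1 - mask_ratio)))
    noise = torch.rand(B, G)                      # random score per patch
    keep_idx = noise.argsort(dim=1)[:, :n_keep]   # patches with lowest scores are kept
    batch = torch.arange(B).unsqueeze(1)
    visible = patches[batch, keep_idx]            # (B, n_keep, K, 3) encoder input
    mask = torch.ones(B, G, dtype=torch.bool)
    mask[batch, keep_idx] = False                 # True = masked (to be predicted)
    return visible, mask
```
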
Self-Supervised Feature Learning from Partial Point Clouds via Pose Disentanglement
TLDR
This paper proposes a novel self-supervised framework to learn informative representations from partial point clouds that not only outperforms existing self-supervised methods but also shows better generalizability across synthetic and real-world datasets.
Unsupervised Learning on 3D Point Clouds by Clustering and Contrasting
TLDR
An Expectation-Maximization-like soft clustering algorithm that provides local supervision to extract discriminative local features based on optimal transport, and an instance-level contrastive method that learns the global geometry by maximizing the similarity between two augmentations of one point cloud.
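
For the optimal-transport soft clustering step mentioned above, the usual relaxation is a few Sinkhorn-Knopp iterations that balance points across clusters; a generic sketch, not necessarily the paper's exact algorithm:

```python
import torch

@torch.no_grad()
def sinkhorn_assign(scores, eps=0.05, iters=3):
    """Balanced soft cluster assignments via Sinkhorn-Knopp iterations.
    scores: (N, K) point-to-prototype similarities, assumed bounded
    (e.g., cosine similarities) so exp() stays stable."""
    Q = torch.exp(scores / eps)
    Q /= Q.sum()                         # total transported mass = 1
    N, K = Q.shape
    for _ in range(iters):
        Q /= Q.sum(dim=0, keepdim=True)  # each cluster receives mass 1/K
        Q /= K
        Q /= Q.sum(dim=1, keepdim=True)  # each point contributes mass 1/N
        Q /= N
    return Q * N                         # rows now sum to 1: soft assignment per point
```
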
Pointly-supervised 3D Scene Parsing with Viewpoint Bottleneck
TLDR
A principled analysis shows that the viewpoint bottleneck leads to an elegant surrogate loss function that is suitable for large-scale point cloud data and has several advantages: it is easy to implement and tune, does not need negative samples, and performs better on the target downstream task.
Image-to-Lidar Self-Supervised Distillation for Autonomous Driving Data
TLDR
A self-supervised pre-training method for 3D perception models that is tailored to autonomous driving data and leverages the synchronized and calibrated image and Lidar sensors in autonomous driving setups to distill self-supervised pre-trained image representations into 3D models.
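
A hedged sketch of the pairing step such distillation relies on: projecting lidar points into the camera with known calibration and sampling the matching pixel features. `K` (intrinsics), `T` (lidar-to-camera extrinsics), and the feature-map layout are illustrative assumptions, not the paper's API.

```python
import torch
import torch.nn.functional as F

def pair_point_pixel_features(pts, img_feat, K, T):
    """pts: (N, 3) lidar points; img_feat: (1, C, H, W) image feature map;
    K: (3, 3) intrinsics; T: (4, 4) lidar->camera extrinsics.
    Returns per-point pixel features and a validity mask."""
    N = pts.shape[0]
    homog = torch.cat([pts, torch.ones(N, 1)], dim=1)        # (N, 4)
    cam = (T @ homog.t()).t()[:, :3]                         # camera-frame coords
    in_front = cam[:, 2] > 0.1                               # drop points behind camera
    uv = (K @ cam.t()).t()
    uv = uv[:, :2] / uv[:, 2:3]                              # pixel coordinates
    H, W = img_feat.shape[-2:]
    grid = torch.stack([uv[:, 0] / (W - 1), uv[:, 1] / (H - 1)], -1) * 2 - 1
    sampled = F.grid_sample(img_feat, grid.view(1, 1, N, 2),
                            align_corners=True).view(-1, N).t()  # (N, C)
    return sampled, in_front  # filter pairs with in_front before distilling
```
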
Implicit Autoencoder for Point Cloud Self-supervised Representation Learning
TLDR
Implicit Autoencoder (IAE) is introduced, a simple yet effective method that addresses the challenge of autoencoding on point clouds by replacing the point cloud decoder with an implicit decoder that outputs a continuous representation shared among different point cloud samplings of the same model.
Distillation with Contrast is All You Need for Self-Supervised Point Cloud Representation Learning
TLDR
A simple and general framework for self-supervised point cloud representation learning that achieves state-of-the-art performance on linear classification and multiple other downstream tasks, combining contrastive learning with knowledge distillation so that the teacher network is better updated.
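
One common way to combine distillation with contrast, sketched under the assumption of an InfoNCE objective between student and (gradient-free) teacher embeddings of the same batch; this is a generic instantiation, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def distill_nce(student_z, teacher_z, temperature=0.07):
    """InfoNCE between student and frozen teacher embeddings: the matching
    pair in the batch is the positive, all other pairs are negatives."""
    s = F.normalize(student_z, dim=-1)
    t = F.normalize(teacher_z.detach(), dim=-1)   # no gradients into the teacher
    logits = s @ t.t() / temperature              # (B, B) similarity matrix
    labels = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, labels)
```
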
Masked Discrimination for Self-Supervised Learning on Point Clouds
TLDR
This paper proposes a discriminative mask pre-training Transformer framework, MaskPoint, for point clouds, which represents the point cloud as discrete occupancy values and performs simple binary classification between masked object points and sampled noise points as the proxy task.
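
The proxy task named in the TLDR, binary discrimination between real masked points and sampled noise points, might look like this; `decoder` and `latent` stand in for the framework's point-query decoder and encoder output.

```python
import torch
import torch.nn.functional as F

def maskpoint_proxy_loss(decoder, latent, masked_pts, bounds):
    """Query the decoder at real (masked) object points and at uniform noise
    points; real points are labeled 1, noise points 0."""
    B, M, _ = masked_pts.shape
    lo, hi = bounds                                    # scene bounding box
    noise = lo + (hi - lo) * torch.rand(B, M, 3, device=masked_pts.device)
    queries = torch.cat([masked_pts, noise], dim=1)    # (B, 2M, 3)
    logits = decoder(latent, queries).squeeze(-1)      # (B, 2M) occupancy logits
    target = torch.cat([torch.ones(B, M), torch.zeros(B, M)], dim=1).to(logits)
    return F.binary_cross_entropy_with_logits(logits, target)
```
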
4DContrast: Contrastive Learning with Dynamic Correspondences for 3D Scene Understanding
TLDR
A new approach that instills 4D dynamic object priors into learned 3D representations by unsupervised pre-training, yielding improvements in downstream 3D semantic segmentation, object detection, and instance segmentation, and notably improving performance in data-scarce scenarios.
Language-Grounded Indoor 3D Semantic Segmentation in the Wild
TLDR
A language-driven pre-training method that encourages learned 3D features with limited training examples to lie close to their pre-trained text embeddings, which consistently outperforms state-of-the-art 3D pre-training for 3D semantic segmentation on a proposed benchmark.

References

Showing 1-10 of 83 references
ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes
TLDR
This work introduces ScanNet, an RGB-D video dataset containing 2.5M views in 1513 scenes annotated with 3D camera poses, surface reconstructions, and semantic segmentations, and shows that using this data helps achieve state-of-the-art performance on several 3D scene understanding tasks.
Representation Learning and Adversarial Generation of 3D Point Clouds
TLDR
This paper introduces a deep autoencoder network for point clouds, which outperforms the state of the art in 3D recognition tasks, and designs GAN architectures to generate novel point clouds.
4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks
TLDR
This work creates an open-source auto-differentiation library for sparse tensors that provides extensive functions for high-dimensional convolutional neural networks and proposes the hybrid kernel, a special case of the generalized sparse convolution, and trilateral-stationary conditional random fields that enforce spatio-temporal consistency in the 7D space-time-chroma space.
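
The open-source library this paper released is MinkowskiEngine; a minimal usage sketch of a generalized sparse convolution over quantized 3D coordinates (shapes illustrative, API as in recent library versions):

```python
import torch
import MinkowskiEngine as ME

# Quantized integer coordinates (batch index prepended) and per-point features.
coords = ME.utils.batched_coordinates([torch.randint(0, 50, (1000, 3))])
feats = torch.rand(1000, 3)
x = ME.SparseTensor(features=feats, coordinates=coords)

# Sparse convolution runs only over occupied coordinates, not a dense grid.
conv = ME.MinkowskiConvolution(in_channels=3, out_channels=16,
                               kernel_size=3, dimension=3)
y = conv(x)
print(y.F.shape)  # (num_occupied_sites, 16) output features
```
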
Self-Supervised Deep Learning on Point Clouds by Reconstructing Space
TLDR
This work proposes a self-supervised learning task for deep learning on raw point cloud data in which a neural network is trained to reconstruct point clouds whose parts have been randomly rearranged, and demonstrates that pre-training with this method before supervised training improves the performance of state-of-the-art models and significantly improves sample efficiency.
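
One plausible construction of the pretext input, splitting a normalized cloud into voxel parts and shuffling the parts' anchor positions so the network must recover each part's original location; this is a sketch of the idea, not the paper's exact procedure.

```python
import torch

def rearrange_parts(pts, splits=3):
    """pts: (N, 3) points normalized to [-1, 1]. Returns the rearranged
    cloud and each point's original part id (the reconstruction target)."""
    cell = 2.0 / splits
    idx = ((pts + 1) / cell).clamp(0, splits - 1e-6).long()   # (N, 3) voxel ids
    flat = idx[:, 0] * splits**2 + idx[:, 1] * splits + idx[:, 2]
    perm = torch.randperm(splits**3)                          # shuffled part positions
    centers = (torch.stack(torch.meshgrid(
        *[torch.arange(splits)] * 3, indexing="ij"), -1)
        .view(-1, 3).float() + 0.5) * cell - 1                # part center coordinates
    shifted = pts - centers[flat] + centers[perm[flat]]       # move each part as a rigid block
    return shifted, flat
```
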
Multi-Angle Point Cloud-VAE: Unsupervised Feature Learning for 3D Point Clouds From Multiple Angles by Joint Self-Reconstruction and Half-to-Half Prediction
TLDR
Results in four shape analysis tasks show that MAP-VAE can learn more discriminative global and local features than state-of-the-art methods.
Cooperative Holistic Scene Understanding: Unifying 3D Object, Layout, and Camera Pose Estimation
TLDR
An end-to-end model that simultaneously solves all three tasks in real time given only a single RGB image, which significantly outperforms prior approaches on 3D object detection, 3D layout estimation, 3D camera pose estimation, and holistic scene understanding.
SUN RGB-D: A RGB-D scene understanding benchmark suite
TLDR
This paper introduces an RGB-D benchmark suite with the goal of advancing the state of the art in all major scene understanding tasks, presenting a dataset that enables training data-hungry algorithms for scene-understanding tasks, evaluating them with meaningful 3D metrics, avoiding overfitting to a small testing set, and studying cross-sensor bias.
PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation
TLDR
This paper designs a novel type of neural network that directly consumes point clouds, which well respects the permutation invariance of points in the input and provides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing.
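
The permutation invariance the TLDR highlights comes from applying a shared per-point MLP followed by a symmetric aggregation (max-pool), so reordering the input points cannot change the output; a minimal illustration (layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Shared per-point MLP + symmetric max-pool: PointNet's key property."""
    def __init__(self, feat_dim=128, n_classes=10):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.head = nn.Linear(feat_dim, n_classes)

    def forward(self, pts):        # pts: (B, N, 3), any point order
        f = self.mlp(pts)          # per-point features, (B, N, D)
        g = f.max(dim=1).values    # order-independent aggregation over points
        return self.head(g)
```
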
Deep Hough Voting for 3D Object Detection in Point Clouds
TLDR
This work proposes VoteNet, an end-to-end 3D object detection network based on a synergy of deep point set networks and Hough voting that achieves state-of-the-art 3D detection on two large datasets of real 3D scans, ScanNet and SUN RGB-D with a simple design, compact model size and high efficiency.
Holistic 3D Scene Parsing and Reconstruction from a Single RGB Image
TLDR
A Holistic Scene Grammar (HSG) is introduced to represent the 3D scene structure, which characterizes a joint distribution over the functional and geometric space of indoor scenes, and significantly outperforms prior methods on 3D layout estimation, 3D object detection, and holistic scene understanding.