Mix3D: Out-of-Context Data Augmentation for 3D Scenes

Alexey Nekrasov, Jonas Schult, Or Litany, B. Leibe, Francis Engelmann. 2021 International Conference on 3D Vision (3DV).
We present Mix3D, a data augmentation technique for segmenting large-scale 3D scenes. Since scene context helps in reasoning about object semantics, current works focus on models with large capacity and receptive fields that can fully capture the global context of an input 3D scene. However, strong contextual priors can have detrimental implications, like mistaking a pedestrian crossing the street for a car. In this work, we focus on the importance of balancing global scene context and local…
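The core of Mix3D is lightweight: two training scenes are combined into a single out-of-context sample by overlapping their point clouds and concatenating their labels. A minimal sketch in plain Python (the function name, tuple-based point representation, and the centering step are illustrative assumptions, not the authors' exact implementation):

```python
def mix3d(points_a, labels_a, points_b, labels_b):
    """Sketch of Mix3D-style augmentation: build one out-of-context
    training sample from two scenes by overlapping their point clouds."""
    def center(pts):
        # Shift a scene so its centroid sits at the origin,
        # making the two point clouds overlap in space.
        n = len(pts)
        cx = sum(p[0] for p in pts) / n
        cy = sum(p[1] for p in pts) / n
        cz = sum(p[2] for p in pts) / n
        return [(p[0] - cx, p[1] - cy, p[2] - cz) for p in pts]

    # Union of points and of their per-point semantic labels.
    mixed_points = center(points_a) + center(points_b)
    mixed_labels = list(labels_a) + list(labels_b)
    return mixed_points, mixed_labels
```

A real pipeline would also apply random rotations and flips before mixing; the sketch keeps only the mixing step itself.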

Towards 3D Scene Understanding by Referring Synthetic Models

This paper explores how synthetic models can alleviate the real-scene annotation burden: taking labelled 3D synthetic models as the reference for supervision, the neural network aims to recognize specific categories of objects in a real scene scan, without any scene annotation for supervision.

HyperDet3D: Learning a Scene-conditioned 3D Object Detector

This paper proposes a discriminative Multi-head Scene-specific Attention (MSA) module that dynamically controls the layer parameters of the detector conditioned on fused scene-conditioned knowledge, and achieves state-of-the-art results on the 3D object detection benchmarks of the ScanNet and SUN RGB-D datasets.

Contrastive Learning with Dynamic Correspondences for 3D Scene Understanding (Supplemental Material)

Alternative 3D Backbone vs 4D Pre-training. Our 4D pre-training can help to learn objectness priors from dynamic object movement, in contrast to multiple 3D backbones. To demonstrate this, we…

Language-Grounded Indoor 3D Semantic Segmentation in the Wild

A language-driven pre-training method that encourages learned 3D features with limited training examples to lie close to their pre-trained text embeddings; it consistently outperforms state-of-the-art 3D pre-training for 3D semantic segmentation on the proposed benchmark.

DODA: Data-Oriented Sim-to-Real Domain Adaptation for 3D Semantic Segmentation

This work proposes a DODA framework to mitigate pattern and context gaps caused by different sensing mechanisms and layout placements across domains, and surpasses existing UDA approaches by over 13% on both 3D-FRONT → ScanNet and 3D-FRONT → S3DIS.

Semantic Instance Segmentation of 3D Scenes Through Weak Bounding Box Supervision

This work shows that it is possible to train dense segmentation models using only weak bounding box labels, and obtains, for the first time, compelling 3D instance segmentation results.

SegGroup: Seg-Level Supervision for 3D Instance and Semantic Segmentation

It is discovered that the locations of instances matter for both instance and semantic 3D scene segmentation, and a weakly-supervised point cloud segmentation method is designed that only requires clicking on one point per instance to indicate its location for annotation.

Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic Segmentation

This work proposes an end-to-end trainable multi-view aggregation model that leverages the viewing conditions of 3D points to merge features from images taken at arbitrary positions. The model can combine standard 2D and 3D networks and outperforms both 3D models operating on colorized point clouds and hybrid 2D/3D networks, without requiring colorization, meshing, or true depth maps.

4DContrast: Contrastive Learning with Dynamic Correspondences for 3D Scene Understanding

A new approach is presented that instills 4D dynamic object priors into learned 3D representations through unsupervised pre-training; the 4D pre-training method is shown to improve downstream tasks, e.g. object detection mAP@0.5 by 5.5%, and to improve performance on SUN RGB-D.

Learning Object Placement by Inpainting for Compositional Data Augmentation

This work proposes a self-learning framework that automatically generates the necessary training data without any manual labeling by detecting, cutting, and inpainting objects from an image. It further proposes PlaceNet, which predicts a diverse distribution of common-sense locations given a foreground object and a background scene.

Augmented Reality Meets Computer Vision: Efficient Data Generation for Urban Driving Scenes

This work proposes an alternative paradigm that combines real and synthetic data for learning semantic instance segmentation and object detection models, and introduces a novel dataset of augmented urban driving scenes with 360-degree images that serve as environment maps to create realistic lighting and reflections on rendered objects.

ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes

This work introduces ScanNet, an RGB-D video dataset containing 2.5M views in 1513 scenes annotated with 3D camera poses, surface reconstructions, and semantic segmentations, and shows that using this data helps achieve state-of-the-art performance on several 3D scene understanding tasks.

OccuSeg: Occupancy-Aware 3D Instance Segmentation

This paper defines the “3D occupancy size” of an instance as the number of voxels it occupies, and proposes OccuSeg, an occupancy-aware 3D instance segmentation scheme that achieves state-of-the-art performance on three real-world datasets, i.e. ScanNetV2, S3DIS, and SceneNN, while maintaining high efficiency.
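The “3D occupancy size” measure is straightforward to compute from a labelled point cloud: quantize points to a voxel grid and count the distinct voxels each instance touches. A hedged sketch (the function name, tuple-based points, and the default `voxel_size` are assumptions for illustration, not OccuSeg's implementation):

```python
import math

def occupancy_sizes(points, instance_ids, voxel_size=0.05):
    """Count, per instance, the number of distinct voxels its points
    occupy on a regular grid with edge length voxel_size (meters)."""
    occupied = {}
    for (x, y, z), inst in zip(points, instance_ids):
        # Quantize the point to integer voxel grid coordinates.
        voxel = (math.floor(x / voxel_size),
                 math.floor(y / voxel_size),
                 math.floor(z / voxel_size))
        occupied.setdefault(inst, set()).add(voxel)
    return {inst: len(voxels) for inst, voxels in occupied.items()}
```

In OccuSeg this per-instance quantity serves as a regression target that guides clustering; the sketch covers only the definition itself.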

Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation

A systematic study of Copy-Paste augmentation for instance segmentation, in which objects are randomly pasted onto an image, finds that this simple mechanism of pasting objects randomly is good enough and provides solid gains on top of strong baselines.
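The pasting mechanism itself is easy to sketch: wherever the source object's mask is set, copy the source pixel over the destination image and mark the destination mask. A minimal illustrative version on nested-list images (the names and the boolean-mask representation are assumptions, not the paper's implementation):

```python
def copy_paste(src_img, src_mask, dst_img, dst_mask):
    """Paste the masked object from src_img onto dst_img.

    src_mask/dst_mask are H x W boolean masks; inputs are left unmodified.
    """
    out_img = [row[:] for row in dst_img]
    out_mask = [row[:] for row in dst_mask]
    for i, row in enumerate(src_mask):
        for j, on in enumerate(row):
            if on:
                out_img[i][j] = src_img[i][j]
                out_mask[i][j] = True  # pasted object occludes destination
    return out_img, out_mask
```

A full pipeline would also rescale and jitter the pasted object and update per-instance annotations; the sketch shows only the pixel-level paste.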

Exploring Spatial Context for 3D Semantic Segmentation of Point Clouds

This paper builds upon PointNet and proposes two extensions that enlarge the receptive field over the 3D scene; the proposed strategies are evaluated on challenging indoor and outdoor datasets and show improved results in both scenarios.

Not Using the Car to See the Sidewalk — Quantifying and Controlling the Effects of Context in Classification and Segmentation

A method is proposed to quantify the sensitivity of black-box vision models to visual context by editing images to remove selected objects and measuring the response of the target models; the proposed data augmentation is shown to help these models improve performance in out-of-context scenarios while preserving performance on regular data.

DOPS: Learning to Detect 3D Objects and Predict Their 3D Shapes

The core novelty of the DOPS method is a fast, single-pass architecture that both detects objects in 3D and estimates their shapes, and is able to extract shapes without access to ground-truth shape information in the target dataset.

CoReNet: Coherent 3D scene reconstruction from a single RGB image

The model is adapted to address the harder task of reconstructing multiple objects from a single image, producing a coherent reconstruction, where all objects live in a single consistent 3D coordinate frame relative to the camera and they do not intersect in 3D space.

Microsoft COCO: Common Objects in Context

We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene…