Future Urban Scenes Generation Through Vehicles Synthesis

@article{Simoni2020FutureUS,
  title={Future Urban Scenes Generation Through Vehicles Synthesis},
  author={Alessandro Simoni and Luca Bergamini and Andrea Palazzi and Simone Calderara and Rita Cucchiara},
  journal={2020 25th International Conference on Pattern Recognition (ICPR)},
  year={2020}
}
In this work we propose a deep learning pipeline to predict the visual future appearance of an urban scene. Despite recent advances, generating the entire scene in an end-to-end fashion is still far from being achieved. Instead, here we follow a two-stage approach, where interpretable information is included in the loop and each actor is modelled independently. We leverage a per-object novel view synthesis paradigm, i.e., generating a synthetic representation of an object undergoing a…
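The abstract describes a two-stage pipeline in which each actor (vehicle) is modelled independently and then re-inserted into the scene. As a purely illustrative sketch of that decomposition, the toy code below extrapolates each actor's future pose on its own and then composes the results; all names are hypothetical, and the paper's actual stages (novel view synthesis networks) are not reproduced here.

```python
# Hypothetical two-stage "model each actor independently, then compose" sketch.
# Stage 1 stands in for per-object future prediction; Stage 2 for scene composition.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Actor:
    position: Tuple[float, float]   # (x, y) in scene coordinates
    velocity: Tuple[float, float]   # (dx, dy) per time step


def predict_actor(actor: Actor, steps: int) -> Actor:
    """Stage 1 (stand-in): advance one actor independently.
    The paper instead synthesizes a novel view of the vehicle at its
    future pose; here we only extrapolate the pose itself."""
    x, y = actor.position
    dx, dy = actor.velocity
    return Actor(position=(x + dx * steps, y + dy * steps), velocity=actor.velocity)


def compose_scene(actors: List[Actor], steps: int) -> List[Actor]:
    """Stage 2 (stand-in): predicted actors are gathered back into one scene."""
    return [predict_actor(a, steps) for a in actors]


future = compose_scene([Actor((0.0, 0.0), (1.0, 0.5))], steps=4)
print(future[0].position)  # (4.0, 2.0)
```

The point of the two-stage design, as the abstract argues, is interpretability: the intermediate per-actor state can be inspected, unlike an end-to-end generated frame.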


Improving Car Model Classification through Vehicle Keypoint Localization
A novel multi-task framework which aims to improve the performance of car model classification by leveraging visual features and pose information extracted from single RGB images, showing that this approach considerably improves performance on the model classification task.
Multi-Category Mesh Reconstruction From Image Collections
An alternative approach that infers the textured mesh of objects by combining a series of deformable 3D models with instance-specific deformations, pose, and texture; experiments show that the proposed framework can distinguish between different object categories and learn category-specific shape priors in an unsupervised manner.


Weakly-supervised Disentangling with Recurrent Transformations for 3D View Synthesis
A novel recurrent convolutional encoder-decoder network that is trained end-to-end on the task of rendering rotated objects starting from a single image and allows the model to capture long-term dependencies along a sequence of transformations.
Transformation-Grounded Image Generation Network for Novel 3D View Synthesis
We present a transformation-grounded image generation network for novel 3D view synthesis from a single image. Our approach first explicitly infers the parts of the geometry visible both in the input…
Rules of the Road: Predicting Driving Behavior With a Convolutional Model of Semantic Interactions
A unified representation is presented which encodes such high-level semantic information in a spatial grid, allowing the use of deep convolutional models to fuse complex scene context and empirically show that one can effectively learn fundamentals of driving behavior.
View Synthesis by Appearance Flow
This work addresses the problem of novel view synthesis: given an input image, synthesizing new images of the same object or scene observed from arbitrary viewpoints and shows that for both objects and scenes, this approach is able to synthesize novel views of higher perceptual quality than previous CNN-based techniques.
CarFusion: Combining Point Tracking and Part Detection for Dynamic 3D Reconstruction of Vehicles
This work develops a framework to fuse both the single-view feature tracks and multiview detected part locations to significantly improve the detection, localization and reconstruction of moving vehicles, even in the presence of strong occlusions.
A Variational U-Net for Conditional Appearance and Shape Generation
A conditional U-Net is presented for shape-guided image generation, conditioned on the output of a variational autoencoder for appearance, trained end-to-end on images, without requiring samples of the same object with varying pose or appearance.
High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs
A new method for synthesizing high-resolution photo-realistic images from semantic label maps using conditional generative adversarial networks (conditional GANs) is presented, which significantly outperforms existing methods, advancing both the quality and the resolution of deep image synthesis and editing.
Multi-View Image Generation from a Single-View
This paper proposes a novel image generation model termed VariGANs, which combines the merits of the variational inference and the Generative Adversarial Networks (GANs), and generates the target image in a coarse-to-fine manner instead of a single pass which suffers from severe artifacts.
Multi-view 3D Models from Single Images with a Convolutional Network
A convolutional network capable of inferring a 3D representation of a previously unseen object given a single image of this object; several depth maps fused together yield a full point cloud of the object.
SSD: Single Shot MultiBox Detector
The approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location, which makes SSD easy to train and straightforward to integrate into systems that require a detection component.
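The SSD summary above mentions discretizing the output space into default boxes over different aspect ratios and scales per feature-map location. As a minimal sketch of that idea (following the SSD paper's formulas: centers at ((i+0.5)/f, (j+0.5)/f), width s·√a and height s/√a for scale s and aspect ratio a; variable names here are illustrative):

```python
# Generate SSD-style default (anchor) boxes for one square feature map.
# Boxes are (cx, cy, w, h) in relative [0, 1] image coordinates.
import math
from itertools import product


def default_boxes(fmap_size: int, scale: float, aspect_ratios):
    boxes = []
    for i, j in product(range(fmap_size), repeat=2):
        # Box center sits at the middle of cell (i, j).
        cx, cy = (j + 0.5) / fmap_size, (i + 0.5) / fmap_size
        for a in aspect_ratios:
            # Wider aspect ratios stretch width and shrink height, keeping area = scale^2.
            boxes.append((cx, cy, scale * math.sqrt(a), scale / math.sqrt(a)))
    return boxes


boxes = default_boxes(fmap_size=2, scale=0.2, aspect_ratios=(1.0, 2.0))
print(len(boxes))  # 2 * 2 cells * 2 ratios = 8 boxes
```

In the full detector, boxes from several feature maps at different scales are pooled, and the network regresses offsets plus class scores for each default box.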