• Corpus ID: 53643967

Unsupervised Learning of Object Landmarks through Conditional Image Generation

Tomas Jakab, Ankush Gupta, Hakan Bilen, Andrea Vedaldi
Published in Neural Information Processing Systems
We propose a method for learning landmark detectors for visual objects (such as the eyes and the nose in a face) without any manual supervision. We cast this as the problem of generating images that combine the appearance of the object as seen in a first example image with the geometry of the object as seen in a second example image, where the two examples differ by a viewpoint change and/or an object deformation. In order to factorize appearance and geometry, we introduce a tight bottleneck in… 
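The abstract is cut off before it describes the bottleneck, so the following is only a minimal sketch of one way such a bottleneck could work, not the authors' implementation. It assumes the geometry of the second image is reduced to a few 2D points by a spatial soft-argmax over score maps, and that each point is then re-rendered as a Gaussian map for a generator to consume alongside appearance features from the first image. The function names `soft_argmax`, `render_gaussian`, and `geometry_bottleneck`, and the `sigma` parameter, are hypothetical.

```python
import math

def soft_argmax(heatmap):
    """Reduce one H x W score map to a single (row, col) point:
    the softmax-weighted mean position."""
    peak = max(max(row) for row in heatmap)  # subtract the peak for numerical stability
    weights = [[math.exp(v - peak) for v in row] for row in heatmap]
    total = sum(sum(row) for row in weights)
    r = sum(i * w for i, row in enumerate(weights) for w in row) / total
    c = sum(j * w for row in weights for j, w in enumerate(row)) / total
    return r, c

def render_gaussian(point, height, width, sigma=1.0):
    """Turn a (row, col) point back into an H x W map centred on that point."""
    r0, c0 = point
    return [[math.exp(-((i - r0) ** 2 + (j - c0) ** 2) / (2 * sigma ** 2))
             for j in range(width)] for i in range(height)]

def geometry_bottleneck(score_maps, sigma=1.0):
    """K score maps -> K points -> K Gaussian maps (assumed bottleneck form)."""
    h, w = len(score_maps[0]), len(score_maps[0][0])
    points = [soft_argmax(m) for m in score_maps]
    return points, [render_gaussian(p, h, w, sigma) for p in points]

# Toy score map for the second image: one strong response at row 1, col 3.
scores = [[0.0] * 5 for _ in range(4)]
scores[1][3] = 20.0
points, maps = geometry_bottleneck([scores])
```

Because only the point coordinates pass through, a map rendered this way carries position but essentially no texture or colour, which is the sense in which such a bottleneck would force appearance to come from the first image.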


Learning Landmarks from Unaligned Data using Image Translation

Modifications of the landmark detection model are shown to improve the quality of the learned detector leading to state-of-the-art unsupervised landmark detection performance in a number of challenging human pose and facial landmark detection benchmarks.

BRULÉ: Barycenter-Regularized Unsupervised Landmark Extraction

On Equivariant and Invariant Learning of Object Landmark Representations

It is shown that when a deep network is trained to be invariant to geometric and photometric transformations, representations emerge from its intermediate layers that are highly predictive of object landmarks.

Unsupervised Discovery of Object Landmarks via Contrastive Learning

It is shown that when a deep network is trained to be invariant to geometric and photometric transformations, representations from its intermediate layers are highly predictive of object landmarks and by stacking representations across layers in a hypercolumn their effectiveness can be improved.

Unsupervised Landmark Learning from Unpaired Data

A cross-image cycle consistency framework is proposed which applies the swapping-reconstruction strategy twice to obtain the final supervision and is shown to outperform strong baselines by a large margin.

Self-Supervised Viewpoint Learning From Image Collections

This work proposes a novel learning framework which incorporates an analysis-by-synthesis paradigm to reconstruct images in a viewpoint aware manner with a generative network, along with symmetry and adversarial constraints to successfully supervise the authors' viewpoint estimation network.

Unsupervised Learning of Landmarks by Descriptor Vector Exchange

A new perspective on the equivariance approach is developed by noting that dense landmark detectors can be interpreted as local image descriptors equipped with invariance to intra-category variations, and proposing a direct method to enforce such an invariance in the standard equivariant loss.

Self-Supervised Learning of Interpretable Keypoints From Unlabelled Videos

A new method is proposed for recognizing the pose of objects from a single image. For learning, it uses only unlabelled videos and a weak empirical prior on the object poses, and it achieves state-of-the-art performance among methods that do not require any labelled images for training.

Unsupervised Disentanglement of Pose, Appearance and Background from Images and Videos

The proposed factorization results in landmarks that are focused on the foreground object of interest when measured against ground-truth foreground masks, and the rendered background quality is improved as ill-suited landmarks are no longer forced to model this content.

GANSeg: Learning to Segment by Unsupervised Hierarchical Image Generation

This work proposes a GAN-based approach that generates images conditioned on latent masks, thereby alleviating the need for the full or weak annotations required by previous approaches, and shows that such mask-conditioned image generation can be learned faithfully when the masks are conditioned hierarchically on 2D latent points that define the position of parts explicitly.



Unsupervised object learning from dense equivariant image labelling

A new approach is proposed that, given a large number of images of an object and no other supervision, can extract a dense object-centric coordinate frame that is invariant to deformations of the images and comes with a dense equivariant labelling neural network that can map image pixels to their corresponding object coordinates.

Learning What and Where to Draw

This work proposes a new model, the Generative Adversarial What-Where Network (GAWWN), that synthesizes images given instructions describing what content to draw in which location, and shows high-quality 128 x 128 image synthesis on the Caltech-UCSD Birds dataset.

Deforming Autoencoders: Unsupervised Disentangling of Shape and Appearance

A more powerful form of unsupervised disentangling becomes possible in template coordinates, allowing us to successfully decompose face images into shading and albedo, and further manipulate face images.

Unsupervised Discovery of Object Landmarks as Structural Representations

This paper proposes an autoencoding formulation to discover landmarks as explicit structural representations, which naturally creates an unsupervised, perceptible interface to manipulate object shapes and decode images with controllable structures.

Self-supervised learning of a facial attribute embedding from video

A network is introduced that is trained to embed multiple frames from the same video face-track into a common low-dimensional space and learns a meaningful face embedding that encodes information about head pose, facial landmarks and facial expression, without having been supervised with any labelled data.

Learning Deep Representation for Face Alignment with Auxiliary Attributes

A novel tasks-constrained deep model is formulated, which not only learns the inter-task correlation but also employs dynamic task coefficients to facilitate the optimization convergence when learning multiple complex tasks.

Image-to-Image Translation with Conditional Adversarial Networks

Conditional adversarial networks are investigated as a general-purpose solution to image-to-image translation problems and it is demonstrated that this approach is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks.

Flowing ConvNets for Human Pose Estimation in Videos

This work proposes a ConvNet architecture that is able to benefit from temporal context by combining information across the multiple frames using optical flow and outperforms a number of others, including one that uses optical flow solely at the input layers, one that regresses joint coordinates directly, and one that predicts heatmaps without spatial fusion.

Coarse-to-Fine Auto-Encoder Networks (CFAN) for Real-Time Face Alignment

This paper proposes a Coarse-to-Fine Auto-encoder Networks (CFAN) approach, which cascades a few successive Stacked Auto-Encoding Networks (SANs) so that the first SAN predicts the landmarks quickly but accurately enough as a preliminary, by taking as input a low-resolution version of the detected face holistically.

Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network

SRGAN, a generative adversarial network (GAN) for image super-resolution (SR), is presented; to the authors' knowledge, it is the first framework capable of inferring photo-realistic natural images for 4x upscaling factors. It is trained with a perceptual loss function which consists of an adversarial loss and a content loss.