Click Here: Human-Localized Keypoints as Guidance for Viewpoint Estimation

  • Ryan Szeto, Jason J. Corso
  • Published 29 March 2017
  • Computer Science
  • 2017 IEEE International Conference on Computer Vision (ICCV)
We motivate and address a human-in-the-loop variant of the monocular viewpoint estimation task in which the location and class of one semantic object keypoint is available at test time. In order to leverage the keypoint information, we devise a Convolutional Neural Network called Click-Here CNN (CH-CNN) that integrates the keypoint information with activations from the layers that process the image. It transforms the keypoint information into a 2D map that can be used to weigh features from… 
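The idea of turning a clicked keypoint into a 2D map that weighs image features can be illustrated with a minimal sketch. This is not the CH-CNN architecture itself: the function names, the Chebyshev-distance scoring, the softmax normalization, and the `temperature` parameter are all assumptions chosen for illustration, and the keypoint class is ignored here.

```python
import math

def keypoint_weight_map(height, width, click_row, click_col, temperature=1.0):
    """Build a height x width map of weights that sum to 1 and peak at the click.

    Assumption for illustration: scores fall off with the Chebyshev distance
    to the clicked location and are normalized with a softmax.
    """
    scores = [
        [-max(abs(r - click_row), abs(c - click_col)) / temperature
         for c in range(width)]
        for r in range(height)
    ]
    peak = max(max(row) for row in scores)  # subtract the max for numerical stability
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]

def weighted_pool(features, weights):
    """Collapse each channel (an H x W grid) to one value using the weight map."""
    return [
        sum(w * f
            for w_row, f_row in zip(weights, channel)
            for w, f in zip(w_row, f_row))
        for channel in features
    ]

# Hypothetical 2-channel, 3x3 feature map and a click at the center cell.
features = [
    [[1.0, 1.0, 1.0], [1.0, 5.0, 1.0], [1.0, 1.0, 1.0]],
    [[2.0, 2.0, 2.0], [2.0, 2.0, 2.0], [2.0, 2.0, 2.0]],
]
weights = keypoint_weight_map(3, 3, click_row=1, click_col=1)
pooled = weighted_pool(features, weights)
```

Because the weights sum to 1, a channel that is constant everywhere pools to that constant, while a channel with a strong response near the click is pulled toward that response.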


StarMap for Category-Agnostic Keypoint and Viewpoint Estimation
A category-agnostic keypoint representation is proposed that combines a multi-peak heatmap for all the keypoints with their corresponding features as 3D locations in the canonical viewpoint defined for each instance; it demonstrates competitive performance in keypoint detection and localization compared to category-specific state-of-the-art methods.
Cross-Object Viewpoint Estimation via Domain Adaptation
A framework is proposed that learns an embedding invariant to both the synthesized-or-real domain and the object class; it discourages the learned embedding from encoding domain or class information by reversing the gradient during back-propagation in training.
An Appearance-and-Structure Fusion Network for Object Viewpoint Estimation
This paper proposes a novel Appearance-and-Structure Fusion network, called ASFnet, that estimates viewpoint by fusing appearance and structure information and outperforms state-of-the-art methods on the public PASCAL 3D+ dataset.
C-Reference: Improving 2D to 3D Object Pose Estimation Accuracy via Crowdsourced Joint Object Estimation
A crowd-machine hybrid approach that jointly uses crowds' approximate measurements of multiple in-scene objects to estimate the 3D state of a single target object and can reduce errors in the target object's 3D location estimation by over 40%, while requiring only 35% as much human time.
Conservative Wasserstein Training for Pose Estimation
This paper systematically concludes the practical closed-form solution of the Wasserstein distance for pose data with either a one-hot or a conservative target label, in particular using a convex mapping function for the ground metric, a conservative label, and the closed-form solution.
Synthetic Depth Transfer for Monocular 3D Object Pose Estimation in the Wild
A deep convolutional neural network is proposed with an RGB-to-Depth Embedding module and a Synthetic-Real Adaptation module to extract RGB and depth features from a single RGB image with the help of synthetic RGB-depth image pairs for object pose estimation.
Adviser Networks: Learning What Question to Ask for Human-In-The-Loop Viewpoint Estimation
This work forms a solution to the adviser problem using a deep network and applies it to the viewpoint estimation problem where the question asks for the location of a specific keypoint in the input image, and is able to outperform the previous hybrid-intelligence state-of-the-art.
NeMo: Neural Mesh Models of Contrastive Features for Robust 3D Pose Estimation
This work proposes to integrate deep neural networks with 3D generative representations of objects into a unified neural architecture that is termed NeMo, which learns a generative model of neural feature activations at each vertex on a dense 3D mesh.
Monocular Viewpoints Estimation for Generic Objects in the Wild
This paper proposes the Viewpoint Discernibility Matrix (VDM) loss, which is more suitable than the one-hot cross-entropy loss because it tolerates sub-optimal predictions and penalizes wrong predictions on ambiguous viewpoints, and the Auxiliary Hierarchical Viewpoints Supervision (AHVS) method, which is able to restrain the network to pay closer attention to the features of ambiguous viewpoints.
DAER to Reject Seeds with Dual-loss Additional Error Regression
This work proposes a novel training method and evaluation metrics for the seed rejection problem, and validates these metrics and methods on two problems that use seeds as a source of additional information: keypoint-conditioned viewpoint estimation with crowdsourced seeds and hierarchical scene classification with automated seeds.


Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views
A scalable and overfit-resistant image synthesis pipeline, together with a novel CNN specifically tailored for the viewpoint estimation task, is proposed that can significantly outperform state-of-the-art methods on the PASCAL 3D+ benchmark.
SSD: Single Shot MultiBox Detector
The approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location, which makes SSD easy to train and straightforward to integrate into systems that require a detection component.
Parsing IKEA Objects: Fine Pose Estimation
This work addresses the problem of localizing and estimating the fine-pose of objects in the image with exact 3D models by using local keypoint detectors to find candidate poses and score global alignment of each candidate pose to the image.
Monocular 3D Object Detection for Autonomous Driving
This work proposes an energy minimization approach that places object candidates in 3D using the fact that objects should be on the ground-plane, and achieves the best detection performance on the challenging KITTI benchmark, among published monocular competitors.
Best of both worlds: Human-machine collaboration for object annotation
This paper empirically validates the effectiveness of the human-in-the-loop labeling approach on the ILSVRC2014 object detection dataset and seamlessly integrates multiple computer vision models with multiple sources of human input in a Markov Decision Process.
Beyond PASCAL: A benchmark for 3D object detection in the wild
The PASCAL3D+ dataset is contributed, a novel and challenging dataset for 3D object detection and pose estimation with, on average, more than 3,000 object instances per category.
3D Object Proposals for Accurate Object Class Detection
This method exploits stereo imagery to place proposals in the form of 3D bounding boxes in the context of autonomous driving and outperforms all existing results on all three KITTI object classes.
Very Deep Convolutional Networks for Large-Scale Image Recognition
This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Multiclass recognition and part localization with humans in the loop
A visual recognition system designed for fine-grained visual categorization that, by leveraging computer vision and analyzing the user responses, achieves a significant average reduction in human effort over previous methods.
Teaching 3D geometry to deformable part models
This paper extends the successful discriminatively trained deformable part models to include both estimates of viewpoint and 3D parts that are consistent across viewpoints, and experimentally verifies that adding 3D geometric information comes at minimal performance loss w.r.t.