Click Here: Human-Localized Keypoints as Guidance for Viewpoint Estimation

@inproceedings{Szeto2017ClickHH,
  title={Click Here: Human-Localized Keypoints as Guidance for Viewpoint Estimation},
  author={Ryan Szeto and Jason J. Corso},
  booktitle={2017 IEEE International Conference on Computer Vision (ICCV)},
  year={2017},
  pages={1604--1613}
}
  • Ryan Szeto, Jason J. Corso
  • Published 29 March 2017
  • Computer Science
  • 2017 IEEE International Conference on Computer Vision (ICCV)
We motivate and address a human-in-the-loop variant of the monocular viewpoint estimation task in which the location and class of one semantic object keypoint is available at test time. In order to leverage the keypoint information, we devise a Convolutional Neural Network called Click-Here CNN (CH-CNN) that integrates the keypoint information with activations from the layers that process the image. It transforms the keypoint information into a 2D map that can be used to weigh features from… 
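The abstract's idea of turning a keypoint location into a 2D map that weighs image features can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the Gaussian bump, the function names `keypoint_weight_map` and `weigh_features`, the `sigma` parameter, and the 13x13 feature size are hypothetical stand-ins and are not taken from the paper, which also uses the keypoint class and learns its map inside the network.

```python
import numpy as np

def keypoint_weight_map(height, width, kp_row, kp_col, sigma=2.0):
    """Build a normalized 2D weight map peaked at the clicked keypoint.

    Assumption: a fixed Gaussian bump stands in for whatever map the
    network would produce from the keypoint information.
    """
    rows = np.arange(height)[:, None]
    cols = np.arange(width)[None, :]
    sq_dist = (rows - kp_row) ** 2 + (cols - kp_col) ** 2
    weights = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return weights / weights.sum()  # weights sum to 1 over the grid

def weigh_features(features, weight_map):
    """Weigh (C, H, W) feature maps by an (H, W) map and sum spatially."""
    return (features * weight_map[None, :, :]).sum(axis=(1, 2))

# Hypothetical 8-channel, 13x13 activations from an image-processing layer.
features = np.random.default_rng(0).normal(size=(8, 13, 13))
wmap = keypoint_weight_map(13, 13, kp_row=4, kp_col=9)
pooled = weigh_features(features, wmap)  # one weighted value per channel
```

In this sketch, locations near the keypoint dominate the pooled vector, which is one simple way keypoint guidance could emphasize the relevant image region.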

StarMap for Category-Agnostic Keypoint and Viewpoint Estimation
TLDR
A category-agnostic keypoint representation that combines a multi-peak heatmap for all the keypoints with their corresponding features as 3D locations in the canonical viewpoint defined for each instance; it demonstrates competitive performance in keypoint detection and localization compared to category-specific state-of-the-art methods.
Cross-Object Viewpoint Estimation via Domain Adaptation
TLDR
A framework is proposed that learns an embedding invariant to both the synthesized-or-real domain and the object class; it discourages the learned embedding from encoding domain or class information by reversing the gradient during back-propagation in training.
An Appearance-and-Structure Fusion Network for Object Viewpoint Estimation
TLDR
A novel Appearance-and-Structure Fusion network, called ASFnet, is proposed that estimates viewpoint by fusing appearance and structure information; it outperforms state-of-the-art methods on the public PASCAL 3D+ dataset.
Semantic Part Detection via Matching: Learning to Generalize to Novel Viewpoints From Limited Training Data
TLDR
This paper presents an approach which can learn from a small annotated dataset containing a limited range of viewpoints and generalize to detect semantic parts for a much larger range of viewpoints.
C-Reference: Improving 2D to 3D Object Pose Estimation Accuracy via Crowdsourced Joint Object Estimation
TLDR
A crowd-machine hybrid approach that jointly uses crowds' approximate measurements of multiple in-scene objects to estimate the 3D state of a single target object and can reduce errors in the target object's 3D location estimation by over 40%, while requiring only 35% as much human time.
Semantic translation with convolutional encoder-decoder networks for viewpoint estimation
TLDR
A new pipeline of viewpoint estimation is proposed, introducing semantic translation methods to highlight the structures of interest (SOIs) as foregrounds, and a convolutional encoder-decoder network is applied as the generator of semantic segmentation.
Conservative Wasserstein Training for Pose Estimation
TLDR
This paper systematically concludes the practical closed-form solution of the Wasserstein distance for pose data with either a one-hot or a conservative target label, especially using a convex mapping function for the ground metric, a conservative label, and a closed-form solution.
Synthetic Depth Transfer for Monocular 3D Object Pose Estimation in the Wild
TLDR
A deep convolutional neural network is proposed with an RGB-to-Depth Embedding module and a Synthetic-Real Adaptation module to extract RGB and depth features from a single RGB image with the help of synthetic RGB-depth image pairs for object pose estimation.
Adviser Networks: Learning What Question to Ask for Human-In-The-Loop Viewpoint Estimation
TLDR
This work forms a solution to the adviser problem using a deep network and applies it to the viewpoint estimation problem where the question asks for the location of a specific keypoint in the input image, and is able to outperform the previous hybrid-intelligence state-of-the-art.
Ground-truth or DAER: Selective Re-query of Secondary Information
  • Stephan J. Lemmer
  • Computer Science
  • 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
  • 2021
TLDR
This work proposes the problem of seed rejection—determining whether to reject a seed based on the expected performance degradation when it is provided in place of a gold-standard seed, and provides a formal definition to this problem.
...

References

SHOWING 1-10 OF 32 REFERENCES
Deep Supervision with Shape Concepts for Occlusion-Aware 3D Object Parsing
TLDR
A deep convolutional neural network architecture is presented that localizes semantic parts in a 2D image and in 3D space while inferring their visibility states, given a single RGB image.
Viewpoints and keypoints
TLDR
Pose estimation for rigid objects is characterized in terms of determining viewpoint to explain coarse pose and predicting keypoints to capture the finer details, and it is demonstrated that leveraging viewpoint estimates can substantially improve local appearance based keypoint predictions.
Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views
TLDR
A scalable and overfit-resistant image synthesis pipeline, together with a novel CNN specifically tailored for the viewpoint estimation task, is proposed that can significantly outperform state-of-the-art methods on the PASCAL 3D+ benchmark.
SSD: Single Shot MultiBox Detector
TLDR
The approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location, which makes SSD easy to train and straightforward to integrate into systems that require a detection component.
Single Image 3D Interpreter Network
TLDR
This work proposes 3D INterpreter Network (3D-INN), an end-to-end framework which sequentially estimates 2D keypoint heatmaps and 3D object structure, trained on both real 2D-annotated images and synthetic 3D data, and achieves state-of-the-art performance on both 2D keypoint estimation and 3D structure recovery.
Parsing IKEA Objects: Fine Pose Estimation
TLDR
This work addresses the problem of localizing and estimating the fine-pose of objects in the image with exact 3D models by using local keypoint detectors to find candidate poses and score global alignment of each candidate pose to the image.
Monocular 3D Object Detection for Autonomous Driving
TLDR
This work proposes an energy minimization approach that places object candidates in 3D using the fact that objects should be on the ground-plane, and achieves the best detection performance on the challenging KITTI benchmark, among published monocular competitors.
Best of both worlds: Human-machine collaboration for object annotation
TLDR
This paper empirically validates the effectiveness of the human-in-the-loop labeling approach on the ILSVRC2014 object detection dataset and seamlessly integrates multiple computer vision models with multiple sources of human input in a Markov Decision Process.
Click Carving: Segmenting Objects in Video with Point Clicks
TLDR
A novel form of interactive video object segmentation where a few clicks by the user help the system produce a full spatio-temporal segmentation of the object of interest that outperforms all similarly fast methods, and is competitive or better than those requiring 2 to 12 times the effort.
Beyond PASCAL: A benchmark for 3D object detection in the wild
TLDR
The PASCAL3D+ dataset is contributed, a novel and challenging dataset for 3D object detection and pose estimation in which, on average, there are more than 3,000 object instances per category.