Efficient Object Annotation via Speaking and Pointing

Michael Gygli and Vittorio Ferrari
International Journal of Computer Vision
Deep neural networks deliver state-of-the-art visual recognition, but they rely on large datasets, which are time-consuming to annotate. These datasets are typically annotated in two stages: (1) determining the presence of object classes at the image level and (2) marking the spatial extent for all objects of these classes. In this work we use speech, together with mouse inputs, to speed up this process. We first improve stage one, by letting annotators indicate object class presence via speech…
Pointly-Supervised Instance Segmentation
Existing instance segmentation models developed for full mask supervision, such as Mask R-CNN, can be trained seamlessly with point-based annotation without any major modifications, making high-quality instance segmentation more accessible for new data.
Self-Supervised Learning to Detect Key Frames in Videos
The method comprises a two-stream ConvNet and a novel automatic annotation architecture able to reliably annotate key frames in a video for self-supervised learning of the ConvNet, which learns deep appearance and motion features to detect frames that are unique.
Heuristics2Annotate: Efficient Annotation of Large-Scale Marathon Dataset For Bounding Box Regression
The proposed annotation framework reduces the annotation cost of the dataset by a factor of 16, effectively aligns 93.64% of the runners in the cross-camera setting, and introduces a novel way of aligning the identity of runners in disjoint cameras.
The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines
This paper details how this large-scale dataset was captured by 32 participants in their native kitchen environments, and densely annotated with actions and object interactions, and introduces new baselines that highlight the multimodal nature of the dataset and the importance of explicit temporal modelling to discriminate fine-grained actions.
Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100
This paper introduces the pipeline to extend the largest dataset in egocentric vision, EPIC-KITCHENS. The effort culminates in EPIC-KITCHENS-100, a collection of 100 hours, 20M frames, 90K actions in…
Connecting Vision and Language with Localized Narratives
An extensive analysis of Localized Narratives is provided, showing that they are diverse, accurate, and efficient to produce, and their utility for controlled image captioning is demonstrated.
Rescaling Egocentric Vision
This paper introduces EPIC-KITCHENS-100, the largest annotated egocentric dataset - 100 hrs, 20M frames, 90K actions - of wearable videos capturing long-term unscripted activities in 45 environments, using a novel annotation pipeline that allows denser and more complete annotations of fine-grained actions.


Fast Object Class Labelling via Speech
  • Michael Gygli, V. Ferrari
  • Computer Science
  • 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2019
This work proposes a new interface where classes are annotated via speech, which is fast and allows for direct access to the class name, without searching through a list or hierarchy, and yields high-quality annotations at 2.3x to 14.9x less annotation time than existing methods.
Extreme Clicking for Efficient Object Annotation
This work proposes extreme clicking: asking the annotator to click on four physical points on the object: the top, bottom, left-most and right-most points. This task is more natural, the points are easy to find, and the clicks yield not only box coordinates but also four accurate boundary points.
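As a rough illustration of how four extreme clicks determine a bounding box, the following minimal Python sketch takes the smallest and largest coordinates of the clicked points. The function name `box_from_extreme_points`, the `(x, y)` pixel convention, and the sample clicks are assumptions made for this example, not the authors' implementation.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]  # (x, y) in image pixel coordinates (assumed convention)


def box_from_extreme_points(points: Iterable[Point]) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) enclosing the four clicked points."""
    pts = list(points)
    if len(pts) != 4:
        raise ValueError("expected exactly four extreme points")
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    # The left-most and right-most clicks fix the x extent;
    # the top and bottom clicks fix the y extent.
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical clicks: top, bottom, left-most, right-most points of an object.
clicks = [(120.0, 40.0), (135.0, 210.0), (60.0, 130.0), (200.0, 118.0)]
box = box_from_extreme_points(clicks)
print(box)
```

Because the box is derived from minima and maxima, the order in which the annotator clicks the four points does not matter in this sketch.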
Where are the Blobs: Counting by Localization with Point Supervision
This work proposes a detection-based method that does not need to estimate the size and shape of the objects and that outperforms regression-based methods and even outperforms those that use stronger supervision such as depth features, multi-point annotations, and bounding-box labels.
Spot On: Action Localization from Pointly-Supervised Proposals
An overlap measure between action proposals and points is introduced and incorporated into the objective of a non-convex Multiple Instance Learning optimization; the approach is shown to be competitive with the state of the art.
Training Object Class Detectors with Click Supervision
This paper greatly reduces annotation time by proposing center-click annotations: it asks annotators to click on the center of an imaginary bounding box which tightly encloses the object instance.
Microsoft COCO: Common Objects in Context
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene…
Best of both worlds: Human-machine collaboration for object annotation
This paper empirically validates the effectiveness of the human-in-the-loop labeling approach on the ILSVRC2014 object detection dataset and seamlessly integrates multiple computer vision models with multiple sources of human input in a Markov Decision Process.
Object-Centric Spatial Pooling for Image Classification
A framework that learns object detectors using only image-level class labels, so-called weak labels, is proposed; it is comparable in accuracy with state-of-the-art weakly supervised detection methods and significantly outperforms SPM-based pooling in image classification.
Object Referring in Visual Scene with Spoken Language
This paper investigates Object Referring with Spoken Language (ORSpoken) by presenting two datasets and one novel approach, showing the efficacy of the proposed vision-language interaction methods in counteracting background noise.
What's the Point: Semantic Segmentation with Point Supervision
This work takes a natural step from image-level annotation towards stronger supervision: it asks annotators to point to an object if one exists, and incorporates this point supervision along with a novel objectness potential in the training loss function of a CNN model.