Unidentified Video Objects: A Benchmark for Dense, Open-World Segmentation

@article{Wang2021UnidentifiedVO,
  title={Unidentified Video Objects: A Benchmark for Dense, Open-World Segmentation},
  author={Weiyao Wang and Matt Feiszli and Heng Wang and Du Tran},
  journal={2021 IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2021},
  pages={10756-10765}
}
Current state-of-the-art object detection and segmentation methods work well under the closed-world assumption, which presumes that the list of object categories is available during training and deployment. However, many real-world applications require detecting or segmenting novel objects, i.e., object categories never seen during training. In this paper, we present UVO (Unidentified Video Objects), a new benchmark for open-world class-agnostic object segmentation in videos…
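
To make the open-world, class-agnostic setting above concrete, here is a minimal sketch (not the official UVO tooling) that converts COCO-style instance annotations into a class-agnostic form by collapsing every category into a single generic "object" label; the field names follow the COCO JSON convention and are assumptions about the annotation format.

import json

def to_class_agnostic(in_path, out_path):
    # Collapse all semantic categories into one "object" category while
    # keeping every instance mask/box; a class-agnostic benchmark then
    # evaluates mask quality and recall rather than classification.
    with open(in_path) as f:
        data = json.load(f)
    data["categories"] = [{"id": 1, "name": "object"}]
    for ann in data["annotations"]:
        ann["category_id"] = 1
    with open(out_path, "w") as f:
        json.dump(data, f)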

EPIC-KITCHENS VISOR Benchmark: VIdeo Segmentations and Object Relations

We introduce VISOR, a new dataset of pixel annotations and a benchmark suite for segmenting hands and active objects in egocentric video. VISOR annotates videos from EPIC-KITCHENS, which comes with a…

Learning to Detect Every Thing in an Open World

TLDR
A new data augmentation method, BackErase, is developed; it pastes annotated objects onto a background sampled from a small region of the original image, so that hidden (unannotated) objects are not suppressed as background.
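
The following is a hedged sketch of that augmentation idea, assuming the essential steps are (a) upscaling a small region of the image into a full-size background, which erases unannotated objects, and (b) pasting the annotated instances back with their masks; the function and parameter names are illustrative, not the paper's code.

import numpy as np

def back_erase(image, instance_masks, region_size=0.25, rng=None):
    # image: (H, W, 3) array; instance_masks: list of (H, W) bool arrays.
    rng = np.random.default_rng() if rng is None else rng
    H, W = image.shape[:2]
    rh, rw = int(H * region_size), int(W * region_size)
    y0 = rng.integers(0, H - rh + 1)
    x0 = rng.integers(0, W - rw + 1)
    region = image[y0:y0 + rh, x0:x0 + rw]

    # Nearest-neighbour upscale of the small region to a full-size background,
    # which wipes out any unlabeled objects elsewhere in the frame.
    ys = np.linspace(0, rh - 1, H).astype(int)
    xs = np.linspace(0, rw - 1, W).astype(int)
    background = region[ys][:, xs]

    # Paste the annotated objects back; their masks remain the ground truth.
    out = background.copy()
    for mask in instance_masks:
        out[mask] = image[mask]
    return out, instance_masks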

BURST: A Benchmark for Unifying Object Recognition, Segmentation and Tracking in Video

Multiple existing benchmarks involve tracking and segmenting objects in video, e.g., Video Object Segmentation (VOS) and Multi-Object Tracking and Segmentation (MOTS), but there is little interaction…

FreeSOLO: Learning to Segment Objects without Annotations

TLDR
This work proposes a fully unsupervised method that learns class-agnostic instance segmentation without any annotations, and presents a novel localization-aware pre-training framework in which objects can be discovered from complicated scenes in an unsupervised manner.

Occluded Video Instance Segmentation: A Benchmark

TLDR
A simple plug-and-play module that performs temporal feature calibration to complement missing object cues caused by occlusion is presented, and a remarkable AP improvement on the OVIS dataset is obtained.

Open-World Instance Segmentation: Exploiting Pseudo Ground Truth From Learned Pairwise Affinity

TLDR
A novel approach for mask proposals, Generic Grouping Networks (GGNs), constructed without semantic supervision is proposed; it combines a local measure of pixel affinity with instance-level mask supervision, producing a training regimen designed to make the model as generic as the data diversity allows.

PIZZA: A Powerful Image-only Zero-Shot Zero-CAD Approach to 6 DoF Tracking

TLDR
A method for tracking the 6D motion of objects in RGB video sequences when neither the training images nor the 3D geometry of the objects is available; it can handle unknown objects in the open world immediately, without requiring any prior information or a specific training phase.

Single-Stage Open-world Instance Segmentation with Cross-task Consistency Regularization

TLDR
This paper proposes a single-stage framework to produce a mask for each instance directly, and discovers that the proposed cross-task consistency loss can be applied to images without any annotation, lending itself to a semi-supervised learning method.

Is an Object-Centric Video Representation Beneficial for Transfer?

TLDR
The object-centric model outperforms prior video representations (both object-agnostic and object-aware) when: (1) classifying actions on unseen objects and in unseen environments; (2) low-shot learning of novel classes; (3) linear probing on other downstream tasks; as well as for standard action classification.

Global Tracking Transformers

TLDR
Experiments on the challenging TAO dataset show that the framework consistently improves upon baselines based on pairwise association, outperforming published works by a significant margin of roughly 7 points.

References


Space-Time Memory Networks for Video Object Segmentation With User Guidance

TLDR
This work considers two scenarios of user-guided segmentation: semi-supervised and interactive segmentation, and proposes a novel and unified solution by leveraging memory networks and learning to read relevant information from all available sources.
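
A minimal numpy sketch of the memory-read operation such a space-time memory network performs: key features of the current frame attend over the keys of all memorized frames and retrieve the corresponding values. The shapes and the softmax scaling are assumptions for illustration, not the paper's exact formulation.

import numpy as np

def memory_read(query_key, memory_keys, memory_values):
    # query_key:     (C_k, H*W)    key features of the current frame
    # memory_keys:   (C_k, T*H*W)  keys of all memorized frames
    # memory_values: (C_v, T*H*W)  values of all memorized frames
    # Returns value features retrieved for every query location, (C_v, H*W).
    affinity = query_key.T @ memory_keys                # (H*W, T*H*W)
    affinity = affinity / np.sqrt(query_key.shape[0])   # scaled dot product
    affinity -= affinity.max(axis=1, keepdims=True)     # numerical stability
    weights = np.exp(affinity)
    weights /= weights.sum(axis=1, keepdims=True)       # softmax over memory
    return memory_values @ weights.T                    # (C_v, H*W)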

Video Instance Segmentation

TLDR
The image instance segmentation problem is extended to the video domain for the first time, and a novel algorithm called MaskTrack R-CNN is proposed for this task of simultaneous detection, segmentation, and tracking of instances in videos.

The Pascal Visual Object Classes Challenge: A Retrospective

TLDR
A review of the Pascal Visual Object Classes challenge from 2008-2012 and an appraisal of the aspects of the challenge that worked well, and those that could be improved in future challenges.

Efficient hierarchical graph-based video segmentation

TLDR
An efficient and scalable technique for spatiotemporal segmentation of long video sequences using a hierarchical graph-based algorithm that generates high quality segmentations, which are temporally coherent with stable region boundaries, and allows subsequent applications to choose from varying levels of granularity.
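
Below is a toy sketch of the underlying idea, assuming grey-level frames: voxels are graph nodes, edges connect each voxel to its +t/+y/+x neighbours weighted by intensity difference, and the cheapest edges are merged first with a union-find, with larger thresholds producing coarser levels of the hierarchy. This is a simplified Felzenszwalb-style merge, not the paper's algorithm.

import numpy as np

def segment_video(volume, thresholds=(5.0, 15.0, 40.0)):
    # volume: (T, H, W) float array of grey-level frames.
    # Returns one label volume per threshold, coarser as the threshold grows.
    T, H, W = volume.shape
    idx = np.arange(T * H * W).reshape(T, H, W)
    parent = np.arange(T * H * W)

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    # Edges to the +t, +y and +x neighbours, weighted by |intensity difference|.
    edges = []
    for axis in range(3):
        sl_a = [slice(None)] * 3
        sl_b = [slice(None)] * 3
        sl_a[axis] = slice(0, -1)
        sl_b[axis] = slice(1, None)
        w = np.abs(volume[tuple(sl_a)] - volume[tuple(sl_b)]).ravel()
        edges.append(np.stack([w,
                               idx[tuple(sl_a)].ravel().astype(float),
                               idx[tuple(sl_b)].ravel().astype(float)], axis=1))
    edges = np.concatenate(edges)
    edges = edges[np.argsort(edges[:, 0])]  # merge cheapest edges first

    hierarchy = []
    for tau in sorted(thresholds):
        for w, a, b in edges:
            if w > tau:
                break
            ra, rb = find(int(a)), find(int(b))
            if ra != rb:
                parent[rb] = ra             # merge the two regions
        labels = np.array([find(i) for i in range(T * H * W)]).reshape(T, H, W)
        hierarchy.append(labels)
    return hierarchy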

A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation

TLDR
This work presents a new benchmark dataset and evaluation methodology for the area of video object segmentation, named DAVIS (Densely Annotated VIdeo Segmentation), and provides a comprehensive analysis of several state-of-the-art segmentation approaches using three complementary metrics.
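
As a concrete example of one of those metrics, here is a small sketch of region similarity J, the Jaccard index (IoU) between a predicted and a ground-truth mask for one frame and one object; the boundary measure F and temporal stability T are omitted.

import numpy as np

def region_similarity(pred_mask, gt_mask):
    # Both masks are (H, W) boolean arrays.
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:                      # both masks empty: define J = 1
        return 1.0
    return np.logical_and(pred, gt).sum() / union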

Mask R-CNN

TLDR
This work presents a conceptually simple, flexible, and general framework for object instance segmentation that outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners.

YouTube-VOS: A Large-Scale Video Object Segmentation Benchmark

Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and T. Huang. ArXiv, abs/1809.03327, 2018.

Supervoxel Attention Graphs for Long-Range Video Modeling

TLDR
This paper introduces an approach that reduces a video of 10 seconds to a sparse graph of only 160 feature nodes such that efficient inference in this graph produces state-of-the-art accuracy on challenging action recognition datasets.

Class-agnostic Object Detection

TLDR
This work introduces class-agnostic object detection as a new problem that focuses on detecting objects irrespective of their object classes, and proposes a new adversarial learning framework that forces the model to exclude class-specific information from the features used for predictions.
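
One common way to realize such an adversarial objective (an illustrative assumption here, not necessarily the paper's exact design) is a gradient-reversal layer feeding an auxiliary category classifier: the classifier tries to predict the class, while the reversed gradients push the feature extractor to make classes indistinguishable.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # flip the gradient sign

class ClassAdversary(nn.Module):
    # Auxiliary head whose loss, through the reversed gradient, removes
    # class-specific information from the backbone features.
    def __init__(self, feat_dim, num_classes, lam=1.0):
        super().__init__()
        self.lam = lam
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, features, class_labels):
        reversed_feats = GradReverse.apply(features, self.lam)
        logits = self.classifier(reversed_feats)
        return nn.functional.cross_entropy(logits, class_labels)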

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

TLDR
Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
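
A minimal sketch of the "16x16 words" idea from the entry above: split the image into fixed-size patches, flatten each patch, and apply a learned linear projection to obtain token embeddings (the class token and position embeddings are omitted); names and shapes are illustrative.

import numpy as np

def patchify(image, patch=16):
    # image: (H, W, 3) with H and W divisible by `patch`.
    # Returns (num_patches, patch*patch*3) flattened patch vectors.
    H, W, C = image.shape
    patches = image.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, patch * patch * C)

def embed(image, weight, patch=16):
    # weight: (patch*patch*3, d_model) learned projection (random here).
    return patchify(image, patch) @ weight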
...