EPIC-KITCHENS VISOR Benchmark: VIdeo Segmentations and Object Relations

Ahmad Darkhalil, Dandan Shan, Bin Zhu, Jian Ma, Amlan Kar, Richard E. L. Higgins, Sanja Fidler, David F. Fouhey and Dima Damen

We introduce VISOR, a new dataset of pixel annotations and a benchmark suite for segmenting hands and active objects in egocentric video. VISOR annotates videos from EPIC-KITCHENS, which comes with a new set of challenges not encountered in current video segmentation datasets. Specifically, we need to ensure both short- and long-term consistency of pixel-level annotations as objects undergo transformative interactions, e.g. an onion is peeled, diced and cooked - where we aim to obtain accurate… 

Breaking the "Object" in Video Object Segmentation

This work collects a new dataset for Video Object Segmentation under Transformations (VOST): more than 700 high-resolution videos, captured in diverse environments, which are 20 seconds long on average and densely labeled with instance masks; it also reports a number of important findings.

EgoTracks: A Long-term Egocentric Visual Object Tracking Dataset

EgoTracks is a new dataset for long-term egocentric visual object tracking, sourced from the Ego4D dataset; it presents a challenge to recent state-of-the-art single-object trackers, which score poorly on traditional tracking metrics here compared to their results on popular benchmarks.

Day2Dark: Pseudo-Supervised Activity Recognition beyond Silent Daylight

A pseudo-supervised learning scheme is introduced that uses task-irrelevant unlabeled dark videos to train an activity recognizer exploiting audio, which is invariant to illumination; this enables effective activity recognition in the dark and can even improve robustness to occlusions.

Retrospectives on the Embodied AI Workshop

A retrospective on the state of Embodied AI research is presented; 13 challenges presented at the Embodied AI Workshop at CVPR are grouped into three themes: visual navigation, rearrangement and integration.

Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction

This work proposes TransFusion, a multimodal transformer-based architecture that exploits the representational power of language by concisely summarizing past actions, leveraging pre-trained image captioning and summarization models.

Rescaling Egocentric Vision

This paper introduces EPIC-KITCHENS-100, the largest annotated egocentric dataset - 100 hrs, 20M frames, 90K actions - of wearable videos capturing long-term unscripted activities in 45 environments, using a novel annotation pipeline that allows denser and more complete annotations of fine-grained actions.

D3D-HOI: Dynamic 3D Human-Object Interactions from Videos

This work introduces D3D-HOI: a dataset of monocular videos with ground truth annotations of 3D object pose, shape and part motion during human-object interactions, demonstrating that human-object relations can significantly reduce the ambiguity of articulated object reconstructions from challenging real-world videos.

A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation

This work presents a new benchmark dataset and evaluation methodology for the area of video object segmentation, named DAVIS (Densely Annotated VIdeo Segmentation), and provides a comprehensive analysis of several state-of-the-art segmentation approaches using three complementary metrics.
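DAVIS's region similarity metric J is the Jaccard index (intersection-over-union) between a predicted and a ground-truth binary mask. As a minimal illustration (a hypothetical helper, not the official evaluation code):

```python
def mask_iou(pred, gt):
    """Region similarity J: intersection-over-union of two binary masks.

    pred, gt: equally-sized 2D lists of 0/1 values.
    """
    inter = union = 0
    for p_row, g_row in zip(pred, gt):
        for p, g in zip(p_row, g_row):
            inter += p and g   # pixel in both masks
            union += p or g    # pixel in either mask
    # Two empty masks are defined as a perfect match.
    return 1.0 if union == 0 else inter / union

pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
print(mask_iou(pred, gt))  # 2 shared pixels / 4 covered pixels = 0.5
```

DAVIS complements J with a contour accuracy measure F and a temporal stability measure; this sketch covers only J.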

Efficient Interactive Annotation of Segmentation Datasets with Polygon-RNN++

This work follows the idea of Polygon-RNN to produce polygonal object annotations interactively with a human in the loop, achieving a large reduction in annotation time on new datasets and moving a step closer to an interactive annotation tool usable in practice.
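Polygon-RNN++ represents each object as a closed sequence of vertices rather than a pixel mask. As a small illustration of working with such annotations (a hypothetical helper, not from the paper), the shoelace formula recovers the enclosed area directly from the vertex list:

```python
def polygon_area(vertices):
    """Area of a simple polygon given as [(x, y), ...] via the shoelace formula."""
    n = len(vertices)
    signed = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]  # wrap around to close the polygon
        signed += x1 * y2 - x2 * y1
    return abs(signed) / 2.0

# Unit square annotated as four clockwise-or-counterclockwise vertices.
print(polygon_area([(0, 0), (1, 0), (1, 1), (0, 1)]))  # 1.0
```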

Object Instance Annotation With Deep Extreme Level Set Evolution

This paper proposes Deep Extreme Level Set Evolution (DELSE), which combines powerful CNN models with level set optimization in an end-to-end fashion and makes the model interactive by incorporating user clicks on extreme boundary points, following DEXTR.

Lending A Hand: Detecting Hands and Recognizing Activities in Complex Egocentric Interactions

This work develops methods to locate and distinguish between hands in egocentric video using strong appearance models with Convolutional Neural Networks, and introduces a simple candidate region generation approach that outperforms existing techniques at a fraction of the computational cost.

Large-Scale Interactive Object Segmentation With Human Annotators

This paper systematically explores in simulation the design space of deep interactive segmentation models and reports new insights and caveats, and presents a technique for automatically estimating the quality of the produced masks which exploits indirect signals from the annotation process.

Scene Parsing through ADE20K Dataset

The ADE20K dataset, spanning diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts, is introduced and it is shown that the trained scene parsing networks can lead to applications such as image content removal and scene synthesis.

Microsoft COCO: Common Objects in Context

We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding.

YouTube-VOS: A Large-Scale Video Object Segmentation Benchmark

A new large-scale video object segmentation dataset called YouTube Video Object Segmentation dataset (YouTube-VOS) is built which aims to establish baselines for the development of new algorithms in the future.