Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding

@article{Sigurdsson2016HollywoodIH,
  title={Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding},
  author={Gunnar A. Sigurdsson and G{\"u}l Varol and Xiaolong Wang and Ali Farhadi and Ivan Laptev and Abhinav Gupta},
  journal={ArXiv},
  year={2016},
  volume={abs/1604.01753}
}
Computer vision has great potential to help in our daily lives, by searching for lost keys, watering flowers or reminding us to take a pill. To succeed at such tasks, computer vision methods need to be trained on real and diverse examples of our everyday dynamic scenes. Because most such scenes are not particularly exciting, they typically do not appear on YouTube, in movies or in TV broadcasts. So how do we collect sufficiently many diverse but boring samples representing our lives? We propose a… 

The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines

This paper details how this large-scale dataset was captured by 32 participants in their native kitchen environments and densely annotated with actions and object interactions, and it introduces new baselines that highlight the multimodal nature of the dataset and the importance of explicit temporal modelling for discriminating fine-grained actions.

Fine-grained Activities of People Worldwide

This work presents Collect, a free mobile app that records video while simultaneously annotating the objects and activities of consented subjects, provides activity classification and activity detection benchmarks for the resulting dataset, and analyzes baseline results to gain insight into how people around the world perform common activities.

Out the Window: A Crowd-Sourced Dataset for Activity Classification in Surveillance Video

Performance evaluation for activity classification on VIRAT Ground 2.0 shows that the OTW dataset provides an 8.3% improvement in mean classification accuracy, and a 12.5% improvement on the most challenging activities involving people with vehicles.

From Lifestyle Vlogs to Everyday Interactions

This work starts with a large collection of interaction-rich video data, which it then annotates and analyzes, using Internet Lifestyle Vlogs as a surprisingly large and diverse source of interaction data.

Scaling Egocentric Vision: The EPIC-KITCHENS Dataset

This paper introduces EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen environments; the participants narrated their own videos after recording, reflecting their true intentions, and ground truth was then crowd-sourced based on these narrations.

Efficient Localization of Human Actions and Moments in Videos

The concept of actor-supervision is introduced, which exploits the inherent compositionality of actions, in terms of transformations of actors, to achieve spatiotemporal localization of actions without the need for action box annotations.
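
To make the actor-supervision idea concrete, here is a minimal Python sketch of one of its ingredients: linking per-frame person detections into spatio-temporal tubes, so that candidate action locations come from the actor rather than from box-level action annotations. The greedy IoU linking, the threshold and the toy boxes are illustrative assumptions, not the paper's actual algorithm.

import numpy as np

def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-10)

def link_detections(per_frame_boxes, iou_threshold=0.3):
    # Greedily grow tubes: each tube is extended by the best-overlapping
    # detection in the next frame whenever that overlap exceeds the threshold;
    # unmatched detections start new tubes.
    tubes = [[box] for box in per_frame_boxes[0]]
    for boxes in per_frame_boxes[1:]:
        unused = list(boxes)
        for tube in tubes:
            if not unused:
                break
            overlaps = [iou(tube[-1], b) for b in unused]
            best = int(np.argmax(overlaps))
            if overlaps[best] >= iou_threshold:
                tube.append(unused.pop(best))
        tubes.extend([b] for b in unused)
    return tubes

# Toy example: one actor drifting right over three frames, plus a spurious box.
frames = [
    [(10, 10, 50, 100)],
    [(14, 10, 54, 100), (200, 50, 240, 140)],
    [(18, 10, 58, 100)],
]
print([len(t) for t in link_detections(frames)])  # [3, 1]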

In-Home Daily-Life Captioning Using Radio Signals

RF-Diary is introduced, a new model that captions daily life by analyzing privacy-preserving radio signals in the home together with the home's floormap, and that uses the ability of radio signals to capture people's 3D dynamics to help the model learn people's interactions with objects.

VidLife: A Dataset for Life Event Extraction from Videos

This work constructs VidLife, a video life event extraction dataset, by exploiting videos from the TV series The Big Bang Theory, whose plot revolves around the daily lives of several characters.

MovieGraphs: Towards Understanding Human-Centric Situations from Videos

MovieGraphs is the first benchmark to focus on inferred properties of human-centric situations, and opens up an exciting avenue towards socially-intelligent AI agents.
...

References

Showing 1-10 of 48 references

Detecting activities of daily living in first-person camera views

This work presents a novel dataset and novel algorithms for the problem of detecting activities of daily living in first-person camera views, and develops novel representations, including temporal pyramids and composite object models, that exploit the fact that objects look different when being interacted with.
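
As a rough illustration of the temporal pyramid idea mentioned above, the Python sketch below mean-pools per-frame feature vectors over segments at several temporal scales and concatenates the results; the pyramid levels, descriptor dimensionality and use of mean pooling are assumptions for illustration, not the paper's exact representation.

import numpy as np

def temporal_pyramid_pool(frame_features, levels=(1, 2, 4)):
    # frame_features: (T, D) array of per-frame descriptors.
    # Returns one vector concatenating the mean of every temporal segment
    # at every pyramid level, so coarse and fine temporal structure coexist.
    pooled = []
    for level in levels:
        for segment in np.array_split(frame_features, level, axis=0):
            pooled.append(segment.mean(axis=0))
    return np.concatenate(pooled)

# Example: 120 frames with 64-dimensional descriptors -> (1 + 2 + 4) * 64 dims.
video_features = np.random.rand(120, 64)
print(temporal_pyramid_pool(video_features).shape)  # (448,)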

HMDB: A large video database for human motion recognition

This paper introduces the largest action video database to date, with 51 action categories containing around 7,000 manually annotated clips extracted from a variety of sources ranging from digitized movies to YouTube, and uses it to evaluate the performance of two representative computer vision systems for action recognition and to explore their robustness under various conditions.

A large-scale benchmark dataset for event recognition in surveillance video

We introduce a new large-scale video dataset designed to assess the performance of diverse visual event recognition algorithms, with a focus on continuous visual event recognition (CVER) in outdoor areas with wide coverage.

Learning realistic human actions from movies

A new method for video classification is presented that builds upon and extends several recent ideas, including local space-time features, space-time pyramids and multi-channel non-linear SVMs, and is shown to improve state-of-the-art results on the standard KTH action dataset.
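
A minimal sketch of the multi-channel non-linear SVM ingredient is given below, assuming two bag-of-features histogram channels (named "hog" and "hof" here purely for illustration) combined with a chi-square kernel; the channel names, histogram sizes and per-channel normalisation are assumptions, not the paper's exact configuration.

import numpy as np
from sklearn.svm import SVC

def chi2_distance(X, Y, eps=1e-10):
    # Pairwise chi-square distances between the rows of X and the rows of Y.
    diff = X[:, None, :] - Y[None, :, :]
    summ = X[:, None, :] + Y[None, :, :]
    return 0.5 * np.sum(diff ** 2 / (summ + eps), axis=2)

def multichannel_kernel(channels_a, channels_b, scales):
    # K = exp(-sum_c D_c / A_c), where D_c is the chi-square distance for
    # channel c and A_c is a per-channel scale estimated on training data.
    n_a = next(iter(channels_a.values())).shape[0]
    n_b = next(iter(channels_b.values())).shape[0]
    total = np.zeros((n_a, n_b))
    for name, scale in scales.items():
        total += chi2_distance(channels_a[name], channels_b[name]) / scale
    return np.exp(-total)

# Toy data: two histogram channels for 20 training clips and 5 test clips.
rng = np.random.default_rng(0)
train = {"hog": rng.random((20, 50)), "hof": rng.random((20, 60))}
test = {"hog": rng.random((5, 50)), "hof": rng.random((5, 60))}
labels = rng.integers(0, 2, size=20)

scales = {name: chi2_distance(X, X).mean() for name, X in train.items()}
clf = SVC(kernel="precomputed").fit(multichannel_kernel(train, train, scales), labels)
print(clf.predict(multichannel_kernel(test, train, scales)))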

ActivityNet: A large-scale video benchmark for human activity understanding

This paper introduces ActivityNet, a new large-scale video benchmark for human activity understanding that aims at covering a wide range of complex human activities that are of interest to people in their daily living.

Recognizing realistic actions from videos “in the wild”

This paper presents a systematic framework for recognizing realistic actions from videos “in the wild”, which uses motion statistics to acquire stable motion features and clean static features, and applies PageRank to mine the most informative static features.
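
To illustrate the PageRank-based mining step, the sketch below runs power-iteration PageRank over a feature-similarity graph and keeps the highest-ranked features; the cosine-similarity graph construction and every name in the example are illustrative assumptions, not the paper's implementation.

import numpy as np

def pagerank(adjacency, damping=0.85, iters=100, tol=1e-9):
    # Power-iteration PageRank on a weighted adjacency matrix.
    n = adjacency.shape[0]
    row_sums = adjacency.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0              # guard against isolated nodes
    transition = adjacency / row_sums          # row-stochastic transition matrix
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        new_rank = (1 - damping) / n + damping * transition.T @ rank
        delta = np.abs(new_rank - rank).sum()
        rank = new_rank
        if delta < tol:
            break
    return rank

# Toy example: rank 100 candidate static features (e.g. visual words) by how
# central they are in a cosine-similarity graph over their descriptors.
rng = np.random.default_rng(0)
descriptors = rng.random((100, 32))
unit = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
similarity = unit @ unit.T
np.fill_diagonal(similarity, 0.0)

scores = pagerank(similarity)
top_features = np.argsort(scores)[::-1][:10]   # the ten highest-ranked features
print(top_features)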

First-Person Animal Activity Recognition from Egocentric Videos

This paper constructs a new dataset composed of first-person animal videos obtained by mounting a camera on each of four pet dogs, and implements multiple baseline approaches to recognize activities from such videos while utilizing multiple types of global and local motion features.

Using Descriptive Video Services to Create a Large Data Source for Video Annotation Research

An automatic DVS segmentation and alignment method for movies is described, which enables the collection of a DVS-derived dataset to be scaled up with minimal human intervention.

PhotoCity: training experts at large-scale image acquisition through a competitive game

This work leverages the community of photographers around the world to collaboratively acquire large-scale image collections through PhotoCity, an online game that trains its players to become "experts" at taking photos at targeted locations and in great density, for the purpose of creating 3D building models.

Actions in context

This paper automatically discovers relevant scene classes and their correlation with human actions, shows how to learn selected scene classes from video without manual supervision, and develops a joint framework for action and scene recognition that demonstrates improved recognition of both in natural video.