Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding

@article{Sigurdsson2016HollywoodIH,
  title={Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding},
  author={Gunnar A. Sigurdsson and G{\"u}l Varol and X. Wang and Ali Farhadi and I. Laptev and Abhinav Gupta},
  journal={ArXiv},
  year={2016},
  volume={abs/1604.01753}
}
Computer vision has a great potential to help our daily lives by searching for lost keys, watering flowers or reminding us to take a pill. To succeed with such tasks, computer vision methods need to be trained from real and diverse examples of our daily dynamic scenes. While most of such scenes are not particularly exciting, they typically do not appear on YouTube, in movies or TV broadcasts. So how do we collect sufficiently many diverse but boring samples representing our lives? We propose a…
Citations

The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines
TLDR: This paper details how this large-scale dataset was captured by 32 participants in their native kitchen environments and densely annotated with actions and object interactions, and introduces new baselines that highlight the multimodal nature of the dataset and the importance of explicit temporal modelling for discriminating fine-grained actions.
Scaling Egocentric Vision: The EPIC-KITCHENS Dataset
TLDR: This paper introduces EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen environments; the participants narrated their own videos after recording, thus reflecting true intention, and ground truths were crowd-sourced based on these narrations.
From Lifestyle Vlogs to Everyday Interactions
TLDR: This work starts with a large collection of interaction-rich video data and then annotates and analyzes it, using Internet Lifestyle Vlogs as a surprisingly large and diverse source of interaction data.
Out the Window: A Crowd-Sourced Dataset for Activity Classification in Surveillance Video
TLDR: Performance evaluation for activity classification on VIRAT Ground 2.0 shows that the OTW dataset provides an 8.3% improvement in mean classification accuracy, and a 12.5% improvement on the most challenging activities involving people with vehicles.
Efficient Localization of Human Actions and Moments in Videos
TLDR: The concept of actor-supervision is introduced, which exploits the inherent compositionality of actions, in terms of transformations of actors, to achieve spatiotemporal localization of actions without the need for action box annotations.
In-Home Daily-Life Captioning Using Radio Signals
TLDR: RF-Diary is introduced, a new model for captioning daily life by analyzing the privacy-preserving radio signal in the home together with the home's floormap; it uses the ability of radio signals to capture people's 3D dynamics to help the model learn people's interactions with objects.
MovieGraphs: Towards Understanding Human-Centric Situations from Videos
TLDR: MovieGraphs is the first benchmark to focus on inferred properties of human-centric situations, and it opens up an exciting avenue towards socially-intelligent AI agents.
APES: Audiovisual Person Search in Untrimmed Video
TLDR: This paper presents the Audiovisual Person Search dataset (APES), a new dataset composed of untrimmed videos whose audio (voices) and visual streams are densely annotated, and shows that modeling audiovisual cues benefits the recognition of people's identities.

References

Showing 1-10 of 56 references
Detecting activities of daily living in first-person camera views
TLDR: This work presents a novel dataset and novel algorithms for the problem of detecting activities of daily living in first-person camera views, and develops novel representations, including temporal pyramids and composite object models, that exploit the fact that objects look different when being interacted with.
HMDB: A large video database for human motion recognition
TLDR: This paper uses the largest action video database to date, with 51 action categories containing in total around 7,000 manually annotated clips extracted from a variety of sources ranging from digitized movies to YouTube, to evaluate the performance of two representative computer vision systems for action recognition and to explore the robustness of these methods under various conditions.
Much Ado About Time: Exhaustive Annotation of Temporal Data
TLDR: This work investigates and determines the most cost-effective way of obtaining high-quality multi-label annotations for temporal data such as videos, and concludes that the optimal strategy is to ask as many questions as possible in a HIT.
A large-scale benchmark dataset for event recognition in surveillance video
We introduce a new large-scale video dataset designed to assess the performance of diverse visual event recognition algorithms, with a focus on continuous visual event recognition (CVER) in outdoor…
Learning realistic human actions from movies
TLDR: A new method for video classification that builds upon and extends several recent ideas, including local space-time features, space-time pyramids and multi-channel non-linear SVMs, is presented and shown to improve state-of-the-art results on the standard KTH action dataset.
ActivityNet: A large-scale video benchmark for human activity understanding
TLDR: This paper introduces ActivityNet, a new large-scale video benchmark for human activity understanding that aims at covering a wide range of complex human activities that are of interest to people in their daily living.
Recognizing realistic actions from videos “in the wild”
TLDR: This paper presents a systematic framework for recognizing realistic actions from videos “in the wild”; it uses motion statistics to acquire stable motion features and clean static features, and PageRank to mine the most informative static features.
First-Person Animal Activity Recognition from Egocentric Videos
TLDR: This paper constructs a new dataset composed of first-person animal videos obtained by mounting a camera on each of four pet dogs, and implements multiple baseline approaches to recognize activities from such videos while utilizing multiple types of global/local motion features.
Using Descriptive Video Services to Create a Large Data Source for Video Annotation Research
TLDR: An automatic DVS segmentation and alignment method for movies is described that enables scaling up the collection of a DVS-derived dataset with minimal human intervention.
PhotoCity: training experts at large-scale image acquisition through a competitive game
TLDR: This work explores the community of photographers around the world to collaboratively acquire large-scale image collections through PhotoCity, an online game that trains its players to become "experts" at taking photos at targeted locations and in great density, for the purposes of creating 3D building models.