Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding
TLDR
We propose a novel Hollywood in Homes approach to collect a large-scale dataset of boring videos of daily activities.
Asynchronous Temporal Fields for Action Recognition
TLDR
We propose a fully-connected temporal CRF model for reasoning over various aspects of activities that includes objects, actions, and intentions, where the potentials are predicted by a deep network.
Learning Visual Storylines with Skipping Recurrent Neural Networks
TLDR
Skipping Recurrent Neural Network (S-RNN) uses a framework that skips through the image stream to explore the space of all ordered subsets of the albums via an efficient sampling procedure.
Actor and Observer: Joint Modeling of First and Third-Person Videos
TLDR
We introduce Charades-Ego, a large-scale dataset of paired first-person and third-person videos, involving 112 people, with 4000 paired videos.
Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos
TLDR
We present Charades-Ego with 68,536 activity instances in 68.8 hours of first and third-person video, making it one of the largest and most diverse egocentric datasets available.
What Actions are Needed for Understanding Human Actions in Videos?
TLDR
We present the many kinds of information that will be needed to achieve substantial gains in activity understanding: objects, verbs, intent, and sequential reasoning.
How to measure snoring? A comparison of the microphone, cannula and piezoelectric sensor
The objective of this study was to compare the methods currently recommended by the American Academy of Sleep Medicine (AASM) to measure snoring: the microphone, the cannula, and the piezoelectric sensor.
Much Ado About Time: Exhaustive Annotation of Temporal Data
TLDR
We investigate and determine the most cost-effective way of obtaining high-quality multi-label annotations for temporal data such as videos.
Visual Grounding in Video for Unsupervised Word Translation
TLDR
We use visual grounding to improve unsupervised word mapping between languages by learning embeddings from unpaired instructional videos narrated in the native language.