• Publications
Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding
TLDR
This work proposes the novel Hollywood in Homes approach to data collection and uses it to build a new dataset, Charades, in which hundreds of people record videos in their own homes acting out casual everyday activities; it also evaluates and provides baseline results for several tasks, including action recognition and automatic description generation.
Asynchronous Temporal Fields for Action Recognition
TLDR
This work proposes a fully-connected temporal CRF model for reasoning over various aspects of activities, including objects, actions, and intentions, where the potentials are predicted by a deep network.
Learning Visual Storylines with Skipping Recurrent Neural Networks
TLDR
The novel Skipping Recurrent Neural Network model skips through the images in a photo stream to explore the space of all ordered subsets of an album via an efficient sampling procedure, reducing the negative impact of strong short-term correlations and recovering the latent story more accurately.
Actor and Observer: Joint Modeling of First and Third-Person Videos
TLDR
Charades-Ego is introduced, a large-scale dataset of 4000 paired first-person and third-person videos involving 112 people, which enables learning the link between the actor and observer perspectives and addresses one of the biggest bottlenecks facing egocentric vision research.
Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos
TLDR
Charades-Ego has temporal annotations and textual descriptions, making it suitable for egocentric video classification, localization, captioning, and new tasks utilizing the cross-modal nature of the data.
What Actions are Needed for Understanding Human Actions in Videos?
TLDR
This work analyzes datasets, evaluation metrics, algorithms, and potential future directions, finding that fine-grained understanding of objects and pose, when combined with temporal reasoning, is likely to yield substantial improvements in algorithmic accuracy.
How to measure snoring? A comparison of the microphone, cannula and piezoelectric sensor
TLDR
This work compares the methods currently recommended by the American Academy of Sleep Medicine to measure snoring, an acoustic sensor, a piezoelectric sensor, and a nasal pressure transducer (cannula), and finds that the chest audio picked up the highest number of snore events among the different snore sensors.
Visual Grounding in Video for Unsupervised Word Translation
TLDR
The key idea is to establish a common visual representation between two languages by learning embeddings from unpaired instructional videos narrated in the native language, forming the basis for the proposed hybrid visual-text mapping algorithm, MUVE.
Much Ado About Time: Exhaustive Annotation of Temporal Data
TLDR
This work investigates the most cost-effective way of obtaining high-quality multi-label annotations for temporal data such as videos, concluding that the optimal strategy is to ask as many questions as possible in a single HIT.