Generating Videos with Scene Dynamics
A generative adversarial network for video is proposed, with a spatio-temporal convolutional architecture that untangles the scene's foreground from its background, and it is shown to generate tiny videos up to a second long at full frame rate better than simple baselines.
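A minimal sketch of the foreground/background composition described above, assuming PyTorch; the tensor shapes and the `compose_video` helper are illustrative assumptions, not the paper's released code:

```python
import torch

def compose_video(foreground, mask, background):
    """Blend a moving foreground with a static background, per pixel and per frame.

    foreground: (B, 3, T, H, W) output of a spatio-temporal (3D conv) stream
    mask:       (B, 1, T, H, W) blending weights in [0, 1]
    background: (B, 3, H, W)    a single static frame
    """
    # Broadcast the static background across the time dimension.
    background = background.unsqueeze(2).expand_as(foreground)
    # Each output pixel is a convex combination of the two streams.
    return mask * foreground + (1.0 - mask) * background

# Toy usage with random tensors standing in for generator outputs.
fg = torch.rand(2, 3, 32, 64, 64)
m = torch.rand(2, 1, 32, 64, 64)
bg = torch.rand(2, 3, 64, 64)
video = compose_video(fg, m, bg)  # (2, 3, 32, 64, 64)
```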
SoundNet: Learning Sound Representations from Unlabeled Video
This work proposes a student-teacher training procedure that transfers discriminative visual knowledge from well-established visual recognition models into the sound modality, using unlabeled video as a bridge, and finds that some high-level semantics emerge automatically in the sound network even though it is trained without ground-truth labels.
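A rough sketch of the student-teacher transfer, assuming PyTorch; `vision_teacher` and `sound_student` are hypothetical stand-ins for a pretrained image classifier and a 1-D convolutional audio network, not SoundNet's actual modules:

```python
import torch
import torch.nn.functional as F

def transfer_loss(sound_student, vision_teacher, waveform, frames):
    """Distill visual class posteriors into the sound network.

    waveform: (B, 1, num_samples) raw audio from an unlabeled video
    frames:   (B, 3, H, W)        a frame from the same video
    """
    with torch.no_grad():
        # Teacher: class distribution predicted from the visual stream.
        target = F.softmax(vision_teacher(frames), dim=1)
    # Student: class distribution predicted from sound alone.
    log_pred = F.log_softmax(sound_student(waveform), dim=1)
    # KL divergence pulls the sound posterior toward the visual one;
    # no ground-truth labels are used anywhere.
    return F.kl_div(log_pred, target, reduction="batchmean")
```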
A large-scale benchmark dataset for event recognition in surveillance video
We introduce a new large-scale video dataset designed to assess the performance of diverse visual event recognition algorithms with a focus on continuous visual event recognition (CVER) in outdoor…
VideoBERT: A Joint Model for Video and Language Representation Learning
- Chen Sun, Austin Myers, Carl Vondrick, K. Murphy, C. Schmid
- Computer Science · IEEE/CVF International Conference on Computer…
- 3 April 2019
This work builds upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively, and the resulting model can be applied directly to open-vocabulary classification.
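A small sketch of the vector-quantization step, assuming precomputed clip features and scikit-learn; the feature dimensions, vocabulary size, and the offsetting of visual tokens past the text vocabulary are illustrative assumptions, not VideoBERT's exact tokenization:

```python
import numpy as np
from sklearn.cluster import KMeans

# Suppose `clip_features` holds one pooled feature vector per short video clip.
clip_features = np.random.randn(5000, 512).astype(np.float32)

# Quantize feature space into a fixed "visual vocabulary" of cluster centroids.
kmeans = KMeans(n_clusters=256, n_init=4, random_state=0).fit(clip_features)

def video_to_tokens(features, text_vocab_size=30522):
    """Map each clip to a discrete token id placed after the text vocabulary."""
    cluster_ids = kmeans.predict(features)
    # Visual tokens share one id space with the text tokens.
    return cluster_ids + text_vocab_size

tokens = video_to_tokens(clip_features[:48])  # one visual token per clip
```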
Moments in Time Dataset: One Million Videos for Event Understanding
- Mathew Monfort, Bolei Zhou, A. Oliva
- Computer Science · IEEE Transactions on Pattern Analysis and Machine…
- 9 January 2018
The Moments in Time dataset, a large-scale human-annotated collection of one million short videos corresponding to dynamic events unfolding within three seconds, can serve as a new challenge to develop models that scale to the level of complexity and abstract reasoning that a human processes on a daily basis.
The Sound of Pixels
- Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh H. McDermott, A. Torralba
- Computer Science · ECCV
- 9 April 2018
Qualitative results suggest the PixelPlayer model learns to ground sounds in vision, enabling applications such as independently adjusting the volume of sound sources, and experimental results show that the proposed Mix-and-Separate framework outperforms several baselines on source separation.
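A condensed sketch of a Mix-and-Separate training step, assuming PyTorch and magnitude spectrograms; `audio_net` and `visual_net` are hypothetical stand-ins for the PixelPlayer components, and the ratio-mask target is a simplification:

```python
import torch
import torch.nn.functional as F

def mix_and_separate_step(audio_net, visual_net, spec_a, spec_b, frames_a, frames_b):
    """Mix the audio of two videos, then learn to separate each source back out.

    spec_a, spec_b:     (B, 1, F, T) magnitude spectrograms of two videos
    frames_a, frames_b: (B, 3, H, W) a frame from each video
    """
    mixture = spec_a + spec_b  # the synthetic mixture is the only audio input
    loss = 0.0
    for spec, frames in ((spec_a, frames_a), (spec_b, frames_b)):
        # Visual features condition the audio network to pick out "its" source.
        visual_feat = visual_net(frames)                             # (B, C)
        pred_mask = torch.sigmoid(audio_net(mixture, visual_feat))   # (B, 1, F, T)
        # Self-supervised target: the ratio mask that recovers this source.
        target_mask = (spec / mixture.clamp(min=1e-8)).clamp(0.0, 1.0)
        loss = loss + F.binary_cross_entropy(pred_mask, target_mask)
    return loss
```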
Anticipating Visual Representations from Unlabeled Video
- Carl Vondrick, H. Pirsiavash, A. Torralba
- Computer Science · IEEE Conference on Computer Vision and Pattern…
- 29 April 2015
This work presents a framework that capitalizes on temporal structure in unlabeled video to learn to anticipate visual representations of the future, and applies standard recognition algorithms to the predicted representation in order to anticipate objects and actions.
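A toy sketch of the anticipation setup, assuming PyTorch: regress the representation of a frame several seconds ahead from the current frame, then run an ordinary classifier on the predicted representation; the module names are placeholders and the MSE objective is a simplification of the paper's regression:

```python
import torch
import torch.nn.functional as F

def anticipation_loss(predictor, feature_net, frame_now, frame_future):
    """Train `predictor` to map the current frame to the future frame's features.

    frame_now, frame_future: (B, 3, H, W) frames sampled a fixed gap apart
    from unlabeled video; no action labels are required for this step.
    """
    with torch.no_grad():
        target = feature_net(frame_future)  # representation of the future frame
    predicted = predictor(frame_now)
    return F.mse_loss(predicted, target)

def anticipate_action(predictor, classifier, frame_now):
    """At test time, apply a standard classifier to the *predicted* representation."""
    return classifier(predictor(frame_now)).argmax(dim=1)
```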
Assessing the Quality of Actions
A learning-based framework takes steps towards assessing how well people perform actions in videos by training a regression model from spatio-temporal pose features to scores obtained from expert judges, and it can also provide interpretable feedback on how people can improve their actions.
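A minimal sketch of the score regression, assuming precomputed spatio-temporal pose features and scikit-learn; the feature extraction and the SVR hyperparameters are illustrative assumptions rather than the paper's exact pipeline:

```python
import numpy as np
from sklearn.svm import SVR

# pose_features: one spatio-temporal pose descriptor per performance video
# judge_scores:  the corresponding scores assigned by expert judges
pose_features = np.random.randn(200, 512)
judge_scores = np.random.uniform(20.0, 100.0, size=200)

# Fit a support vector regressor from pose features to expert scores.
model = SVR(kernel="linear", C=1.0).fit(pose_features, judge_scores)

# Predict a quality score for an unseen performance.
new_score = model.predict(pose_features[:1])
```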
Efficiently Scaling up Crowdsourced Video Annotation
- Carl Vondrick, Donald J. Patterson, D. Ramanan
- Computer Science · International Journal of Computer Vision
- 5 September 2012
It is argued that video annotation requires specialized skill: most workers are poor annotators, which mandates robust quality-control protocols, and there is an inherent trade-off between the mix of human and cloud computing used and the accuracy and cost of the labeling.
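That cost/accuracy trade-off hinges on how many frames humans label versus how many are filled in automatically; a tiny sketch, assuming plain linear interpolation of bounding boxes between worker-annotated keyframes (one simple strategy of this kind, not the full system):

```python
def interpolate_boxes(keyframes):
    """Fill in bounding boxes between sparse human-annotated keyframes.

    keyframes: dict mapping frame index -> (x, y, w, h) box drawn by a worker.
    Returns a dense dict covering every in-between frame by linear interpolation.
    """
    frames = sorted(keyframes)
    dense = {}
    for start, end in zip(frames, frames[1:]):
        b0, b1 = keyframes[start], keyframes[end]
        span = end - start
        for t in range(start, end + 1):
            alpha = (t - start) / span
            dense[t] = tuple((1 - alpha) * a + alpha * b for a, b in zip(b0, b1))
    return dense

# Two worker-labeled keyframes; frames 1..9 are filled in automatically.
boxes = interpolate_boxes({0: (10, 20, 50, 80), 10: (30, 22, 50, 80)})
```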
Where are they looking?
A deep neural network-based approach for gaze-following is proposed, along with a new benchmark dataset, GazeFollow, for thorough evaluation, and the approach is shown to produce reliable results even when only the back of a person's head is visible.
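A rough sketch of a two-pathway gaze-following model in the spirit of the approach above, assuming PyTorch; `saliency_net` and `gaze_net` are hypothetical stand-ins, and the real model's heads and losses differ in detail:

```python
import torch

def follow_gaze(saliency_net, gaze_net, image, head_crop, head_pos):
    """Predict which grid cell of the image a person is looking at.

    image:     (B, 3, H, W) full scene
    head_crop: (B, 3, h, w) close-up of the person's head (may show only its back)
    head_pos:  (B, 2)       normalized head location within the image
    """
    saliency = saliency_net(image)             # (B, 1, G, G): where anyone might look
    gaze_mask = gaze_net(head_crop, head_pos)  # (B, 1, G, G): where this head points
    # The two pathways combine multiplicatively into a fixation heatmap.
    heatmap = saliency * gaze_mask
    batch = heatmap.shape[0]
    return heatmap.view(batch, -1).argmax(dim=1)
```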