Corpus ID: 236965903

Learning to Cut by Watching Movies

  title={Learning to Cut by Watching Movies},
  author={A. Pardo and Fabian Caba Heilbron and Juan León Alcázar and Ali K. Thabet and Bernard Ghanem},
Video content creation keeps growing at an incredible pace; yet, creating engaging stories remains challenging and requires non-trivial video editing expertise. Many video editing components are astonishingly hard to automate, primarily due to the lack of raw video materials. This paper focuses on a new task for computational video editing, namely the task of ranking cut plausibility. Our key idea is to leverage content that has already been edited to learn fine-grained audiovisual patterns that…


Watching the World Go By: Representation Learning from Unlabeled Videos
Video Noise Contrastive Estimation is proposed, a method for using unlabeled video to learn strong, transferable single image representations that demonstrate improvements over recent unsupervised single image techniques, as well as over fully supervised ImageNet pretraining, across a variety of temporal and non-temporal tasks.
TVSum: Summarizing web videos using titles
A novel co-archetypal analysis technique is developed that learns canonical visual concepts shared between video and images, but not in either alone, by finding a joint-factorial representation of two data sets.
Rethinking the Evaluation of Video Summaries
It turns out that video segmentation, often considered a fixed pre-processing step, has the most significant impact on the performance measure; an intuitive visualization of the correlation between the estimated scoring and human annotations is also proposed.
MovieNet: A Holistic Dataset for Movie Understanding
MovieNet is the largest dataset with the richest annotations for comprehensive movie understanding, and it is believed that such a holistic dataset will promote research on story-based long video understanding and beyond.
Vggsound: A Large-Scale Audio-Visual Dataset
The goal is to collect a large-scale audio-visual dataset with low label noise from videos ‘in the wild’ using computer vision techniques and investigates various Convolutional Neural Network architectures and aggregation approaches to establish audio recognition baselines for this new dataset.
Ridiculously Fast Shot Boundary Detection with Fully Convolutional Neural Networks
  • Michael Gygli
  • Computer Science
  • 2018 International Conference on Content-Based Multimedia Indexing (CBMI)
  • 2018
This work proposes a Convolutional Neural Network (CNN) which is fully convolutional in time, thus allowing the use of a large temporal context without the need to repeatedly process frames.
Use What You Have: Video retrieval using representations from collaborative experts
This paper proposes a collaborative experts model to aggregate information from these different pre-trained experts and assesses the approach empirically on five retrieval benchmarks: MSR-VTT, LSMDC, MSVD, DiDeMo, and ActivityNet.
Look, Listen and Learn
There is a valuable, but so far untapped, source of information contained in the video itself: the correspondence between the visual and the audio streams. A novel “Audio-Visual Correspondence” learning task is introduced to make use of this.
Text-based editing of talking-head video
This work proposes a novel method to edit talking-head video based on its transcript to produce a realistic output video in which the dialogue of the speaker has been modified, while maintaining a seamless audio-visual flow (i.e. no jump cuts). Expand
A Unified Framework for Shot Type Classification Based on Subject Centric Lens
A learning framework, Subject Guidance Network (SGNet), for shot type recognition is proposed, which separates the subject and background of a shot into two streams, serving as separate guidance maps for scale and movement type classification respectively.