Straight to the Point: Fast-Forwarding Videos via Reinforcement Learning Using Textual Data

W. Ramos, M. M. Silva, E. R. Araujo, L. S. Marcolino, and E. R. Nascimento. "Straight to the Point: Fast-Forwarding Videos via Reinforcement Learning Using Textual Data." 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
The rapid increase in the amount of published visual data and the limited time of users create a demand for processing untrimmed videos into shorter versions that convey the same information. Despite the remarkable progress made by summarization methods, most of them can only select a few frames or skims, which creates visual gaps and breaks the video context. In this paper, we present a novel methodology based on a reinforcement learning formulation to accelerate…


Text-Driven Video Acceleration: A Weakly-Supervised Reinforcement Learning Method

A novel weakly-supervised methodology based on a reinforcement learning formulation to accelerate instructional videos using text and the Extended Visually-guided Document Attention Network (VDAN+), which can generate a highly discriminative embedding space to represent both textual and visual data.

Zeus: Efficiently Localizing Actions in Videos using Reinforcement Learning

Evaluation on three diverse video datasets shows that Zeus outperforms state-of-the-art frame- and window-based filtering techniques by up to 22.1x and 4.7x, respectively, and consistently meets the user-specified accuracy target across all queries.

On the impact of MDP design for Reinforcement Learning agents in Resource Management

It is shown that, in the authors' experiments, when using Multi-Layer Perceptrons as the approximation function, a compact state representation allows the transfer of agents between environments, and that transferred agents perform well, outperforming specialized agents in 80% of the tested scenarios even without retraining.

Deep Reinforcement Learning for Unsupervised Video Summarization with Diversity-Representativeness Reward

This paper formulates video summarization as a sequential decision-making process and develops a deep summarization network (DSN) to summarize videos, which is comparable or even superior to most published supervised approaches.

A Weighted Sparse Sampling and Smoothing Frame Transition Approach for Semantic Fast-Forward First-Person Videos

A new adaptive frame selection formulated as a weighted minimum reconstruction problem is presented, which, combined with a smoothing frame transition method, accelerates first-person videos while emphasizing the relevant segments and avoiding visual discontinuities.

FFNet: Video Fast-Forwarding via Reinforcement Learning

FastForwardNet is introduced, a reinforcement learning agent that draws inspiration from video summarization but approaches fast-forwarding differently: it is an online framework that automatically fast-forwards a video and presents a representative subset of frames to users on the fly.

Video Summarization with Long Short-Term Memory

Long Short-Term Memory (LSTM), a special type of recurrent neural network, is used to model the variable-range dependencies entailed in the task of video summarization, improving summaries by reducing the discrepancies in statistical properties across datasets.

Video summarization by learning submodular mixtures of objectives

A new method is introduced that uses a supervised approach to learn the importance of global characteristics of a summary and jointly optimizes for multiple objectives, thus creating summaries that possess multiple properties of a good summary.

Highlight Detection with Pairwise Deep Ranking for First-Person Video Summarization

  • Ting Yao, Tao Mei, Y. Rui
  • 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
A novel pairwise deep ranking model is proposed that employs deep learning techniques to learn the relationship between highlight and non-highlight video segments, improving over the state-of-the-art RankSVM method by 10.5% in terms of accuracy.

Towards Automatic Learning of Procedures From Web Instructional Videos

A segment-level recurrent network is proposed for generating procedure segments by modeling the dependencies across segments and it is shown that the proposed model outperforms competitive baselines in procedure segmentation.

Jointly Modeling Embedding and Translation to Bridge Video and Language

A novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), is proposed, which simultaneously explores the learning of LSTM and visual-semantic embedding and outperforms several state-of-the-art techniques in predicting Subject-Verb-Object (SVO) triplets.

Unsupervised Video Summarization with Adversarial LSTM Networks

This paper addresses the problem of unsupervised video summarization, formulated as selecting a sparse subset of video frames that optimally represent the input video, with a novel generative adversarial framework.

Predicting Visual Features From Text for Image and Video Caption Retrieval

This paper contributes Word2VisualVec, a deep neural network architecture that learns to predict a visual feature representation from textual input, and generalizes Word2VisualVec to video caption retrieval by predicting from text both three-dimensional convolutional neural network features and a visual-audio representation.