Summarizing First-Person Videos from Third Persons' Points of Views

@inproceedings{Ho2018SummarizingFV,
  title={Summarizing First-Person Videos from Third Persons' Points of Views},
  author={Hsuan-I Ho and Wei-Chen Chiu and Y. Wang},
  booktitle={ECCV},
  year={2018}
}
Video highlighting and summarization are topics of interest in computer vision, benefiting a variety of applications such as viewing, searching, and storage. Key Method: our proposed model is realized in a semi-supervised setting, in which fully annotated third-person videos, unlabeled first-person videos, and a small number of annotated first-person videos are presented during training. In our experiments, qualitative and quantitative evaluations on both benchmarks and our collected first-person video…
Temporal U-Nets for Video Summarization with Scene and Action Recognition
TLDR
This work proposes a novel convolutional neural network architecture for handling untrimmed videos with multiple contents, in which the encoder captures long-term temporal dynamics from the entire video and the decoder predicts detailed temporal information for the video's multiple contents.
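The encoder–decoder idea described above can be pictured as a 1-D U-Net over per-frame features. Below is a minimal sketch of that pattern, not the paper's architecture: the feature dimension, hidden width, and single down/up level are assumptions for illustration.

import torch
import torch.nn as nn

class TemporalUNet(nn.Module):
    """1-D U-Net over frame features: the encoder downsamples in time to
    capture long-range context; the decoder upsamples with a skip
    connection to predict detailed per-frame labels."""
    def __init__(self, feat_dim=512, hidden=128, n_classes=10):
        super().__init__()
        self.enc1 = nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1)
        self.pool = nn.MaxPool1d(2)
        self.enc2 = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.dec = nn.Conv1d(2 * hidden, hidden, kernel_size=3, padding=1)
        self.head = nn.Conv1d(hidden, n_classes, kernel_size=1)

    def forward(self, x):                          # x: (batch, feat_dim, time)
        e1 = torch.relu(self.enc1(x))              # full temporal resolution
        e2 = torch.relu(self.enc2(self.pool(e1)))  # coarse, long-range context
        d = torch.cat([self.up(e2), e1], dim=1)    # skip connection
        d = torch.relu(self.dec(d))
        return self.head(d)                        # per-frame class logits

logits = TemporalUNet()(torch.randn(1, 512, 64))   # time length divisible by 2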
Generating 1 Minute Summaries of Day Long Egocentric Videos
TLDR
This paper presents a novel unsupervised reinforcement learning technique to generate video summaries from day-long egocentric videos and shows that the approach produces summaries focusing on social interactions, similar to the current state of the art (SOTA).
Video Summarization Using Deep Neural Networks: A Survey
TLDR
This work proposes a taxonomy of the existing algorithms and provides a systematic review of the relevant literature, tracing the evolution of deep-learning-based video summarization technologies and offering suggestions for future developments.
GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video Summarization
TLDR
A new method is proposed that uses a specialized attention network and contextualized word representations to tackle multi-modal video summarization; it is effective, yielding a +5.88% increase in accuracy and a +4.06% increase in F1-score compared with the state-of-the-art method.
Weakly-Supervised Online Action Segmentation in Multi-View Instructional Videos
TLDR
A framework is proposed to segment streaming videos online at test time using dynamic programming, showing its advantages over a greedy sliding-window approach; the framework is further improved by an Online-Offline Discrepancy Loss (OODL) that encourages segmentation results to have higher temporal consistency.
Video Skimming
TLDR
A taxonomy of video skimming approaches is presented, their evolution and key advances are discussed, and a study of the components required to evaluate video skimming performance is provided.
Compare and Select: Video Summarization with Multi-Agent Reinforcement Learning
TLDR
Inspired by general user behaviours, the summarization process is formulated as multiple sequential decision-making processes; the proposed Comparison-Selection Network (CoSNet), based on multi-agent reinforcement learning, outperforms state-of-the-art unsupervised methods with the unsupervised reward and surpasses most supervised methods with the complete reward.
Scene Walk: a non-photorealistic viewing tool for first-person video
TLDR
Scene Walk is found to let viewers build a more accurate and effective cognitive map of first-person video than a conventional video browsing interface, and this cognitive map is comparable to the one formed by actually walking through the original environment.
Learning Affordance Grounding from Exocentric Images
TLDR
A cross-view knowledge transfer framework is devised that extracts affordance-specific features from exocentric interactions and enhances the perception of affordance regions by preserving affordance correlation.
...

References

Highlight Detection with Pairwise Deep Ranking for First-Person Video Summarization
  • Ting Yao, Tao Mei, Y. Rui
  • 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016
TLDR
A novel pairwise deep ranking model is proposed that employs deep learning to learn the relationship between highlight and non-highlight video segments, improving over the state-of-the-art RankSVM method by 10.5% in accuracy.
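As a rough sketch of the pairwise ranking idea (not the authors' implementation), a shared scoring network can be trained with a margin ranking loss so that highlight segments score above non-highlight ones; the feature dimension and network shape below are assumptions for illustration.

import torch
import torch.nn as nn

class SegmentScorer(nn.Module):
    """Maps a segment feature vector to a scalar highlight score."""
    def __init__(self, feat_dim=2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, 1),
        )

    def forward(self, x):
        return self.mlp(x).squeeze(-1)

scorer = SegmentScorer()
# hypothetical batches of precomputed features for highlight / non-highlight pairs
pos, neg = torch.randn(32, 2048), torch.randn(32, 2048)
# margin ranking loss: score(pos) should exceed score(neg) by at least the margin
loss = nn.MarginRankingLoss(margin=1.0)(scorer(pos), scorer(neg), torch.ones(32))
loss.backward()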
Summarizing While Recording: Context-Based Highlight Detection for Egocentric Videos
TLDR
A context-based highlight detection method is proposed that generates summaries immediately, without watching the whole video sequence, and a joint approach is developed that simultaneously optimizes the context and highlight models in a unified learning framework.
Weakly Supervised Summarization of Web Videos
TLDR
Casting summarization as a weakly supervised learning problem, this work proposes a flexible deep 3D CNN architecture that learns the notion of importance using only video-level annotation, without any human-crafted training data.
Summarization of Egocentric Videos: A Comprehensive Survey
TLDR
This paper provides the first comprehensive survey of the techniques used specifically to summarize egocentric videos, and presents a framework for first-person view summarization and compares the segmentation methods and selection algorithms used by the related work in the literature.
TVSum: Summarizing web videos using titles
TLDR
A novel co-archetypal analysis technique is developed that learns canonical visual concepts shared between video and images, but not in either alone, by finding a joint-factorial representation of two data sets.
A General Framework for Edited Video and Raw Video Summarization
TLDR
A general framework for both edited-video and raw-video summarization, designed to capture the properties of good summaries: containing important people and objects, being representative of the video content, avoiding similar key-shots, and exhibiting diversity and a smooth storyline.
Predicting Important Objects for Egocentric Video Summarization
TLDR
The proposed video summarization approach is neither camera-wearer-specific nor object-specific; that means the learned importance metric need not be trained for a given user or context, and it can predict the importance of objects and people that have never been seen previously.
Video summarization by learning submodular mixtures of objectives
TLDR
A new method is introduced that uses a supervised approach to learn the importance of a summary's global characteristics; it jointly optimizes multiple objectives and thus creates summaries that possess multiple properties of a good summary.
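A minimal sketch of that recipe, assuming two hand-picked stand-in objectives (a facility-location coverage term and a similarity-penalizing diversity term) in place of the paper's learned submodular mixture:

import numpy as np

def greedy_summary(features, objectives, weights, budget):
    """Greedily add the frame that most increases sum_i w_i * f_i(S)."""
    selected, remaining = [], list(range(len(features)))
    score = lambda S: sum(w * f(S, features)
                          for w, f in zip(weights, objectives))
    for _ in range(budget):
        best = max(remaining, key=lambda j: score(selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

def coverage(S, X):
    # facility-location: how well the selected frames represent all frames
    return sum(max(float(X[j] @ X[i]) for i in S) for j in range(len(X)))

def diversity(S, X):
    # penalize pairwise similarity among the selected frames
    return -sum(float(X[a] @ X[b]) for a in S for b in S if a < b)

X = np.random.randn(50, 64)
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm frame features
print(greedy_summary(X, [coverage, diversity], weights=[1.0, 0.5], budget=5))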
Video Summarization With Attention-Based Encoder–Decoder Networks
TLDR
This paper proposes a novel video summarization framework named Attentive encoder–decoder networks for Video Summarization (AVS), in which the encoder uses a bidirectional long short-term memory (BiLSTM) to encode the contextual information among the input video frames.
Video Summarization with Long Short-Term Memory
TLDR
Long Short-Term Memory (LSTM), a special type of recurrent neural network, is used to model the variable-range dependencies entailed in video summarization, and summarization is further improved by reducing the discrepancies in statistical properties across training datasets.
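The two LSTM-based entries above share a common core: a recurrent network that scores each frame's importance. A minimal sketch of that core, with hypothetical feature and hidden sizes and a plain sigmoid head rather than either paper's full model:

import torch
import torch.nn as nn

class BiLSTMScorer(nn.Module):
    """Bidirectional LSTM over frame features, predicting per-frame importance."""
    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, frames):            # frames: (batch, time, feat_dim)
        h, _ = self.lstm(frames)          # (batch, time, 2 * hidden)
        return torch.sigmoid(self.head(h)).squeeze(-1)  # importance in [0, 1]

scores = BiLSTMScorer()(torch.randn(1, 120, 1024))  # 120 frames of features
# key-shots can then be chosen by thresholding or a knapsack over these scores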
...