A Local-to-Global Approach to Multi-Modal Movie Scene Segmentation

@article{Rao2020ALA,
  title={A Local-to-Global Approach to Multi-Modal Movie Scene Segmentation},
  author={Anyi Rao and Linning Xu and Yu Xiong and Guodong Xu and Qingqiu Huang and Bolei Zhou and Dahua Lin},
  journal={2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2020},
  pages={10143-10152}
}
  • Anyi Rao, Linning Xu, Yu Xiong, Guodong Xu, Qingqiu Huang, Bolei Zhou, Dahua Lin
  • Published 6 April 2020
  • Computer Science
  • 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
The scene, as the crucial unit of storytelling in movies, contains complex activities of actors and their interactions in a physical environment. Identifying the composition of scenes is a critical step towards the semantic understanding of movies. This is very challenging: compared to the videos studied in conventional vision problems, e.g. action recognition, scenes in movies usually contain much richer temporal structures and more complex semantic information. Towards this goal, we scale… 
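As background, here is a minimal Python sketch of the shots-to-scenes formulation: cut a scene boundary wherever the similarity between adjacent shot representations drops. The precomputed shot_embeddings input and the thresholded cosine-similarity rule are illustrative assumptions; the paper's actual method learns multi-modal shot representations and aggregates local cues into a global segmentation rather than using this greedy baseline.

import numpy as np

def segment_scenes(shot_embeddings: np.ndarray, threshold: float = 0.7):
    """Greedy toy baseline: start a new scene whenever the cosine
    similarity between adjacent shot embeddings falls below threshold."""
    norms = np.linalg.norm(shot_embeddings, axis=1, keepdims=True)
    feats = shot_embeddings / np.maximum(norms, 1e-8)  # unit-normalize rows
    scenes, current = [], [0]
    for i in range(1, len(feats)):
        if float(feats[i - 1] @ feats[i]) < threshold:
            scenes.append(current)  # similarity dropped: open a new scene
            current = [i]
        else:
            current.append(i)
    scenes.append(current)
    return scenes  # list of scenes, each a list of shot indices

Real systems replace the fixed threshold with learned boundary classifiers or global optimization, since a single threshold cannot adapt to varying editing styles.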
MovieNet: A Holistic Dataset for Movie Understanding
TLDR
MovieNet is the largest dataset with the richest annotations for comprehensive movie understanding; it is believed that such a holistic dataset will promote research on story-based long video understanding and beyond.
Shot Contrastive Self-Supervised Learning for Scene Boundary Detection
TLDR
It is shown how to apply the learned shot representation to scene boundary detection, achieving state-of-the-art performance on the MovieNet dataset while requiring only ~25% of the training labels, using 9× fewer model parameters, and running 7× faster.
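To make the contrastive idea concrete, below is a generic InfoNCE-style loss over a batch of shot embeddings, written with numpy for self-containment. Treating queries[i] and keys[i] as two views of the same shot (e.g. nearby or augmented clips) is an assumption for illustration, not the paper's exact sampling scheme.

import numpy as np

def shot_contrastive_loss(queries, keys, temperature=0.1):
    """InfoNCE: each query should match its own key (positive) and
    reject every other key in the batch (negatives)."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature                 # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diagonal(log_probs).mean())   # positives on the diagonal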
A Multimodal Framework for Video Ads Understanding
TLDR
A multimodal system that improves the structured analysis of advertising video content; it achieved a score of 0.2470, which jointly measures localization and prediction accuracy, ranking fourth on the 2021 TAAC final leaderboard.
Video scene detection based on link prediction using graph convolution network
TLDR
This paper proposes a novel temporal clustering method based on a graph convolution network and the link transitivity of shot nodes, without complicated steps or prior parameter settings such as the number of clusters.
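A rough sketch of the link-transitivity step is below, assuming a trained pairwise link predictor is already available. The link_prob callable, the temporal candidate window, and the threshold tau are illustrative stand-ins for the paper's GCN components.

def cluster_by_link_transitivity(n_shots, link_prob, window=4, tau=0.5):
    """Union shots whose predicted link probability exceeds tau, then
    take the transitive closure with union-find; each resulting
    component is one scene, so no cluster count is ever specified."""
    parent = list(range(n_shots))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i in range(n_shots):
        for j in range(i + 1, min(i + 1 + window, n_shots)):
            if link_prob(i, j) > tau:      # link_prob stands in for the GCN
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(n_shots):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())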
Human Mesh Recovery from Multiple Shots
TLDR
This work exploits the insight that while shot changes within the same scene incur a discontinuity between frames, the 3D structure of the scene still changes smoothly; this allows frames before and after a shot change to be treated as a multi-view signal that provides strong cues to recover the 3D state of the actors.
Plots to Previews: Towards Automatic Movie Preview Retrieval using Publicly Available Meta-data
TLDR
This work uses 51 movies from the MovieGraphs dataset and finds that matching plot summaries with scene dialogues, available through subtitles, is adequate to create usable movie previews without the need for other semantic annotations.
MovieCuts: A New Dataset and Benchmark for Cut Type Recognition
TLDR
The cut type recognition task, which requires modeling multi-modal information, is introduced, and a large-scale dataset called MovieCuts is constructed, containing more than 170K video clips labeled with ten cut types.
Video Ads Content Structuring by Combining Scene Confidence Prediction and Tagging
TLDR
This paper proposes a two-stage method that first predicts scene boundaries and then combines a confidence score for each segmented scene with the tag classes predicted for that scene, improving on previous baselines on the challenging "Tencent Advertisement Video" dataset.
A Unified Framework for Shot Type Classification Based on Subject Centric Lens
TLDR
A learning framework, Subject Guidance Network (SGNet), is proposed for shot type recognition; it separates the subject and background of a shot into two streams, which serve as separate guidance maps for scale and movement type classification, respectively.
Deep Relationship Analysis in Video with Multimodal Feature Fusion
TLDR
A novel multimodal feature fusion method based on scene segmentation is proposed to detect the relationships between entities in a long video, performing well on deep video understanding on the HLVU dataset.

References

Showing 1-10 of 31 references
A Novel Role-Based Movie Scene Segmentation Method
TLDR
The main novelty of this approach is that it converts movie scene segmentation into a movie-script alignment problem and proposes an HMM alignment algorithm to map the script scene structure to the movie content.
Scene detection in Hollywood movies and TV shows
  • Z. Rasheed, M. Shah
  • Computer Science
    2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2003
TLDR
A novel two-pass algorithm for scene boundary detection is proposed, which utilizes the motion content, shot length, and color properties of shots as features; a method is also proposed to describe the content of each scene by selecting one representative image.
Detection and representation of scenes in videos
TLDR
A novel approach for clustering shots into scenes by transforming the task into a graph partitioning problem; the automated procedure is useful for applications such as video-on-demand, digital libraries, and the Internet.
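One standard way to realize such a partitioning is spectral bipartition of the shot-similarity graph; the sketch below uses the sign of the Fiedler vector and can be applied recursively to obtain scenes. This is a textbook technique chosen for illustration, not necessarily the exact formulation of the paper.

import numpy as np

def spectral_bipartition(similarity: np.ndarray):
    """Split the shot graph in two using the eigenvector for the
    second-smallest eigenvalue of the graph Laplacian."""
    degree = np.diag(similarity.sum(axis=1))
    laplacian = degree - similarity
    _, vecs = np.linalg.eigh(laplacian)    # eigenvalues in ascending order
    fiedler = vecs[:, 1]
    group_a = np.where(fiedler >= 0)[0]
    group_b = np.where(fiedler < 0)[0]
    return group_a, group_b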
Temporal Video Segmentation to Scenes Using High-Level Audiovisual Features
TLDR
Improved performance of the proposed approach in comparison to other unimodal and multimodal techniques from the literature is demonstrated, and the contribution of high-level audiovisual features toward improved video segmentation into scenes is highlighted.
Optimal Sequential Grouping for Robust Video Scene Detection Using Multiple Modalities
TLDR
A novel and effective method for the temporal grouping of shots into scenes using an arbitrary set of features computed from the video; it produces a temporally consistent segmentation and has the advantage of being parameter-free, making it applicable across various domains.
Scene Detection in Videos Using Shot Clustering and Sequence Alignment
TLDR
Shots are clustered into groups based only on their visual similarity and each shot is labeled according to the group it belongs to; a sequence alignment algorithm is then applied to detect when the pattern of shot labels changes, providing the final scene segmentation result.
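A toy version of the label-pattern idea, assuming per-shot cluster labels have already been computed (e.g. by k-means on visual features): mark a boundary where the label sets of the preceding and following windows stop overlapping. The windowed set-overlap rule is a simplification of the paper's sequence alignment step.

import numpy as np

def scene_boundaries_from_labels(labels, window=5):
    """Flag positions where no cluster label is shared between the
    window before and the window after; nearby flags would be merged
    in practice."""
    labels = np.asarray(labels)
    boundaries = []
    for t in range(window, len(labels) - window + 1):
        before = set(labels[t - window:t])
        after = set(labels[t:t + window])
        if not (before & after):   # label pattern changed completely
            boundaries.append(t)
    return boundaries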
AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions
  • C. Gu, Chen Sun, +8 authors J. Malik
  • Computer Science
    2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
  • 2018
TLDR
The AVA dataset densely annotates 80 atomic visual actions in 437 15-minute video clips, where actions are localized in space and time, resulting in 1.59M action labels with multiple labels per person occurring frequently.
Video scene segmentation using a novel boundary evaluation criterion and dynamic programming
  • Bo Han, Weiguo Wu
  • Computer Science
    2011 IEEE International Conference on Multimedia and Expo
  • 2011
TLDR
A novel boundary evaluation criterion is proposed, including multiple normalized min-max cut scores, which consider not only neighboring but also non-neighboring scene similarities with a memory-fading model, and the maximal cross-boundary strict shot similarity, which considers both color and structure similarities.
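A compact dynamic-programming sketch in the same spirit: choose scene boundaries that maximize a per-scene score. The mean within-scene similarity used here is an illustrative stand-in for the paper's normalized min-max cut scores and cross-boundary shot similarity.

import numpy as np

def dp_scene_segmentation(sim: np.ndarray, max_scenes: int):
    """dp[k][t] = best score of splitting shots 0..t-1 into k scenes;
    back[k][t] remembers where the last scene starts."""
    n = sim.shape[0]

    def scene_score(a, b):                  # shots a..b-1 as one scene
        return float(sim[a:b, a:b].mean())

    NEG = float("-inf")
    dp = [[NEG] * (n + 1) for _ in range(max_scenes + 1)]
    back = [[0] * (n + 1) for _ in range(max_scenes + 1)]
    dp[0][0] = 0.0
    for k in range(1, max_scenes + 1):
        for t in range(k, n + 1):
            for s in range(k - 1, t):       # last scene covers shots s..t-1
                cand = dp[k - 1][s] + scene_score(s, t)
                if cand > dp[k][t]:
                    dp[k][t], back[k][t] = cand, s
    best_k = max(range(1, max_scenes + 1), key=lambda k: dp[k][n])
    cuts, t, k = [], n, best_k
    while k > 0:                            # walk back through the table
        t = back[k][t]
        cuts.append(t)
        k -= 1
    return sorted(cuts)[1:]                 # interior scene start indices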
Exploring video structure beyond the shots
TLDR
An effective approach for video scene structure construction is presented, in which shots are grouped into semantically related scenes; the output of the proposed algorithm provides a structured video that greatly facilitates users' access.
MovieGraphs: Towards Understanding Human-Centric Situations from Videos
TLDR
MovieGraphs is the first benchmark to focus on inferred properties of human-centric situations, and opens up an exciting avenue towards socially-intelligent AI agents.