Multi-modal Segment Assemblage Network for Ad Video Editing with Importance-Coherence Reward

  title={Multi-modal Segment Assemblage Network for Ad Video Editing with Importance-Coherence Reward},
  author={Yun-Qiu Tang and Siting Xu and Teng Wang and Qin Lin and Qinglin Lu and Feng Zheng},
. Advertisement video editing aims to automatically edit advertising videos into shorter videos while retaining coherent content and crucial information conveyed by advertisers. It mainly contains two stages: video segmentation and segment assemblage. The existing method performs well at video segmentation stages but suffers from the problems of dependencies on extra cumbersome models and poor performance at the segment assemblage stage. To address these problems, we propose M-SAN (Multi-modal… 



Multi-modal Representation Learning for Video Advertisement Content Structuring

A multi-modal encoder to learn multi- modal representation from video advertisements by interacting between video-audio and text is proposed and Boundary-Matching Network is applied to generate temporal proposals.

DeepQAMVS: Query-Aware Hierarchical Pointer Networks for Multi-Video Summarization

This work introduces a novel Query-Aware Hierarchical Pointer Network for Multi-Video Summarization, termed DeepQAMVS, that jointly optimizes multiple criteria: conciseness, representativeness of important query-relevant events and chronological soundness.

Transforming Multi-Concept Attention into Video Summarization

This paper proposes an novel attention-based framework for video summarization with complex video data to identify informative regions across temporal and concept video features, which jointly exploit context diversity over time and space for summarization purposes.

Deep Reinforcement Learning for Unsupervised Video Summarization with Diversity-Representativeness Reward

This paper forms video summarization as a sequential decision-making process and develops a deep summarization network (DSN) to summarize videos, which is comparable to or even superior than most of published supervised approaches.

Video Summarization With Attention-Based Encoder–Decoder Networks

This paper proposes a novel video summarization framework named attentive encoder–decoder networks forVideo summarization (AVS), in which the encoder uses a bidirectional long short-term memory (BiLSTM) to encode the contextual information among the input video frames.

Supervised Video Summarization Via Multiple Feature Sets with Parallel Attention

A novel model architecture is suggested that combines three feature sets for visual content and motion to predict importance scores, and improves state-of-the-art results for SumMe, while being on par with the state of the art for TVSum dataset.

Extractive Video Summarizer with Memory Augmented Neural Networks

A memory augmented extractive video summarizer, which utilizes an external memory to record visual information of the whole video with high capacity and demonstrates that the global attention modeling has two advantages: good transferring ability across datasets and high robustness to noisy videos.

Condensed Movies: Story Based Retrieval with Contextual Embeddings

The Condensed Movie Dataset (CMD) is created, consisting of the key scenes from over 3K movies: each key scene is accompanied by a high level semantic description of the scene, character face tracks, and metadata about the movie.

SUSiNet: See, Understand and Summarize It

  • Petros KoutrasP. Maragos
  • Computer Science
    2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
  • 2019
From the extensive evaluation, it is observed that the multi-task network performs as well as the state-of-the-art single-task methods (or in some cases better), while it requires less computational budget than having one independent network per each task.

Stacked Memory Network for Video Summarization

A stacked memory network called SMN is proposed to explicitly model the long dependency among video frames so that redundancy could be minimized in the video summaries produced and is particularly good at capturing long temporal dependency among frames with few additional training parameters.