Multi-modal Representation Learning for Video Advertisement Content Structuring

  • Daya Guo, Zhaoyang Zeng
  • Published 4 September 2021
  • Computer Science
  • Proceedings of the 29th ACM International Conference on Multimedia
Video advertisement content structuring aims to segment a given video advertisement and label each segment on various dimensions, such as presentation form, scene, and style. Unlike real-life videos, video advertisements contain ample, useful multi-modal content such as captions and speech, which provides crucial video semantics and can enhance the structuring process. In this paper, we propose a multi-modal encoder to learn multi-modal representation from video advertisements by…

Multi-modal Representation Learning for Short Video Understanding and Recommendation
A multi-modal representation learning method that improves the performance of recommender systems, together with a novel Key-Value Memory that maps dense real values into vectors and thereby captures richer semantics in a nonlinear manner.
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.
Active Contrastive Learning of Audio-Visual Video Representations
An active contrastive learning approach that builds an actively sampled dictionary with diverse and informative items, which improves the quality of negative samples and improves performance on tasks where there is high mutual information in the data, e.g., video classification.
BMN: Boundary-Matching Network for Temporal Action Proposal Generation
This work proposes an effective, efficient and end-to-end proposal generation method, named Boundary-Matching Network (BMN), which generates proposals with precise temporal boundaries as well as reliable confidence scores simultaneously, and can achieve state-of-the-art temporal action detection performance.
Fast Learning of Temporal Action Proposal via Dense Boundary Generator
An efficient and unified framework for generating temporal action proposals, named Dense Boundary Generator (DBG), which draws inspiration from boundary-sensitive methods and implements boundary classification and action completeness regression for densely distributed proposals.
BSN: Boundary Sensitive Network for Temporal Action Proposal Generation
An effective proposal generation method, named Boundary-Sensitive Network (BSN), which adopts a "local to global" fashion and significantly improves the state-of-the-art temporal action detection performance.
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification
It is shown that many of the 3D convolutions can be replaced with low-cost 2D convolutions, suggesting that temporal representation learning on high-level “semantic” features is more useful.
The THUMOS challenge on action recognition for videos "in the wild"
The THUMOS benchmark is described in detail and an overview of data collection and annotation procedures is given, including a comprehensive empirical study evaluating the differences in action recognition between trimmed and untrimmed videos, and how well methods trained on trimmed videos generalize to untrimmed videos.
CNN architectures for large-scale audio classification
  • Shawn Hershey, S. Chaudhuri, +10 authors K. Wilson
  • Computer Science, Mathematics
  • 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2017
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on the authors' audio classification task, and larger training and label sets help up to a point.
Attention is All you Need
A new simple network architecture, the Transformer, is proposed; it is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.