Overview of Tencent Multi-modal Ads Video Understanding

  • Zhenzhi Wang, Liyu Wu, Zhimin Li, Jiangfeng Xiong, Qinglin Lu
  • Published 16 September 2021
  • Computer Science
  • Proceedings of the 29th ACM International Conference on Multimedia
The Multi-modal Ads Video Understanding Challenge is the first grand challenge aimed at comprehensively understanding ads videos. The challenge includes two tasks: video structuring and multi-label classification. Video structuring asks participants to accurately predict both the scene boundaries and the multi-label categories of each scene, based on a fine-grained, ads-related category hierarchy. This task will advance the foundation of comprehensive ads video understanding, which has a…


Multi-modal Transformer for Video Retrieval
A multi-modal transformer that jointly encodes the different modalities in video, allowing each of them to attend to the others, together with a novel framework that establishes state-of-the-art results for video retrieval on three datasets.
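The core mechanism here, one modality's tokens attending over another's, can be sketched as plain scaled dot-product attention. The shapes and modality names below are hypothetical, a minimal sketch rather than the paper's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention: tokens of one modality (queries,
    e.g. audio) attend over tokens of another (keys/values, e.g. visual
    features).  queries: (Tq, D), keys/values: (Tk, D)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)    # (Tq, Tk) affinities
    return softmax(scores, axis=-1) @ values  # (Tq, D) fused features
```

In the full model this runs in both directions and is stacked with feed-forward layers, so every modality conditions on every other.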
VideoMix: Rethinking Data Augmentation for Video Classification
It is shown that VideoMix lets a model learn beyond the object and scene biases and extract more robust cues for action recognition, and consistently outperforms other augmentation baselines on Kinetics and the challenging Something-Something-V2 benchmarks.
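The augmentation is CutMix extended to the video volume: paste a region of one clip into another and mix the labels by the pasted area. This numpy sketch implements only the spatial variant (one patch shared across all frames), one of several variants in the paper, with hypothetical shapes:

```python
import numpy as np

def videomix(video_a, video_b, label_a, label_b, alpha=1.0, rng=None):
    """Spatio-temporal CutMix ('VideoMix') sketch: paste a random spatial
    patch from video_b into video_a across all frames, and mix the labels
    in proportion to the pasted area.  Videos: (T, H, W, C); labels are
    probability vectors; alpha parameterises the Beta mix distribution."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)                # fraction kept from video_a
    T, H, W, C = video_a.shape
    # Patch sized so its area is roughly (1 - lam) * H * W
    cut_h = int(H * np.sqrt(1.0 - lam))
    cut_w = int(W * np.sqrt(1.0 - lam))
    cy, cx = rng.integers(0, H), rng.integers(0, W)
    y1, y2 = np.clip([cy - cut_h // 2, cy + cut_h // 2], 0, H)
    x1, x2 = np.clip([cx - cut_w // 2, cx + cut_w // 2], 0, W)
    mixed = video_a.copy()
    mixed[:, y1:y2, x1:x2, :] = video_b[:, y1:y2, x1:x2, :]
    # Recompute lam from the actual (clipped) patch area
    lam = 1.0 - (y2 - y1) * (x2 - x1) / (H * W)
    return mixed, lam * label_a + (1.0 - lam) * label_b
```

The label mixing is what pushes the model past single-object or single-scene shortcuts: it must score both classes in proportion to their visible evidence.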
COIN: A Large-Scale Dataset for Comprehensive Instructional Video Analysis
Proposes a simple yet effective method to capture the dependencies among different steps, which can be easily plugged into conventional proposal-based action detection methods for localizing important steps in instructional videos.
A Local-to-Global Approach to Multi-Modal Movie Scene Segmentation
  • Anyi Rao, Linning Xu, +4 authors Dahua Lin
  • Computer Science
  • 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
This work builds a large-scale video dataset MovieScenes, which contains 21K annotated scene segments from 150 movies, and proposes a local-to-global scene segmentation framework, which integrates multi-modal information across three levels, i.e. clip, segment, and movie.
Shot Contrastive Self-Supervised Learning for Scene Boundary Detection
Shows how to apply the contrastively learned shot representation to scene boundary detection, achieving state-of-the-art performance on the MovieNet dataset while requiring only ~25% of the training labels, using 9× fewer model parameters, and offering 7× faster runtime.
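The self-supervised objective behind such shot representations is typically an InfoNCE-style contrastive loss: an anchor shot embedding should be closest to a positive (e.g. a nearby shot) among all positives in the batch. A minimal numpy sketch, with hypothetical shapes and not the paper's exact loss:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE contrastive loss sketch.  anchors, positives: (N, D)
    L2-normalized embeddings; row i of `positives` is the positive
    for row i of `anchors`, all other rows act as negatives."""
    logits = anchors @ positives.T / temperature   # (N, N) similarities
    # Cross-entropy with the diagonal as the correct match
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Minimizing this pulls each anchor toward its own positive and away from other shots, which is what makes the learned embedding sensitive to scene changes.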
MS-TCN: Multi-Stage Temporal Convolutional Network for Action Segmentation
A multi-stage architecture for the temporal action segmentation task that achieves state-of-the-art results on three challenging datasets: 50Salads, Georgia Tech Egocentric Activities (GTEA), and the Breakfast dataset.
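Each MS-TCN stage is a stack of dilated residual temporal convolutions whose dilation doubles per layer, so the receptive field grows exponentially with depth. A simplified numpy sketch with hypothetical shapes (the real model adds 1x1 convolutions, dropout, and chains several stages on the previous stage's softmax output):

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """'Same'-padded dilated 1-D convolution over time.
    x: (T, Cin) frame features, w: (k, Cin, Cout) kernel."""
    k, T = w.shape[0], x.shape[0]
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((T, w.shape[2]))
    for j in range(k):                       # sum over kernel taps
        out += xp[j * dilation : j * dilation + T] @ w[j]
    return out

def mstcn_stage(x, kernels):
    """One MS-TCN stage sketch: dilated residual layers with dilation
    1, 2, 4, ... so the temporal receptive field grows exponentially.
    kernels: list of (k, C, C) weight arrays (hypothetical shapes)."""
    for l, w in enumerate(kernels):
        h = np.maximum(dilated_conv1d(x, w, dilation=2 ** l), 0.0)  # ReLU
        x = x + h                                                    # residual
    return x
```

Stacking stages lets later stages refine the frame-wise predictions of earlier ones, which is what reduces over-segmentation errors.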
Boundary-Aware Cascade Networks for Temporal Action Segmentation
A new boundary-aware cascade network built on a new cascading paradigm, Stage Cascade, which gives the model adaptive receptive fields and more confident predictions on ambiguous frames, together with a general and principled smoothing operation, local barrier pooling, which aggregates local predictions by leveraging semantic boundary information.
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
Deep convolutional networks have achieved great success for visual recognition in still images, but for action recognition in videos their advantage over traditional methods is less evident; temporal segment networks address this with sparse temporal sampling and video-level supervision over the whole video.
NeXtVLAD: An Efficient Neural Network to Aggregate Frame-level Features for Large-scale Video Classification
A fast and efficient network architecture, NeXtVLAD, to aggregate frame-level features into a compact feature vector for large-scale video classification, which turns out to be both effective and parameter efficient in aggregating temporal information.
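The parameter efficiency comes from splitting each frame feature into low-dimensional groups before VLAD-style aggregation, so the cluster parameters operate on the group dimension rather than the full feature. A simplified numpy sketch with hypothetical shapes (the real NeXtVLAD adds an expansion layer and a per-group attention gate that this omits):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nextvlad_aggregate(frames, cluster_w, centers, groups=4):
    """Group-wise VLAD pooling sketch: split each frame feature into
    `groups` lower-dimensional vectors, soft-assign every group vector
    to K learned cluster centers, and sum the residuals to the centers
    into one compact video-level descriptor.

    frames:    (T, D) frame-level features, D divisible by `groups`
    cluster_w: (D // groups, K) soft-assignment weights
    centers:   (K, D // groups) cluster centers
    returns:   (K * D // groups,) flattened VLAD descriptor
    """
    T, D = frames.shape
    gd = D // groups
    g = frames.reshape(T * groups, gd)           # group-level vectors
    assign = softmax(g @ cluster_w, axis=-1)     # (T*groups, K) soft assignment
    # Weighted residuals to each center: sum_i a_ik * (x_i - c_k)
    vlad = assign.T @ g - assign.sum(0)[:, None] * centers   # (K, gd)
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-8  # intra-norm
    return vlad.ravel()
```

Because the output size is K * (D / groups) instead of K * D, the descriptor (and the classifier on top of it) shrinks by the group factor.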
Two-Stream Convolutional Networks for Action Recognition in Videos
This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
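At inference time the two streams are combined by late fusion of their class scores; averaging the softmax outputs is the simplest scheme the paper considers (it also explores SVM-based fusion). A minimal numpy sketch; the equal 0.5 weighting is an assumption:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def two_stream_fusion(spatial_logits, flow_logits, w_spatial=0.5):
    """Late fusion sketch for a two-stream network: run the spatial
    (RGB) stream and the temporal (optical-flow) stream separately,
    then take a weighted average of their class probabilities.
    Both inputs: (batch, num_classes) logits."""
    return (w_spatial * softmax(spatial_logits)
            + (1.0 - w_spatial) * softmax(flow_logits))
```

Keeping the streams separate until the score level is what lets the flow stream perform well despite limited training data: each network is trained on its own, simpler input distribution.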