Positive Sample Propagation along the Audio-Visual Event Line

@inproceedings{zhou2021positive,
  title={Positive Sample Propagation along the Audio-Visual Event Line},
  author={Jinxing Zhou and Liang Zheng and Yiran Zhong and Shijie Hao and Meng Wang},
  booktitle={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021}
}
  • Jinxing Zhou, Liang Zheng, Yiran Zhong, Shijie Hao, Meng Wang
  • Published 1 April 2021
  • Computer Science, Engineering
  • 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Visual and audio signals often coexist in natural environments, forming audio-visual events (AVEs). Given a video, we aim to localize video segments containing an AVE and identify its category. In order to learn discriminative features for a classifier, it is pivotal to identify the helpful (or positive) audio-visual segment pairs while filtering out the irrelevant ones, regardless of whether they are synchronized or not. To this end, we propose a new positive sample propagation (PSP) module to…
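The propagation idea sketched in the abstract — keep only the strongly similar audio-visual segment pairs and let each segment aggregate features from those positive partners — can be illustrated as follows. This is a hypothetical NumPy simplification; the function name, the thresholding rule, and the residual update are assumptions, not the paper's exact module.

```python
import numpy as np

def positive_sample_propagation(audio, visual, threshold=0.5):
    """Illustrative PSP-style step (assumed simplification, not the
    paper's exact formulation): prune weak audio-visual segment pairs,
    then propagate features across the surviving positive connections."""
    # audio, visual: (T, D) per-segment features for T video segments
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    sim = a @ v.T                                  # (T, T) cross-modal similarity
    sim = np.where(sim >= threshold, sim, 0.0)     # drop negative/weak pairs
    row_sum = sim.sum(axis=1, keepdims=True)
    weights = np.divide(sim, row_sum, out=np.zeros_like(sim),
                        where=row_sum > 0)         # row-normalized connections
    audio_out = audio + weights @ visual           # visual -> audio propagation
    visual_out = visual + weights.T @ audio        # audio -> visual propagation
    return audio_out, visual_out
```

With the threshold above the maximum cosine similarity of 1.0, no pair survives and the features pass through unchanged, which makes the pruning behavior easy to check.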


Contrastive Learning of Global-Local Video Representations
  • Shuang Ma, Zhaoyang Zeng, Daniel J. McDuff, Yale Song
  • Computer Science, Engineering
  • 2021
This work proposes to learn video representations that generalize both to tasks requiring global semantic information and to tasks requiring local fine-grained spatio-temporal information, by optimizing two contrastive objectives that together encourage the model to learn global-local visual information given audio signals.
Hear Me Out: Fusional Approaches for Audio Augmented Temporal Action Localization
This work is the first to jointly consider audio and video modalities for supervised TAL, and experimentally shows that its schemes consistently improve performance for state-of-the-art video-only TAL approaches.
MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing
  • Jiashuo Yu, Ying Cheng, Rui-Wei Zhao, Rui Feng, Yuejie Zhang
  • Computer Science
  • ArXiv
  • 2021
Recognizing and localizing events in videos is a fundamental task for video understanding. Since events may occur in auditory and visual modalities, multimodal detailed perception is essential for…


Audio-Visual Event Localization in Unconstrained Videos
An audio-guided visual attention mechanism to explore audio-visual correlations, a dual multimodal residual network (DMRN) to fuse information over the two modalities, and an audio-visual distance learning network to handle cross-modality localization are developed.
Cross-Modal Relation-Aware Networks for Audio-Visual Event Localization
A relation-aware network is proposed to leverage both audio and visual information for accurate event localization; to reduce interference from the background, an audio-guided spatial-channel attention module guides the model to focus on event-relevant visual regions.
Bidirectional recurrent neural networks
It is shown how the proposed bidirectional structure can be easily modified to allow efficient estimation of the conditional posterior probability of complete symbol sequences without making any explicit assumption about the shape of the distribution.
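The bidirectional structure summarized above — one recurrence scanning the sequence forward, another scanning it backward, with both hidden states available at every timestep — can be sketched as a minimal NumPy illustration. This is an assumed simplification of the idea, not Schuster and Paliwal's exact equations; the function name and weight shapes are hypothetical.

```python
import numpy as np

def birnn_states(x, W_f, U_f, W_b, U_b):
    """Minimal bidirectional RNN sketch (illustrative): a forward and a
    backward tanh recurrence over the sequence, concatenated so every
    timestep conditions on both past and future context."""
    T = x.shape[0]
    H = U_f.shape[0]
    h_fwd = np.zeros((T, H))
    h_bwd = np.zeros((T, H))
    h = np.zeros(H)
    for t in range(T):                     # forward pass: past context
        h = np.tanh(x[t] @ W_f + h @ U_f)
        h_fwd[t] = h
    h = np.zeros(H)
    for t in reversed(range(T)):           # backward pass: future context
        h = np.tanh(x[t] @ W_b + h @ U_b)
        h_bwd[t] = h
    return np.concatenate([h_fwd, h_bwd], axis=1)   # (T, 2H) per-step states
```

Because both directions are available at every step, a classifier on top of these states can use the full sequence context rather than only the past.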
Cross-Modal Attention Network for Temporal Inconsistent Audio-Visual Event Localization
Inspired by the human system, which puts different focuses on specific locations, time segments, and media while performing multi-modality perception, this work provides an attention-based method to simulate that process and achieves state-of-the-art performance.
Dual-modality Seq2Seq Network for Audio-visual Event Localization
  • Yan-Bo Lin, Yu-Jhe Li, Y. Wang
  • Computer Science
  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
A deep neural network named Audio-Visual sequence-to-sequence dual network (AVSDN) is proposed; it learns global and local event information in a sequence-to-sequence manner and can be trained in either fully supervised or weakly supervised settings.
Learning Transferable Visual Models From Natural Language Supervision
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
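The caption-matching pre-training objective summarized above is a symmetric contrastive loss: matched (image, caption) pairs lie on the diagonal of the pairwise similarity matrix and are treated as the correct class in a softmax over each row (image to text) and each column (text to image). The sketch below is a simplified NumPy re-implementation of that objective, not the released CLIP code; the function name and temperature value are assumptions.

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss sketch (assumed simplification):
    cross-entropy over scaled cosine similarities, with the diagonal
    entries — the matched pairs — as the positive class."""
    i = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    t = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = (i @ t.T) / temperature       # (N, N) scaled cosine similarities

    def diag_cross_entropy(z):
        z = z - z.max(axis=1, keepdims=True)             # numerical stability
        log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))                  # diagonal = positives

    # average the image->text and text->image directions
    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))
```

A quick sanity check of the objective: embeddings whose rows are correctly paired should score a lower loss than the same embeddings with the pairing scrambled.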
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss, and it is shown that the scale of the corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
Audiovisual Transformer with Instance Attention for Audio-Visual Event Localization
This work proposes a deep learning framework of cross-modality co-attention for video event localization that is able to produce instance-level attention, identifying image regions associated with the sound or event of interest.
Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data
It is advocated that sound recognition is inherently a multi-modal audiovisual task, in that it is easier to differentiate sounds using both the audio and visual modalities as opposed to one or the other.
Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning
A novel self-supervised framework with a co-attention mechanism is proposed to learn generic cross-modal representations from unlabelled videos in the wild and to further benefit downstream tasks.