Corpus ID: 216553230

Audio-Visual Instance Discrimination with Cross-Modal Agreement

@article{Morgado2020AudioVisualID,
  title={Audio-Visual Instance Discrimination with Cross-Modal Agreement},
  author={Pedro Morgado and Nuno Vasconcelos and Ishan Misra},
  journal={ArXiv},
  year={2020},
  volume={abs/2004.12943}
}
We present a self-supervised learning approach to learn audio-visual representations from video and audio. Our method uses contrastive learning for cross-modal discrimination of video from audio and vice versa. We show that optimizing for cross-modal discrimination, rather than within-modal discrimination, is important to learn good representations from video and audio. With this simple but powerful insight, our method achieves state-of-the-art results when finetuned on action recognition tasks…
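
The cross-modal objective the abstract describes can be written as a standard InfoNCE loss computed across modalities. Below is a minimal PyTorch sketch, assuming each batch contains paired video/audio embeddings from the same clips; the names (cross_modal_nce, temperature) are illustrative assumptions, not the authors' code.

import torch
import torch.nn.functional as F

def cross_modal_nce(video_emb, audio_emb, temperature=0.07):
    """Symmetric video-to-audio and audio-to-video InfoNCE loss.

    video_emb, audio_emb: (N, D) tensors; row i of each comes from the
    same clip, so it is the positive pair and the remaining N-1 rows
    serve as cross-modal negatives. No within-modal terms are used.
    """
    v = F.normalize(video_emb, dim=1)
    a = F.normalize(audio_emb, dim=1)
    logits = v @ a.t() / temperature            # (N, N) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example: 8 clips with 128-dim embeddings.
loss = cross_modal_nce(torch.randn(8, 128), torch.randn(8, 128))
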
Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision
TLDR
The method attains competitive performance with respect to existing self-supervised audio features on established isolated word classification benchmarks, and significantly outperforms other methods at learning from fewer labels.
Self-Supervised MultiModal Versatile Networks
TLDR
This work learns representations using self-supervision by leveraging three modalities naturally present in videos (vision, audio and language), and incorporates a novel process of deflation so that the networks can be effortlessly applied to visual data in the form of either video or a static image.
Learning Audio-Visual Representations with Active Contrastive Coding
TLDR
This paper proposes an active contrastive coding approach that builds an 'actively sampled' dictionary with diverse and informative items, which improves the quality of negative samples and achieves substantially improved results on tasks where there is high mutual information in the data.
Does Visual Self-Supervision Improve Learning of Speech Representations?
TLDR
The results demonstrate the potential of visual self-supervision for audio feature learning and suggest that joint visual and audio self-supervision leads to more informative speech representations.
Active Contrastive Learning of Audio-Visual Video Representations
TLDR
An active contrastive learning approach that builds an actively sampled dictionary with diverse and informative items, which improves the quality of negative samples and improves performance on tasks where there is high mutual information in the data, e.g., video classification.
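
The "actively sampled dictionary" idea in the two entries above can be illustrated with a generic hard-negative selection step: rather than drawing negatives uniformly from a memory bank, pick the entries most similar to each query. This is a simplified sketch under that assumption, not the papers' actual sampling policy.

import torch
import torch.nn.functional as F

def hardest_negatives(queries, memory_bank, k=64):
    """Indices of the k most similar (hence most informative) bank
    entries per query; queries is (N, D), memory_bank is (M, D)."""
    q = F.normalize(queries, dim=1)
    m = F.normalize(memory_bank, dim=1)
    return (q @ m.t()).topk(k, dim=1).indices   # (N, k)

idx = hardest_negatives(torch.randn(4, 128), torch.randn(4096, 128), k=16)
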
Cross-Modal Attention Consistency for Video-Audio Unsupervised Learning
TLDR
This paper introduces a pretext task, Cross-Modal Attention Consistency (CMAC), for exploring the bidirectional local correspondence property and demonstrates that CMAC can improve the state-of-the-art performance on both visual and audio modalities.
Contrastive Learning of Global and Local Audio-Visual Representations
TLDR
This work proposes a versatile self-supervised approach to learn audio-visual representations that generalize to both the tasks which require global semantic information and the tasks that require fine-grained spatio-temporal information.
Parameter Efficient Multimodal Transformers for Video Representation Learning
TLDR
This work alleviates the high memory requirement of Transformers by sharing weights across layers and modalities; it decomposes the Transformer into modality-specific and modality-shared parts so that the model learns the dynamics of each modality both individually and together, and proposes a novel parameter-sharing scheme based on low-rank approximation.
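
One way to realize the cross-layer weight sharing described above is to reuse a single encoder layer at every depth, so the parameter count is independent of depth. A hedged sketch, using torch.nn.TransformerEncoderLayer as a stand-in for the paper's architecture:

import torch
import torch.nn as nn

class SharedDepthEncoder(nn.Module):
    """Applies one TransformerEncoderLayer `depth` times; parameters
    do not grow with depth because the same layer is reused."""
    def __init__(self, d_model=256, nhead=4, depth=6):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.depth = depth

    def forward(self, x):
        for _ in range(self.depth):
            x = self.layer(x)   # identical weights at every step
        return x

out = SharedDepthEncoder()(torch.randn(2, 10, 256))   # -> (2, 10, 256)
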
Watching Too Much Television is Good: Self-Supervised Audio-Visual Representation Learning from Movies and TV Shows
TLDR
It is demonstrated that a simple model based on contrastive learning, trained on a collection of movies and TV shows, not only dramatically outperforms more complex methods which are trained on orders of magnitude larger uncurated datasets, but also performs very competitively with the state-of-the-art that learns from large-scale curated data.
Robust Audio-Visual Instance Discrimination
TLDR
A self-supervised learning method to learn audio and video representations with a weighted contrastive learning loss that addresses the problems of audio-visual instance discrimination and improves transfer learning performance.

References

Showing 1-10 of 109 references
Self-Supervised Learning by Cross-Modal Audio-Video Clustering
TLDR
Proposes Cross-Modal Deep Clustering (XDC), a novel self-supervised method that leverages unsupervised clustering in one modality as a supervisory signal for the other modality; XDC is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.
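
The cross-modal clustering recipe summarized above can be sketched as: cluster features from one modality, then use the cluster assignments as classification targets for the other modality's encoder. The sketch below uses scikit-learn's KMeans and random features as stand-ins; it illustrates the general recipe, not XDC's implementation.

import numpy as np
from sklearn.cluster import KMeans

audio_feats = np.random.randn(1000, 128)   # stand-in audio-encoder outputs

# Pseudo-labels for the video model come from clustering the audio modality.
pseudo_labels = KMeans(n_clusters=64, n_init=10).fit_predict(audio_feats)

# A video classifier is then trained with cross-entropy against
# pseudo_labels (training loop omitted); the roles of the two
# modalities can be swapped or alternated across epochs.
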
Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization
TLDR
It is demonstrated that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs.
Look, Listen and Learn
TLDR
Identifies a valuable, but so far untapped, source of information contained in the video itself, the correspondence between the visual and audio streams, and introduces a novel “Audio-Visual Correspondence” learning task that makes use of it.
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
TLDR
It is argued that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation, and it is proposed to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned.
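
The alignment-prediction pretext task described above needs positive (in-sync) and negative (out-of-sync) pairs. A minimal sketch of one way to build them, assuming audio is an (N, T) waveform batch and misalignment is simulated by a circular time shift; the shift size and wraparound are simplifying assumptions.

import torch

def make_sync_batch(video, audio, t_shift=8000):
    """video: (N, ...) clips; audio: (N, T) matching waveforms.
    Returns 2N pairs: N aligned (label 1) and N time-shifted (label 0)."""
    shifted = torch.roll(audio, shifts=t_shift, dims=-1)
    v = torch.cat([video, video], dim=0)
    a = torch.cat([audio, shifted], dim=0)
    labels = torch.cat([torch.ones(len(video)), torch.zeros(len(video))])
    return v, a, labels

v, a, y = make_sync_batch(torch.randn(4, 3, 16, 112, 112), torch.randn(4, 48000))
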
Unsupervised Visual Representation Learning by Context Prediction
TLDR
It is demonstrated that the feature representation learned using this within-image context indeed captures visual similarity across images and allows us to perform unsupervised visual discovery of objects like cats, people, and even birds from the Pascal VOC 2011 detection dataset.
Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification
TLDR
This paper formulates an approach for learning a visual representation from the raw spatiotemporal signals in videos using a Convolutional Neural Network, and shows that this method captures information that is temporally varying, such as human pose.
Objects that Sound
TLDR
New network architectures are designed that can be trained using the AVC task for two functionalities: cross-modal retrieval, and localizing the source of a sound in an image.
Self-Supervised Representation Learning by Rotation Feature Decoupling
  • Zeyu Feng, Chang Xu, D. Tao
  • 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019
TLDR
A self-supervised learning method that focuses on beneficial properties of the representation and its ability to generalize to real-world tasks, and decouples rotation discrimination from instance discrimination, which allows it to improve rotation prediction by mitigating the influence of rotation label noise.
Learning Video Representations using Contrastive Bidirectional Transformer
This paper proposes a self-supervised learning approach for video features that results in significantly improved performance on downstream tasks (such as video classification, captioning and…
SoundNet: Learning Sound Representations from Unlabeled Video
TLDR
This work proposes a student-teacher training procedure which transfers discriminative visual knowledge from well-established visual recognition models into the sound modality using unlabeled video as a bridge, and suggests some high-level semantics automatically emerge in the sound network, even though it is trained without ground truth labels.
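
The student-teacher transfer in the summary amounts to matching the sound network's outputs to the posteriors a pretrained vision network produces on the paired frames. A standard distillation-loss sketch, with the temperature value and all names as assumptions rather than SoundNet's exact setup:

import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between softened teacher and student distributions;
    the T*T factor keeps gradient magnitudes comparable across T."""
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)

loss = distill_loss(torch.randn(8, 1000), torch.randn(8, 1000))
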