ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning
@article{Lee2021ACAV100MAC,
  title={ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning},
  author={Sangho Lee and Jiwan Chung and Youngjae Yu and Gunhee Kim and Thomas Breuel and Gal Chechik and Yale Song},
  journal={2021 IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2021},
  pages={10254-10264}
}
The natural association between visual observations and their corresponding sound provides powerful self-supervisory signals for learning video representations, which makes the ever-growing amount of online videos an attractive source of training data. However, large portions of online videos contain irrelevant audio-visual signals because of edited/overdubbed audio, and models trained on such uncurated videos have been shown to learn suboptimal representations. Therefore, existing self-supervised…
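The correspondence the abstract describes is typically exploited with a cross-modal contrastive objective. Below is a minimal, hypothetical sketch (in PyTorch, not the paper's released code) of an audio-visual InfoNCE loss of that kind: clips whose audio and video come from the same moment act as positives and all other pairings in the batch as negatives. The encoder outputs and temperature are illustrative assumptions.

import torch
import torch.nn.functional as F

def audio_visual_infonce(video_emb, audio_emb, temperature=0.07):
    # video_emb, audio_emb: (batch, dim) L2-normalized outputs of
    # hypothetical video/audio encoders for the same batch of clips.
    logits = video_emb @ audio_emb.t() / temperature  # pairwise similarities
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    # Symmetric loss: match each video to its audio and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))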
4 Citations
Robust Contrastive Learning against Noisy Views
- Computer Science
- 2022
This work proposes a new contrastive loss function that is robust against noisy views, is completely modality-agnostic, and serves as a simple drop-in replacement for the InfoNCE loss, making it easy to apply to existing contrastive frameworks.
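As a rough illustration of the kind of drop-in replacement described, here is a hedged sketch modeled on the robust InfoNCE (RINCE) form associated with this paper, reconstructed from memory rather than from the authors' code; the hyperparameters q and lam and the exact normalization are assumptions, with q -> 0 intended to recover standard InfoNCE-like behavior.

import torch

def robust_infonce(pos_logits, neg_logits, q=0.5, lam=0.01):
    # pos_logits: (batch,) similarity of each positive pair / temperature.
    # neg_logits: (batch, n_neg) similarities to negatives / temperature.
    pos = torch.exp(pos_logits)
    neg = torch.exp(neg_logits).sum(dim=1)
    # Assumed RINCE-style form: the q-power softens the positive term,
    # which damps the gradient from mismatched ("noisy") positives.
    loss = -(pos ** q) / q + ((lam * (pos + neg)) ** q) / q
    return loss.mean()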
Skating-Mixer: Multimodal MLP for Scoring Figure Skating
- Computer Science
- 2022
Experiments show the proposed method outperforms SOTAs on all major metrics on the public Fis-V and the authors' FS1000 datasets, and an analysis applying the method to recent competitions from the Beijing 2022 Winter Olympic Games demonstrates its strong robustness.
References
Showing 1-10 of 85 references
Vggsound: A Large-Scale Audio-Visual Dataset
- Computer Science, 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
The goal is to collect a large-scale audio-visual dataset with low label noise from videos ‘in the wild’ using computer vision techniques; various Convolutional Neural Network architectures and aggregation approaches are investigated to establish audio recognition baselines for this new dataset.
Telling Left From Right: Learning Spatial Correspondence of Sight and Sound
- Computer Science, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
This work proposes a novel self-supervised task to leverage an orthogonal principle: matching spatial information in the audio stream to the positions of sound sources in the visual stream, and demonstrates that understanding spatial correspondence enables models to perform better on three audio-visual tasks.
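One concrete way to pose such a spatial correspondence pretext task (a hypothetical sketch, not necessarily the paper's exact formulation) is to randomly swap the stereo channels and train a classifier to detect the swap from paired video and audio features; the encoder dimensions and fusion head below are assumptions.

import torch
import torch.nn as nn

def make_swap_batch(stereo_audio):
    # stereo_audio: (batch, 2, samples); swap L/R channels for a random
    # half of the batch and return the binary swap labels.
    labels = torch.randint(0, 2, (stereo_audio.size(0),))
    flipped = stereo_audio.flip(dims=[1])
    audio = torch.where(labels[:, None, None].bool(), flipped, stereo_audio)
    return audio, labels

class LeftRightProbe(nn.Module):
    # Predicts "swapped or not" from fused video/audio features.
    def __init__(self, video_dim=512, audio_dim=512):
        super().__init__()
        self.head = nn.Linear(video_dim + audio_dim, 2)

    def forward(self, video_feat, audio_feat):
        return self.head(torch.cat([video_feat, audio_feat], dim=-1))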
Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization
- Computer Science, NeurIPS
- 2018
It is demonstrated that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs.
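A minimal sketch of a synchronization objective in this spirit, assuming a margin-based contrastive loss over embedding distances (the margin value and distance choice are illustrative, not the paper's exact recipe):

import torch
import torch.nn.functional as F

def sync_contrastive_loss(v_emb, a_aligned, a_shifted, margin=1.0):
    # All inputs: (batch, dim) embeddings. a_shifted is audio from the
    # same clip as v_emb but temporally offset -- a hard negative.
    d_pos = F.pairwise_distance(v_emb, a_aligned)
    d_neg = F.pairwise_distance(v_emb, a_shifted)
    # Pull aligned pairs together, push shifted pairs past the margin.
    return (d_pos.pow(2) + F.relu(margin - d_neg).pow(2)).mean()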
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
- Computer Science, 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2019
It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.
Look, Listen and Learn
- Computer Science, 2017 IEEE International Conference on Computer Vision (ICCV)
- 2017
There is a valuable, but so far untapped, source of information contained in the video itself: the correspondence between the visual and the audio streams. A novel “Audio-Visual Correspondence” learning task is introduced to make use of it.
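The AVC task reduces to binary classification: do this frame and this audio snippet come from the same video? A minimal sketch, with stand-in encoders rather than the paper's exact L3-Net architecture:

import torch
import torch.nn as nn

class AVCNet(nn.Module):
    def __init__(self, vision_encoder, audio_encoder, dim=512):
        super().__init__()
        self.vision_encoder = vision_encoder  # frame -> (batch, dim)
        self.audio_encoder = audio_encoder    # spectrogram -> (batch, dim)
        self.classifier = nn.Sequential(
            nn.Linear(2 * dim, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, frame, audio):
        # Binary output: do the frame and the audio correspond?
        v = self.vision_encoder(frame)
        a = self.audio_encoder(audio)
        return self.classifier(torch.cat([v, a], dim=-1))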
Self-Supervised Learning by Cross-Modal Audio-Video Clustering
- Computer Science, NeurIPS
- 2020
Cross-Modal Deep Clustering (XDC), a novel self-supervised method that leverages unsupervised clustering in one modality as a supervisory signal for the other modality, is proposed; it is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.
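A hedged sketch of the cross-modal clustering loop: k-means pseudo-labels computed on one modality's features supervise a classifier on the other modality. scikit-learn's KMeans stands in for the paper's clustering, and the alternation schedule between modalities is omitted; both are assumptions for illustration.

import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def audio_pseudo_labels(audio_features, k=256):
    # audio_features: (n_clips, dim) numpy array from an audio encoder;
    # cluster ids become classification targets for the video model.
    return KMeans(n_clusters=k, n_init=10).fit_predict(audio_features)

def video_step(video_logits, pseudo_labels):
    # Train the video model to predict the audio-derived cluster ids.
    targets = torch.as_tensor(pseudo_labels, dtype=torch.long)
    return F.cross_entropy(video_logits, targets)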
Learning Video Representations from Textual Web Supervision
- Computer Science, ArXiv
- 2020
This work proposes a data collection process and uses it to collect 70M video clips, then trains a model to pair each video with its associated text, which leads to improvements over from-scratch training on all benchmarks and outperforms many methods for self-supervised and webly-supervised video representation learning.
YouTube-8M: A Large-Scale Video Classification Benchmark
- Computer Science, ArXiv
- 2016
YouTube-8M is introduced, the largest multi-label video classification dataset, composed of ~8 million videos (500K hours of video), annotated with a vocabulary of 4800 visual entities, and various (modest) classification models are trained on the dataset.
SoundNet: Learning Sound Representations from Unlabeled Video
- Computer Science, NIPS
- 2016
This work proposes a student-teacher training procedure which transfers discriminative visual knowledge from well established visual recognition models into the sound modality using unlabeled video as a bridge, and suggests some high-level semantics automatically emerge in the sound network, even though it is trained without ground truth labels.
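The transfer procedure can be sketched as distillation across modalities: a pretrained visual recognizer produces soft class posteriors on video frames, and the sound network is trained to match them on the paired audio. A minimal sketch, with the teacher outputs assumed precomputed:

import torch.nn.functional as F

def soundnet_transfer_loss(sound_logits, teacher_probs):
    # sound_logits: (batch, n_classes) from the audio (student) network.
    # teacher_probs: (batch, n_classes) soft targets from a pretrained
    # visual teacher applied to the paired video frames.
    return F.kl_div(F.log_softmax(sound_logits, dim=-1),
                    teacher_probs, reduction='batchmean')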
Audio Set: An ontology and human-labeled dataset for audio events
- Computer Science, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2017
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.