ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning

@article{Lee2021ACAV100MAC,
  title={ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning},
  author={Sangho Lee and Jiwan Chung and Youngjae Yu and Gunhee Kim and Thomas Breuel and Gal Chechik and Yale Song},
  journal={2021 IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2021},
  pages={10254-10264}
}
The natural association between visual observations and their corresponding sound provides powerful self-supervisory signals for learning video representations, which makes the ever-growing amount of online videos an attractive source of training data. However, large portions of online videos contain irrelevant audio-visual signals because of edited/overdubbed audio, and models trained on such uncurated videos have been shown to learn suboptimal representations. Therefore, existing self-supervised…
Robust Contrastive Learning against Noisy Views
TLDR
This work proposes a new contrastive loss function that is robust against noisy views, is completely modality-agnostic, and serves as a simple drop-in replacement for the InfoNCE loss, which makes it easy to apply to existing contrastive frameworks.
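For reference, the standard InfoNCE objective that such a loss would replace can be written as a cross-entropy over pairwise similarities. A minimal PyTorch sketch (the function name and temperature value are illustrative, not taken from the paper):

import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.1):
    # z_a, z_b: (N, D) embeddings of two views of the same N samples.
    # Matching rows are positives; every other row in the batch is a negative.
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature               # (N, N) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)            # positives on the diagonal

The robust loss summarized above is described as consuming the same paired embeddings while tolerating mismatched (noisy) pairs.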
Skating-Mixer: Multimodal MLP for Scoring Figure Skating
TLDR
Experiments show the proposed method outperforms state-of-the-art approaches on all major metrics on the public Fis-V dataset and the authors' FS1000 dataset, and an analysis applying the method to recent competitions from the Beijing 2022 Winter Olympic Games demonstrates its strong robustness.

References

SHOWING 1-10 OF 85 REFERENCES
Vggsound: A Large-Scale Audio-Visual Dataset
TLDR
This work collects a large-scale audio-visual dataset with low label noise from videos "in the wild" using computer vision techniques, and investigates various convolutional neural network architectures and aggregation approaches to establish audio recognition baselines for the new dataset.
Telling Left From Right: Learning Spatial Correspondence of Sight and Sound
TLDR
This work proposes a novel self-supervised task to leverage an orthogonal principle: matching spatial information in the audio stream to the positions of sound sources in the visual stream, and demonstrates that understanding spatial correspondence enables models to perform better on three audio-visual tasks.
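As a concrete illustration, one way to build a pretext task around this spatial principle is to randomly swap the left/right audio channels and ask a model whether the audio still agrees with the video. The sketch below assumes stereo audio; all names are illustrative rather than taken from the paper.

import torch

def make_spatial_pair(video, stereo_audio, p_flip=0.5):
    # video:        (T, C, H, W) clip tensor.
    # stereo_audio: (2, S) waveform with left/right channels.
    # Returns the (possibly channel-swapped) audio and a binary label:
    # 1 if the spatial layout still matches the video, 0 if it was flipped.
    if torch.rand(1).item() < p_flip:
        stereo_audio = stereo_audio.flip(0)   # swap left and right channels
        label = 0
    else:
        label = 1
    return video, stereo_audio, torch.tensor(label)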
Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization
TLDR
It is demonstrated that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs.
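The "careful choice of negative examples" here typically means drawing out-of-sync audio from the same video rather than from a different one, which produces harder negatives. A hypothetical sampling sketch (the data layout, offsets, and names are assumptions for illustration):

import random

def sample_sync_pair(video_clips, audio_track, clip_len, sr):
    # video_clips: list of (start_sec, frames) tuples from one video.
    # audio_track: 1-D waveform array for the same video, sampled at sr Hz.
    # Returns (frames, audio_clip, label): label 1 for in-sync, 0 for shifted.
    start, frames = random.choice(video_clips)
    if random.random() < 0.5:
        offset = start + random.uniform(2.0, 5.0)   # hard negative: same video, shifted audio
        label = 0
    else:
        offset = start                              # positive: temporally aligned audio
        label = 1
    a0 = int(offset * sr)
    audio_clip = audio_track[a0:a0 + int(clip_len * sr)]
    return frames, audio_clip, label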
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
TLDR
It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.
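Joint text-video embeddings of this kind are commonly trained with a ranking or contrastive objective over paired clips and captions. A minimal max-margin sketch in PyTorch (the loss form and margin are illustrative, not necessarily the paper's exact objective):

import torch
import torch.nn.functional as F

def pairwise_ranking_loss(text_emb, video_emb, margin=0.2):
    # text_emb, video_emb: (N, D) L2-normalized embeddings of paired captions/clips.
    # Pushes matching pairs above mismatched pairs by at least `margin`.
    sims = text_emb @ video_emb.t()            # (N, N) cosine similarities
    pos = sims.diag().unsqueeze(1)             # (N, 1) matched-pair scores
    cost_t2v = F.relu(margin + sims - pos)     # caption vs. wrong clips
    cost_v2t = F.relu(margin + sims - pos.t()) # clip vs. wrong captions
    mask = ~torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    return cost_t2v[mask].mean() + cost_v2t[mask].mean()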
Look, Listen and Learn
TLDR
This work identifies a valuable, but so far untapped, source of information contained in the video itself, namely the correspondence between the visual and audio streams, and introduces a novel "Audio-Visual Correspondence" learning task that makes use of it.
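Such a correspondence task can be sketched as a two-stream network whose embeddings are fused and fed to a binary classifier predicting whether a frame and an audio clip come from the same video; the encoders and dimensions below are placeholders, not the architecture from the paper.

import torch
import torch.nn as nn

class AVCorrespondence(nn.Module):
    # Binary audio-visual correspondence: do this frame and this audio clip match?
    def __init__(self, vision_encoder, audio_encoder, dim=512):
        super().__init__()
        self.vision_encoder = vision_encoder   # frames -> (N, dim)
        self.audio_encoder = audio_encoder     # spectrograms -> (N, dim)
        self.classifier = nn.Sequential(
            nn.Linear(2 * dim, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, frames, audio):
        v = self.vision_encoder(frames)
        a = self.audio_encoder(audio)
        return self.classifier(torch.cat([v, a], dim=-1))   # logits: match / mismatch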
Self-Supervised Learning by Cross-Modal Audio-Video Clustering
TLDR
This work proposes Cross-Modal Deep Clustering (XDC), a novel self-supervised method that leverages unsupervised clustering in one modality as a supervisory signal for the other modality; it is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.
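In one direction, cross-modal clustering of this kind amounts to clustering the audio features and using the cluster assignments as classification targets for the video model (with the roles alternated between modalities). A simplified single-direction sketch using scikit-learn and PyTorch (the number of clusters and the function name are illustrative):

import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def cross_modal_pseudolabel_loss(video_logits, audio_features, k=256):
    # audio_features: (N, D) numpy array of audio embeddings.
    # video_logits:   (N, k) predictions of the video model for the same N clips.
    # Cluster assignments from the audio modality act as pseudo-labels for video.
    pseudo_labels = KMeans(n_clusters=k, n_init=10).fit_predict(audio_features)
    targets = torch.as_tensor(pseudo_labels, dtype=torch.long)
    return F.cross_entropy(video_logits, targets)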
Learning Video Representations from Textual Web Supervision
TLDR
This work proposes a data collection process and uses it to collect 70M video clips, then trains a model to pair each video with its associated text; this leads to improvements over from-scratch training on all benchmarks and outperforms many methods for self-supervised and webly-supervised video representation learning.
YouTube-8M: A Large-Scale Video Classification Benchmark
TLDR
YouTube-8M is introduced, the largest multi-label video classification dataset, composed of ~8 million videos (500K hours of video) annotated with a vocabulary of 4800 visual entities; various (modest) classification models are trained on the dataset.
SoundNet: Learning Sound Representations from Unlabeled Video
TLDR
This work proposes a student-teacher training procedure which transfers discriminative visual knowledge from well established visual recognition models into the sound modality using unlabeled video as a bridge, and suggests some high-level semantics automatically emerge in the sound network, even though it is trained without ground truth labels.
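The student-teacher transfer summarized here can be sketched as minimizing the KL divergence between a frozen visual teacher's class posterior on video frames and an audio student's prediction on the accompanying sound; the encoders below are placeholders rather than SoundNet's actual networks.

import torch
import torch.nn.functional as F

def visual_to_audio_distillation(audio_student, visual_teacher, frames, waveform):
    # visual_teacher(frames)  -> (N, C) logits from a pretrained image model (frozen).
    # audio_student(waveform) -> (N, C) logits from the sound network being trained.
    with torch.no_grad():
        teacher_probs = F.softmax(visual_teacher(frames), dim=-1)
    student_log_probs = F.log_softmax(audio_student(waveform), dim=-1)
    # KL(teacher || student), averaged over the batch.
    return F.kl_div(student_log_probs, teacher_probs, reduction='batchmean')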
Audio Set: An ontology and human-labeled dataset for audio events
TLDR
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
...