Vggsound: A Large-Scale Audio-Visual Dataset

@inproceedings{Chen2020VggsoundAL,
  title={Vggsound: A Large-Scale Audio-Visual Dataset},
  author={Honglie Chen and Weidi Xie and Andrea Vedaldi and Andrew Zisserman},
  booktitle={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2020},
  pages={721--725}
}
  • Honglie Chen, Weidi Xie, Andrea Vedaldi, Andrew Zisserman
  • Published 29 April 2020
  • Computer Science
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Our goal is to collect a large-scale audio-visual dataset with low label noise from videos ‘in the wild’ using computer vision techniques. The resulting dataset can be used for training and evaluating audio recognition models. We make three contributions. First, we propose a scalable pipeline based on computer vision techniques to create an audio dataset from open-source media. Our pipeline involves obtaining videos from YouTube; using image classification algorithms to localize audio-visual… 
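The core idea of the pipeline (visually verifying each clip's candidate audio label before keeping it) can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: `filter_clips`, `toy_classifier`, and the clip dictionaries are invented names, and the real pipeline runs trained image classifiers over YouTube frames at scale.

```python
from collections import Counter

def filter_clips(clips, visual_classifier, min_confidence=0.5):
    """Keep only clips whose frames visually confirm the candidate audio label."""
    kept = []
    for clip in clips:
        label, confidence = visual_classifier(clip["frames"])
        if label == clip["candidate_label"] and confidence >= min_confidence:
            kept.append(clip)
    return kept

def toy_classifier(frames):
    """Stand-in for a trained image classifier: majority vote over
    per-frame (label, score) predictions."""
    votes = Counter(label for label, _ in frames)
    label, count = votes.most_common(1)[0]
    return label, count / len(frames)

clips = [
    {"candidate_label": "dog barking",
     "frames": [("dog barking", 0.9), ("dog barking", 0.8), ("cat", 0.4)]},
    {"candidate_label": "dog barking",
     "frames": [("street", 0.7), ("street", 0.6), ("dog barking", 0.5)]},
]
print([c["candidate_label"] for c in filter_clips(clips, toy_classifier)])
# → ['dog barking']  (the second clip fails visual verification and is dropped)
```

Rejecting clips whose visual content disagrees with the candidate label is what keeps label noise low without manual annotation.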

Citations

ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning
TLDR
This work presents an automatic dataset curation approach based on subset optimization where the objective is to maximize the mutual information between audio and visual channels in videos, and releases ACAV100M that contains 100 million videos with high audio-visual correspondence, ideal for self-supervised video representation learning.
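The ACAV100M objective, maximizing mutual information between audio and visual channels over a selected subset, can be illustrated with a toy greedy selector. This is a hypothetical sketch under simplifying assumptions (discrete audio/visual cluster labels, naive greedy search); the actual system clusters learned features and optimizes at the scale of 100M videos.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Empirical mutual information (in nats) between the audio-cluster
    and visual-cluster labels of a list of (audio, visual) pairs."""
    n = len(pairs)
    audio_counts = Counter(a for a, _ in pairs)
    visual_counts = Counter(v for _, v in pairs)
    joint_counts = Counter(pairs)
    return sum(
        (c / n) * math.log(c * n / (audio_counts[a] * visual_counts[v]))
        for (a, v), c in joint_counts.items()
    )

def greedy_select(candidates, k):
    """Greedily grow a subset that maximizes audio-visual mutual information."""
    subset, remaining = [], list(candidates)
    while len(subset) < k and remaining:
        best = max(remaining, key=lambda c: mutual_information(subset + [c]))
        subset.append(best)
        remaining.remove(best)
    return subset

# Perfectly correlated channels have high MI; independent channels have zero.
correlated = [("bark", "dog"), ("meow", "cat")] * 2
independent = [("bark", "dog"), ("bark", "cat"), ("meow", "dog"), ("meow", "cat")]
```

The greedy pass prefers candidates whose audio and visual labels co-occur consistently, which is the intuition behind selecting videos with high audio-visual correspondence.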
Localizing Visual Sounds the Hard Way
TLDR
The key technical contribution is to show that training the network to explicitly discriminate challenging image fragments, even within images that do contain the sounding object, significantly boosts localization performance; a mechanism automatically mines these hard samples and adds them to a contrastive learning formulation.
Audio-Visual Localization by Synthetic Acoustic Image Generation
TLDR
This work proposes to leverage the generation of synthetic acoustic images from common audio-video data for the task of audio-visual localization, using a novel deep architecture trained to reconstruct the ground truth spatialized audio data collected by a microphone array from the associated video and its corresponding monaural audio signal.
Dual Normalization Multitasking for Audio-Visual Sounding Object Localization
TLDR
A novel multitask training strategy and architecture called Dual Normalization Multitasking (DNM) is proposed, which aggregates the Audio-Visual Correspondence (AVC) task and the video event classification task into a single audiovisual similarity map.
Learning Audio-Video Modalities from Image Captions
TLDR
A new video mining pipeline is proposed that transfers captions from image captioning datasets to video clips with no additional manual effort, and a multimodal transformer-based model trained on this data is shown to achieve competitive performance on video retrieval and video captioning.
Audio-Visual Synchronisation in the wild
TLDR
The proposed model outperforms the previous state-of-the-art by a significant margin on standard lip reading speech benchmarks, LRS2 and LRS3, and sets the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset.
From Semantic Categories to Fixations: A Novel Weakly-supervised Visual-auditory Saliency Detection Approach
TLDR
A novel weakly-supervised approach is proposed that alleviates the demand for large-scale training sets in visual-audio model training by using only video category tags; its selective class activation mapping (SCAM) follows a coarse-to-fine strategy to select the most discriminative regions across the spatial, temporal, and audio dimensions.
Less Can Be More: Sound Source Localization With a Classification Model
TLDR
The key contribution is to show that a simple audio-visual classification model has the ability to localize sound sources accurately and to give on par performance with state-of-the-art methods by proving that indeed "less is more".
Weakly Supervised Visual-Auditory Saliency Detection with Multigranularity Perception
TLDR
This paper promotes a novel weakly supervised approach that alleviates the demand for large-scale training sets in visual-audio model training, distilling knowledge from discriminative regions into complete spatial-temporal-audio (STA) fixation prediction (FP) networks and enabling broad applications in cases where video tags are not available.
Motion-Augmented Self-Training for Video Recognition at Smaller Scale
TLDR
The first motion-augmented self-training regime for 3D convolutional neural network deployment on an unlabeled video collection, which outperforms alternatives for knowledge transfer by 5%-8%, video-only self-supervision by 1%-7% and semi-supervised learning by 9%-18% using the same amount of class labels.

References

Showing 1-10 of 34 references
Voxceleb: Large-scale speaker verification in the wild
Large-Scale Weakly Supervised Audio Classification Using Gated Convolutional Neural Network
In this paper, we present a gated convolutional neural network and a temporal attention-based localization method for audio classification, which won 1st place in the large-scale weakly supervised sound event detection task of the DCASE 2017 challenge.
CNN architectures for large-scale audio classification
TLDR
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both the training set and the label vocabulary, finding that analogs of the CNNs used in image classification do well on this audio classification task, and that larger training and label sets help up to a point.
Deep Convolutional Neural Networks and Data Augmentation for Acoustic Event Recognition
TLDR
This work introduces a convolutional neural network (CNN) with a large input field for AED that significantly outperforms state-of-the-art methods including Bag of Audio Words (BoAW) and classical CNNs, achieving a 16% absolute improvement.
VoxCeleb: A Large-Scale Speaker Identification Dataset
TLDR
This paper proposes a fully automated pipeline based on computer vision techniques to create a large-scale text-independent speaker identification dataset collected 'in the wild', and shows that a CNN-based architecture obtains the best performance for both identification and verification.
Look, Listen and Learn
TLDR
This work identifies a valuable but so far untapped source of information contained in the video itself, the correspondence between the visual and audio streams, and introduces a novel "Audio-Visual Correspondence" learning task that makes use of it.
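The AVC task trains on pairs labeled by whether a frame and an audio segment come from the same clip. A minimal sketch of how such pairs could be constructed (hypothetical helper names, not the paper's code; real inputs would be image tensors and spectrograms rather than strings):

```python
import random

def make_avc_pairs(clips, rng):
    """Build (frame, audio, label) examples for the AVC pretext task:
    label 1 when the frame and audio come from the same clip, 0 when
    the audio is swapped in from a different clip."""
    pairs = []
    for i, clip in enumerate(clips):
        pairs.append((clip["frame"], clip["audio"], 1))      # corresponding
        j = rng.choice([k for k in range(len(clips)) if k != i])
        pairs.append((clip["frame"], clips[j]["audio"], 0))  # mismatched
    return pairs

clips = [{"frame": f"frame{i}", "audio": f"audio{i}"} for i in range(4)]
pairs = make_avc_pairs(clips, random.Random(0))  # seeded rng for reproducibility
```

No manual labels are needed: the positive/negative labels come for free from the videos themselves, which is what makes the task self-supervised.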
Audio Set: An ontology and human-labeled dataset for audio events
TLDR
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
VoxCeleb2: Deep Speaker Recognition
TLDR
A very large-scale audio-visual speaker recognition dataset collected from open-source media is introduced and Convolutional Neural Network models and training strategies that can effectively recognise identities from voice under various conditions are developed and compared.
Very Deep Convolutional Networks for Large-Scale Image Recognition
TLDR
This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.