Corpus ID: 2915490

SoundNet: Learning Sound Representations from Unlabeled Video

@inproceedings{Aytar2016SoundNetLS,
  title={SoundNet: Learning Sound Representations from Unlabeled Video},
  author={Yusuf Aytar and Carl Vondrick and Antonio Torralba},
  booktitle={NIPS},
  year={2016}
}
We learn rich natural sound representations by capitalizing on large amounts of unlabeled sound data collected in the wild. We leverage the natural synchronization between vision and sound to learn an acoustic representation using two million unlabeled videos. Unlabeled video has the advantage that it can be economically acquired at massive scales, yet contains useful signals about natural sound. We propose a student-teacher training procedure which transfers discriminative visual knowledge… 
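The student-teacher procedure described in the abstract can be pictured as follows. The sketch below is a minimal, hedged illustration in PyTorch: a small 1-D convolutional "student" ingests the raw waveform and is trained to match the class posteriors that frozen, pretrained vision "teacher" networks produce on the synchronized video frames. The layer sizes, class counts, and module names are illustrative assumptions, not the exact SoundNet architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioStudent(nn.Module):
    """Minimal 1-D CNN over raw waveforms (illustrative, not the SoundNet layers)."""
    def __init__(self, num_object_classes=1000, num_scene_classes=401):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=2, padding=32), nn.BatchNorm1d(16), nn.ReLU(), nn.MaxPool1d(8),
            nn.Conv1d(16, 32, kernel_size=32, stride=2, padding=16), nn.BatchNorm1d(32), nn.ReLU(), nn.MaxPool1d(8),
            nn.Conv1d(32, 64, kernel_size=16, stride=2, padding=8), nn.BatchNorm1d(64), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # Two heads, mirroring an object-recognition teacher and a scene teacher.
        self.object_head = nn.Linear(64, num_object_classes)
        self.scene_head = nn.Linear(64, num_scene_classes)

    def forward(self, waveform):                     # waveform: (batch, 1, samples)
        h = self.features(waveform).squeeze(-1)      # (batch, 64)
        return self.object_head(h), self.scene_head(h)

def transfer_loss(student_logits, teacher_probs):
    # KL divergence between the teacher's posterior (computed on video frames)
    # and the student's prediction (computed on sound alone).
    return F.kl_div(F.log_softmax(student_logits, dim=1), teacher_probs, reduction="batchmean")

# One training step, where teacher_obj / teacher_scene are softmax outputs of
# frozen vision networks run on frames of the same unlabeled video:
#   obj_logits, scene_logits = student(waveform)
#   loss = transfer_loss(obj_logits, teacher_obj) + transfer_loss(scene_logits, teacher_scene)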

Citations

The Sound of Pixels
TLDR
Qualitative results suggest the PixelPlayer model learns to ground sounds in vision, enabling applications such as independently adjusting the volume of sound sources, and experimental results show that the proposed Mix-and-Separate framework outperforms several baselines on source separation.
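As a rough illustration of the Mix-and-Separate training signal described above, the hedged sketch below mixes the audio of two videos and asks the network to recover each track conditioned on that video's frames; the targets come for free from the unmixed originals. visual_net and audio_net are placeholder modules, and the L1 loss on masked spectrograms is one common choice rather than the paper's exact objective.

import torch.nn.functional as F

def mix_and_separate_step(visual_net, audio_net, frames_a, frames_b, spec_a, spec_b):
    mix = spec_a + spec_b                              # synthetic mixture spectrogram
    mask_a = audio_net(mix, visual_net(frames_a))      # mask for the source seen in video A
    mask_b = audio_net(mix, visual_net(frames_b))      # mask for the source seen in video B
    # Supervision is free: the original, unmixed spectrograms are the targets.
    return F.l1_loss(mask_a * mix, spec_a) + F.l1_loss(mask_b * mix, spec_b)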
Towards Learning Semantic Audio Representations from Unlabeled Data
TLDR
This work considers several class-agnostic semantic constraints that are inherent to non-speech audio and applies them to sample training data for triplet-loss embedding models using a large unlabeled dataset of YouTube soundtracks to learn semantically structured audio representations.
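A minimal sketch of the triplet-loss setup referenced above, assuming one of the class-agnostic constraints is temporal proximity within the same soundtrack: clips near the anchor are treated as positives, clips from other videos as negatives. The embedding network is left abstract and the margin value is an arbitrary choice.

import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=0.5)

def triplet_step(embed, anchor_clip, nearby_clip, other_clip):
    a = embed(anchor_clip)     # anchor clip
    p = embed(nearby_clip)     # positive: same soundtrack, close in time
    n = embed(other_clip)      # negative: clip drawn from an unrelated video
    return triplet(a, p, n)    # pulls a toward p, pushes a away from n by a margin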
Learning to Separate Object Sounds by Watching Unlabeled Video
TLDR
This work is the first to learn audio source separation from large-scale "in the wild" videos containing multiple audio sources per video, and obtains state-of-the-art results on visually-aided audio source separation and audio denoising.
Generating Visually Aligned Sound From Videos
TLDR
This work introduces an innovative audio forwarding regularizer that directly takes the real sound as input and outputs bottlenecked sound features; it can control the irrelevant sound component and thus prevents the model from learning an incorrect mapping between video frames and sound emitted by objects that are off screen.
Grounding Spoken Words in Unlabeled Video
TLDR
Deep learning models that learn joint multi-modal embeddings in videos where the audio and visual streams are loosely synchronized are explored, and with weak supervision the authors see significant amounts of cross-modal learning.
Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data
TLDR
It is advocated that sound recognition is inherently a multi-modal audiovisual task in that it is easier to differentiate sounds using both the audio and visual modalities as opposed to one or the other.
See, Hear, and Read: Deep Aligned Representations
TLDR
This work utilizes large amounts of readily-available, synchronous data to learn deep discriminative representations shared across three major natural modalities: vision, sound and language, and jointly trains a deep convolutional network for aligned representation learning.
Audio-Visual Model Distillation Using Acoustic Images
TLDR
This paper exploits a new multimodal labeled action recognition dataset acquired by a hybrid audio-visual sensor that provides RGB video, raw audio signals, and spatialized acoustic data, also known as acoustic images, where the visual and acoustic images are aligned in space and synchronized in time.
VideoBERT: A Joint Model for Video and Language Representation Learning
TLDR
This work builds upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively, which can be applied directly to open-vocabulary classification.
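The "visual tokens" mentioned in the VideoBERT summary come from vector-quantizing frame-level features. A hedged sketch of that step, using plain nearest-centroid assignment against precomputed k-means centroids, is below; the feature extractor and centroids are assumed to exist already.

import torch

def quantize(frame_features, centroids):
    # frame_features: (num_frames, dim) deep features; centroids: (vocab_size, dim) from k-means.
    dists = torch.cdist(frame_features, centroids)   # pairwise distances, (num_frames, vocab_size)
    return dists.argmin(dim=1)                       # one discrete visual token id per frame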
...
...

References

Showing 1-10 of 45 references
Unsupervised feature learning for audio classification using convolutional deep belief networks
In recent years, deep learning approaches have gained significant interest as a way of building hierarchical representations from unlabeled data. However, to our knowledge, these deep learning…
Generating Videos with Scene Dynamics
TLDR
A generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from its background is proposed; it can generate tiny videos up to a second long at full frame rate better than simple baselines.
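The foreground/background untangling mentioned above amounts to composing each generated video from a moving foreground stream, a static background stream, and a learned mask. A hedged sketch of that composition follows; tensor shapes are illustrative.

import torch

def compose_video(foreground, mask, background):
    # foreground: (batch, channels, time, height, width)
    # mask:       (batch, 1, time, height, width), values in [0, 1]
    # background: (batch, channels, 1, height, width), broadcast across time
    return mask * foreground + (1 - mask) * background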
Anticipating Visual Representations from Unlabeled Video
TLDR
This work presents a framework that capitalizes on temporal structure in unlabeled video to learn to anticipate future visual representations, and applies recognition algorithms to the predicted representation to anticipate objects and actions.
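The anticipation framework summarized above can be pictured as feature regression: predict the deep feature a pretrained network would assign to a frame several seconds in the future, then run an ordinary classifier on the prediction. The hedged sketch below assumes 4096-dimensional features and an illustrative class count.

import torch.nn as nn
import torch.nn.functional as F

regressor = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))
classifier = nn.Linear(4096, 400)      # number of action classes is illustrative

def anticipation_step(current_feat, future_feat):
    # current_feat / future_feat: deep features of a frame and of a frame a few
    # seconds later, both produced by a frozen pretrained CNN (not shown).
    predicted_future = regressor(current_feat)
    regression_loss = F.mse_loss(predicted_future, future_feat)   # self-supervised, from unlabeled video
    action_scores = classifier(predicted_future)                  # recognition applied to the prediction
    return regression_loss, action_scores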
Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
TLDR
An approach to learning action categories from static images leverages prior observations of generic human motion to augment its training process, enhancing a state-of-the-art technique when very few labeled training examples are available.
CNN architectures for large-scale audio classification
TLDR
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both the training set and the label vocabulary, finding that analogs of the CNNs used in image classification do well on the audio classification task, and that larger training and label sets help up to a point.
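A hedged sketch of the recipe this summary describes: a log-mel spectrogram is treated like an image and fed to an ordinary 2-D CNN with one sigmoid output per label. The tiny architecture and class count below are illustrative, not the paper's configuration (its vocabulary has 30,871 video-level labels).

import torch
import torch.nn as nn
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)
num_classes = 3000                       # illustrative; the paper's label vocabulary is far larger

cnn = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, num_classes),
)

def classify(waveform):                                   # waveform: (batch, samples)
    spec = torch.log(mel(waveform) + 1e-6).unsqueeze(1)   # log-mel "image": (batch, 1, mels, frames)
    return torch.sigmoid(cnn(spec))                       # independent per-label scores (multi-label)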
Audio-visual deep learning for noise robust speech recognition
TLDR
This work uses DBNs for audio-visual speech recognition; in particular, it uses deep learning from audio and visual features for noise-robust speech recognition and tests two methods for using DBNs in a multimodal setting.
Learning Aligned Cross-Modal Representations from Weakly Aligned Data
TLDR
The experiments suggest that the scene representation can help transfer representations across modalities for retrieval and the visualizations suggest that units emerge in the shared representation that tend to activate on consistent concepts independently of the modality.
Polyphonic sound event detection using multi label deep neural networks
TLDR
Frame-wise spectral-domain features are used as inputs to train a deep neural network for multi-label classification in this work, and the proposed method improves the accuracy by 19 percentage points overall.
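A hedged sketch of the frame-wise multi-label setup: each frame's spectral feature vector gets one independent sigmoid output per event class, so overlapping (polyphonic) events can be active at the same time. Layer sizes and counts are illustrative.

import torch.nn as nn

num_bands, num_events = 40, 16           # illustrative feature and class counts
net = nn.Sequential(
    nn.Linear(num_bands, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, num_events),          # one logit per event class, per frame
)
loss_fn = nn.BCEWithLogitsLoss()         # multi-label: event classes are not mutually exclusive

def train_step(frames, targets):
    # frames: (num_frames, num_bands) spectral features; targets: (num_frames, num_events), 0/1 floats
    return loss_fn(net(frames), targets)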
Comparing time and frequency domain for audio event recognition using deep learning
TLDR
The results show that feature learning from the frequency domain is superior to the time domain, and that using convolution and pooling layers to explore local structures of the audio signal significantly improves recognition performance and achieves state-of-the-art results.
Multimodal Deep Learning
TLDR
This work presents a series of tasks for multimodal learning and shows how to train deep networks that learn features to address these tasks, and demonstrates cross-modality feature learning, where better features for one modality can be learned if multiple modalities are present at feature learning time.
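One way to picture the cross-modality feature learning described above is a bimodal autoencoder whose shared code must reconstruct both modalities even when only one is presented at the input. The hedged sketch below follows that spirit; all dimensions and names are illustrative assumptions rather than the paper's exact model.

import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    def __init__(self, audio_dim=100, video_dim=300, shared_dim=128):
        super().__init__()
        self.enc_audio = nn.Linear(audio_dim, shared_dim)
        self.enc_video = nn.Linear(video_dim, shared_dim)
        self.dec_audio = nn.Linear(shared_dim, audio_dim)
        self.dec_video = nn.Linear(shared_dim, video_dim)

    def forward(self, audio=None, video=None):
        # Build the shared code from whichever modalities are present; training with one
        # modality masked out encourages features that carry cross-modal information.
        parts = []
        if audio is not None:
            parts.append(torch.relu(self.enc_audio(audio)))
        if video is not None:
            parts.append(torch.relu(self.enc_video(video)))
        code = torch.stack(parts).sum(dim=0)
        # Always reconstruct both modalities from the shared code.
        return self.dec_audio(code), self.dec_video(code)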
...
...