Look, Listen and Learn

@article{Arandjelovi2017LookLA,
  title={Look, Listen and Learn},
  author={Relja Arandjelovi{\'c} and Andrew Zisserman},
  journal={2017 IEEE International Conference on Computer Vision (ICCV)},
  year={2017},
  pages={609-617}
}
We consider the question: what can be learnt by looking at and listening to a large number of unlabelled videos? […] We also demonstrate that the network is able to localize objects in both modalities, as well as perform fine-grained recognition tasks.
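The self-supervision here is an audio-visual correspondence task: predict whether a video frame and a short audio clip come from the same moment of the same video. Below is a minimal illustrative sketch of that kind of objective; the branch architectures, dimensions, and names are assumptions for the example, not the paper's network.

```python
import torch
import torch.nn as nn

class AVCNet(nn.Module):
    """Toy audio-visual correspondence model (architecture is illustrative only)."""
    def __init__(self, embed_dim=128):
        super().__init__()
        # Vision branch: embeds a single RGB frame.
        self.vision = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, embed_dim),
        )
        # Audio branch: embeds a log-spectrogram of roughly one second of audio.
        self.audio = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, embed_dim),
        )
        # Fusion head: binary decision "do the frame and the audio correspond?"
        self.head = nn.Sequential(nn.Linear(2 * embed_dim, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, frame, spectrogram):
        v = self.vision(frame)
        a = self.audio(spectrogram)
        return self.head(torch.cat([v, a], dim=1))

# Positives: frame and audio drawn from the same video; negatives: mismatched videos.
model = AVCNet()
frames = torch.randn(8, 3, 224, 224)
specs = torch.randn(8, 1, 257, 100)
labels = torch.randint(0, 2, (8,))   # 1 = corresponding pair, 0 = mismatched pair
loss = nn.CrossEntropyLoss()(model(frames, specs), labels)
```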
LiRA: Learning Visual Speech Representations from Audio through Self-supervision
TLDR
This work trains a ResNet+Conformer model to predict acoustic features from unlabelled visual speech and finds that this pre-trained model can be leveraged towards word-level and sentence-level lip-reading through feature extraction and fine-tuning experiments.
Weakly Supervised Representation Learning for Audio-Visual Scene Analysis
TLDR
This work develops methods that identify events and localize corresponding AV cues in unconstrained videos using weak labels, and demonstrates the framework's ability to separate out the audio source of interest through a novel use of nonnegative matrix factorization.
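The separation step mentioned above rests on non-negative matrix factorization of a magnitude spectrogram. The snippet below is a generic, textbook NMF separation sketch (scikit-learn on random stand-in data), not the paper's weakly supervised formulation.

```python
import numpy as np
from sklearn.decomposition import NMF

# Factor a magnitude spectrogram V (freq x time) into spectral templates W
# and temporal activations H, then soft-mask out a single component.
rng = np.random.default_rng(0)
V = np.abs(rng.standard_normal((513, 200)))      # stand-in magnitude spectrogram

nmf = NMF(n_components=4, init="nndsvda", max_iter=400, random_state=0)
W = nmf.fit_transform(V)                         # (513, 4) spectral templates
H = nmf.components_                              # (4, 200) activations over time

component = np.outer(W[:, 0], H[0])              # component 0 = "source of interest"
mask = component / (W @ H + 1e-8)                # Wiener-style soft mask
separated = mask * V
```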
Self-Supervised MultiModal Versatile Networks
TLDR
This work learns representations using self-supervision by leveraging three modalities naturally present in videos: vision, audio and language. It incorporates a novel process of deflation so that the networks can be effortlessly applied to visual data in the form of either video or a static image.
Grounding Spoken Words in Unlabeled Video
TLDR
This work explores deep learning models that learn joint multi-modal embeddings in videos where the audio and visual streams are loosely synchronized, and shows that with weak supervision significant amounts of cross-modal learning emerge.
Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications
TLDR
This work presents a novel unsupervised algorithm to address the problem of localizing sound sources in visual scenes, and extends this proposed algorithm to a new application, sound saliency based automatic camera view panning in 360 degree videos.
Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing
TLDR
Experimental results show that the challenging audio-visual video parsing can be achieved even with only video-level weak labels, and that the proposed framework can effectively leverage unimodal and cross-modal temporal contexts and alleviate the modality bias and noisy label problems.
Self-supervised Co-training for Video Representation Learning
TLDR
This paper investigates the benefit of adding semantic-class positives to instance-based InfoNCE (Info Noise Contrastive Estimation) training, and proposes a novel self-supervised co-training scheme to improve the popular InfoNCE loss.
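For reference, InfoNCE is the standard in-batch contrastive objective; a minimal version (illustrative only, not the proposed co-training scheme) looks like this:

```python
import torch
import torch.nn.functional as F

def info_nce(query, positive, temperature=0.07):
    """In-batch InfoNCE: query[i] should match positive[i] and reject positive[j != i]."""
    q = F.normalize(query, dim=1)
    p = F.normalize(positive, dim=1)
    logits = q @ p.t() / temperature        # (B, B) cosine similarities
    targets = torch.arange(q.size(0))       # true pairs lie on the diagonal
    return F.cross_entropy(logits, targets)

# Example: embeddings of two augmented clips of the same 16 videos.
z1, z2 = torch.randn(16, 128), torch.randn(16, 128)
loss = info_nce(z1, z2)
```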
See the Sound, Hear the Pixels
TLDR
A novel algorithm is proposed that addresses the problem of localizing sound source in unconstrained videos, which uses efficient fusion and attention mechanisms and demonstrates a significant increase in performance over the existing state-of-the-art methods.
Weakly-Supervised Audio-Visual Video Parsing Toward Unified Multisensory Perception
TLDR
This work formulates weakly-supervised audio-visual video parsing as a Multimodal Multiple Instance Learning (MMIL) problem and proposes a new framework to solve it, developing an attentive MMIL pooling method for adaptively aggregating useful audio and visual content from different temporal extents and modalities.
Audio Self-supervised Learning: A Survey
TLDR
This survey summarizes the SSL methods used for audio and speech processing applications, the empirical works that exploit the audio modality in multimodal SSL frameworks, and the existing benchmarks suitable for evaluating the power of SSL in the computer audition domain.
...

References

Showing 1-10 of 45 references
Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification
TLDR
This paper formulates an approach for learning a visual representation from the raw spatiotemporal signals in videos using a Convolutional Neural Network, and shows that this method captures information that is temporally varying, such as human pose.
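The pretext task is temporal order verification: a network is asked whether a sampled frame tuple is in the correct temporal order. A minimal sketch of how such training pairs can be built follows (the sampling details are assumptions, not the paper's exact scheme):

```python
import random
import torch

def order_verification_example(video_frames):
    """Return (frame_triplet, label): label 1 if frames are in temporal order, else 0."""
    i, j, k = sorted(random.sample(range(len(video_frames)), 3))
    triplet = [video_frames[i], video_frames[j], video_frames[k]]
    if random.random() < 0.5:
        return torch.stack(triplet), 1               # correct temporal order
    triplet[0], triplet[1] = triplet[1], triplet[0]  # swap two frames to break the order
    return torch.stack(triplet), 0

video = torch.randn(30, 3, 112, 112)   # stand-in decoded clip of 30 frames
clip, label = order_verification_example(video)
```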
Context Encoders: Feature Learning by Inpainting
TLDR
It is found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures, and can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.
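The context encoder is trained to fill in a masked image region from its surroundings. A toy sketch of that reconstruction objective follows (the paper also adds an adversarial loss; the encoder-decoder here is a placeholder):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy encoder-decoder: reconstruct a masked central region from its context.
encoder = nn.Sequential(nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
                        nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU())
decoder = nn.Sequential(nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                        nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1))

images = torch.randn(4, 3, 64, 64)
masked = images.clone()
masked[:, :, 16:48, 16:48] = 0                      # drop the central 32x32 region
recon = decoder(encoder(masked))
# Reconstruction loss only on the missing region (adversarial term omitted here).
loss = F.mse_loss(recon[:, :, 16:48, 16:48], images[:, :, 16:48, 16:48])
```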
DeViSE: A Deep Visual-Semantic Embedding Model
TLDR
This paper presents a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data and semantic information gleaned from unannotated text, and shows that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training.
SoundNet: Learning Sound Representations from Unlabeled Video
TLDR
This work proposes a student-teacher training procedure which transfers discriminative visual knowledge from well established visual recognition models into the sound modality using unlabeled video as a bridge, and suggests some high-level semantics automatically emerge in the sound network, even though it is trained without ground truth labels.
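One common way to implement such student-teacher transfer is to make the sound network match the class posterior of a frozen, pretrained vision network on aligned unlabeled video. The networks below are stand-ins; only the KL-matching objective is the point of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in networks: a frozen vision "teacher" and an audio "student".
teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1000))   # pretend pretrained image classifier
student = nn.Sequential(nn.Flatten(), nn.Linear(1 * 128 * 44, 1000))  # pretend sound network
teacher.eval()

frames = torch.randn(4, 3, 64, 64)     # video frames (teacher input)
audio = torch.randn(4, 1, 128, 44)     # time-aligned audio spectrograms (student input)

with torch.no_grad():
    target = F.softmax(teacher(frames), dim=1)      # teacher's class posterior
log_pred = F.log_softmax(student(audio), dim=1)     # student's prediction from sound alone
loss = F.kl_div(log_pred, target, reduction="batchmean")
```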
Unsupervised Learning of Spoken Language with Visual Context
TLDR
A deep neural network model capable of rudimentary spoken language acquisition using untranscribed audio training data, whose only supervision comes in the form of contextually relevant visual images, is presented.
Learning to See by Moving
TLDR
It is found that using the same number of training images, features learnt using egomotion as supervision compare favourably to features learnt with class-label as supervision on the tasks of scene recognition, object recognition, visual odometry and keypoint matching.
Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles
TLDR
This work introduces a novel unsupervised learning approach to build features suitable for object detection and classification and, to facilitate the transfer of features to other tasks, proposes the context-free network (CFN), a siamese-ennead convolutional neural network.
Ambient Sound Provides Supervision for Visual Learning
TLDR
This work trains a convolutional neural network to predict a statistical summary of the sound associated with a video frame, and shows that this representation is comparable to that of other state-of-the-art unsupervised learning methods.
Data-dependent Initializations of Convolutional Neural Networks
TLDR
This work presents a fast and simple data-dependent initialization procedure, that sets the weights of a network such that all units in the network train at roughly the same rate, avoiding vanishing or exploding gradients.
WaveNet: A Generative Model for Raw Audio
TLDR
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
...