Look, Listen and Learn

@article{Arandjelovi2017LookLA,
  title={Look, Listen and Learn},
  author={Relja Arandjelovi{\'c} and Andrew Zisserman},
  journal={2017 IEEE International Conference on Computer Vision (ICCV)},
  year={2017},
  pages={609-617}
}
We consider the question: what can be learnt by looking at and listening to a large number of unlabelled videos? […] Key Result: We also demonstrate that the network is able to localize objects in both modalities, as well as perform fine-grained recognition tasks.
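The training signal in the paper is the audio-visual correspondence (AVC) task: a two-stream network sees a video frame and a short audio clip and must decide whether they come from the same video at the same moment. Below is a minimal PyTorch sketch of that setup; the tower architectures, input sizes, and the AVCNet name are illustrative stand-ins, not the paper's exact L3-Net configuration.

# Minimal sketch of the AVC objective: a binary classifier decides
# whether a frame and an audio clip correspond. Sizes are illustrative.
import torch
import torch.nn as nn

class AVCNet(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        # Vision subnetwork: image -> embedding (stand-in for the VGG-style tower).
        self.vision = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Audio subnetwork: log-spectrogram (1 channel) -> embedding.
        self.audio = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Fusion head: concatenated embeddings -> correspond / don't correspond.
        self.head = nn.Sequential(
            nn.Linear(2 * embed_dim, 128), nn.ReLU(), nn.Linear(128, 2),
        )

    def forward(self, frame, spectrogram):
        v = self.vision(frame)
        a = self.audio(spectrogram)
        return self.head(torch.cat([v, a], dim=1))

# Positives pair a frame with its own audio; negatives pair it with
# audio drawn from a different video. Labels: 1 = correspond, 0 = not.
net = AVCNet()
frames = torch.randn(8, 3, 224, 224)   # batch of video frames
specs = torch.randn(8, 1, 257, 199)    # batch of log-spectrograms
labels = torch.randint(0, 2, (8,))     # mixed positive/negative batch
loss = nn.CrossEntropyLoss()(net(frames, specs), labels)
loss.backward()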
Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning
TLDR
A novel self-supervised framework with a co-attention mechanism is proposed to learn generic cross-modal representations from unlabelled videos in the wild and to further benefit downstream tasks.
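A co-attention mechanism of this kind typically lets each modality attend over the other's features. The single-head dot-product form below is an assumption for illustration, not the paper's exact formulation.

# Sketch of cross-modal co-attention: audio tokens attend over visual
# tokens and vice versa. Dimensions are illustrative.
import torch
import torch.nn.functional as F

def cross_attend(queries, keys_values):
    """queries: (B, Nq, D); keys_values: (B, Nk, D) -> (B, Nq, D)."""
    scores = queries @ keys_values.transpose(1, 2) / queries.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ keys_values

visual = torch.randn(4, 49, 128)               # 7x7 spatial visual tokens
audio = torch.randn(4, 16, 128)                # audio time-step tokens
visual_attended = cross_attend(visual, audio)  # vision attends to audio
audio_attended = cross_attend(audio, visual)   # audio attends to vision
print(visual_attended.shape, audio_attended.shape)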
Objects that Sound
TLDR
New network architectures are designed that can be trained with the audio-visual correspondence (AVC) task for two functionalities: cross-modal retrieval and localizing the source of a sound in an image.
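Localizing a sound source with AVC-style embeddings amounts to scoring each spatial cell of the visual feature map against the audio embedding. A minimal numpy sketch of that scoring step, with all shapes and the cosine-similarity choice assumed:

# High-similarity cells in the resulting heatmap indicate where the
# sounding object is likely to be.
import numpy as np

def sound_localization_heatmap(visual_map, audio_vec):
    """visual_map: (H, W, D) per-cell visual embeddings;
    audio_vec: (D,) audio embedding. Returns an (H, W) heatmap."""
    v = visual_map / (np.linalg.norm(visual_map, axis=-1, keepdims=True) + 1e-8)
    a = audio_vec / (np.linalg.norm(audio_vec) + 1e-8)
    return v @ a  # cosine similarity per spatial location

heatmap = sound_localization_heatmap(np.random.randn(14, 14, 128),
                                     np.random.randn(128))
print(heatmap.shape)  # (14, 14); upsample to image size for display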
Look, Listen and Infer
TLDR
For the first time, a Look, Listen and Infer Network (LLINet) is proposed to learn a zero-shot model that can infer the relations of visual scenes and sounds from novel categories never seen before, indicating that zero-shot learning for visual scenes and sounds is feasible.
LiRA: Learning Visual Speech Representations from Audio through Self-supervision
TLDR
This work trains a ResNet+Conformer model to predict acoustic features from unlabelled visual speech and finds that this pre-trained model can be leveraged towards word-level and sentence-level lip-reading through feature extraction and fine-tuning experiments.
Weakly Supervised Representation Learning for Audio-Visual Scene Analysis
TLDR
This work develops methods that identify events and localize corresponding audio-visual (AV) cues in unconstrained videos using weak labels, and demonstrates the framework's ability to separate out the audio source of interest through a novel use of nonnegative matrix factorization.
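Nonnegative matrix factorization decomposes a magnitude spectrogram V ≈ WH into spectral templates W and activations H. The generic multiplicative-update sketch below illustrates the tool itself, not the authors' weakly supervised formulation:

# Standard multiplicative updates for the Euclidean NMF objective.
import numpy as np

def nmf(V, k, iters=200, eps=1e-9):
    """V: nonnegative (freq, time) spectrogram; k: number of components."""
    F, T = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((F, k)) + eps
    H = rng.random((k, T)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.abs(np.random.randn(257, 400))  # stand-in magnitude spectrogram
W, H = nmf(V, k=20)
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))  # relative error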
Self-Supervised Learning of Audio-Visual Objects from Video
TLDR
This work introduces a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time; it significantly outperforms other self-supervised approaches and obtains performance competitive with methods that use supervised face detection.
Self-Supervised MultiModal Versatile Networks
TLDR
This work learns representations using self-supervision by leveraging three modalities naturally present in videos (vision, audio and language) and incorporates a novel process of deflation, so that the networks can be effortlessly applied to visual data in the form of video or a static image.
Learning to Separate Object Sounds by Watching Unlabeled Video
TLDR
This work is the first to learn audio source separation from large-scale "in the wild" videos containing multiple audio sources per video, and obtains state-of-the-art results on visually-aided audio source separation and audio denoising.
Grounding Spoken Words in Unlabeled Video
TLDR
Deep learning models that learn joint multi-modal embeddings in videos where the audio and visual streams are loosely synchronized are explored; with weak supervision, the authors see significant amounts of cross-modal learning.
Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings
TLDR
This paper investigates how L3-Net design choices impact the performance of downstream audio classifiers trained with these embeddings, and shows that audio-informed choices of input representation are important, and that using sufficient data for training the embedding is key.
…

References

Showing 1–10 of 43 references
Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification
TLDR
This paper presents an approach for learning a visual representation from the raw spatiotemporal signals in videos using a convolutional neural network, and shows that this method captures information that is temporally varying, such as human pose.
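The pretext task is to verify whether a sampled triple of frames appears in the correct temporal order. A toy data-generation sketch (the paper's actual sampling favours high-motion windows, omitted here):

# Positive tuples keep temporal order; negatives break it.
import random

def make_tuple(num_frames, positive):
    """Returns (frame_indices, label); label 1 = correct order."""
    a, b, c = sorted(random.sample(range(num_frames), 3))
    if positive:
        return (a, b, c), 1  # correct temporal order
    return (b, a, c), 0      # middle frame moved out of order

for _ in range(3):
    print(make_tuple(100, positive=random.random() < 0.5))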
Unsupervised Visual Representation Learning by Context Prediction
TLDR
It is demonstrated that the feature representation learned using this within-image context indeed captures visual similarity across images and allows us to perform unsupervised visual discovery of objects like cats, people, and even birds from the Pascal VOC 2011 detection dataset.
Zero-Shot Learning Through Cross-Modal Transfer
TLDR
This work introduces a model that can recognize objects in images even if no training data is available for the object class, and uses novelty detection methods to differentiate unseen classes from seen classes.
Context Encoders: Feature Learning by Inpainting
TLDR
It is found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures, and can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.
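The context-encoder objective masks a region of the image and penalizes reconstruction only inside the hole. A sketch of that loss wiring, assuming a central square mask; the paper's adversarial term is omitted and the decoder output is a stand-in tensor:

# Reconstruction loss restricted to the masked region.
import torch

def masked_reconstruction_loss(pred, target, mask):
    """pred/target: (B, C, H, W); mask: (B, 1, H, W), 1 inside the hole."""
    return ((pred - target) ** 2 * mask).sum() / mask.sum().clamp(min=1)

imgs = torch.rand(4, 3, 128, 128)
mask = torch.zeros(4, 1, 128, 128)
mask[:, :, 32:96, 32:96] = 1.0  # central 64x64 hole
inputs = imgs * (1 - mask)      # masked image (would be fed to the encoder-decoder)
pred = torch.rand(4, 3, 128, 128, requires_grad=True)  # stand-in decoder output
loss = masked_reconstruction_loss(pred, imgs, mask)
loss.backward()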
DeViSE: A Deep Visual-Semantic Embedding Model
TLDR
This paper presents a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data and semantic information gleaned from unannotated text, and shows that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training.
SoundNet: Learning Sound Representations from Unlabeled Video
TLDR
This work proposes a student-teacher training procedure that transfers discriminative visual knowledge from well-established visual recognition models into the sound modality using unlabeled video as a bridge, and suggests that some high-level semantics automatically emerge in the sound network, even though it is trained without ground-truth labels.
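The student-teacher transfer trains the sound network to match the class posteriors a pretrained visual network assigns to the accompanying frames. A sketch of the loss wiring only, with both networks replaced by stand-in tensors; the KL form is one common distillation choice:

# KL(teacher || student) over class posteriors, averaged over the batch.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs):
    log_q = F.log_softmax(student_logits, dim=1)
    return F.kl_div(log_q, teacher_probs, reduction="batchmean")

teacher_probs = torch.softmax(torch.randn(8, 1000), dim=1)  # e.g. ImageNet posteriors
student_logits = torch.randn(8, 1000, requires_grad=True)   # sound-network output
loss = distillation_loss(student_logits, teacher_probs)
loss.backward()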
Unsupervised Learning of Spoken Language with Visual Context
TLDR
A deep neural network model capable of rudimentary spoken language acquisition using untranscribed audio training data, whose only supervision comes in the form of contextually relevant visual images, is presented.
Learning to See by Moving
TLDR
It is found that, using the same number of training images, features learnt using egomotion as supervision compare favourably to features learnt with class labels as supervision on the tasks of scene recognition, object recognition, visual odometry and keypoint matching.
Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles
TLDR
A novel unsupervised learning approach is proposed to build features suitable for object detection and classification; to facilitate the transfer of features to other tasks, the context-free network (CFN), a siamese-ennead convolutional neural network, is introduced.
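The jigsaw pretext cuts an image into a 3×3 grid of tiles, shuffles them with one of a fixed set of permutations, and asks the network to predict which permutation was applied. A data-side sketch with a tiny demo permutation set (the paper selects a larger set by maximal Hamming distance):

# Cut, shuffle, and label one training example for the jigsaw task.
import numpy as np

PERMUTATIONS = [  # assumed small demo set; the index is the class label
    (0, 1, 2, 3, 4, 5, 6, 7, 8),
    (8, 7, 6, 5, 4, 3, 2, 1, 0),
    (2, 0, 1, 5, 3, 4, 8, 6, 7),
]

def jigsaw_example(image, rng):
    """image: (H, W, C) with H, W divisible by 3. Returns (tiles, label)."""
    H, W, _ = image.shape
    h, w = H // 3, W // 3
    tiles = [image[i*h:(i+1)*h, j*w:(j+1)*w] for i in range(3) for j in range(3)]
    label = rng.integers(len(PERMUTATIONS))
    shuffled = [tiles[k] for k in PERMUTATIONS[label]]
    return np.stack(shuffled), label

tiles, label = jigsaw_example(np.zeros((225, 225, 3)), np.random.default_rng(0))
print(tiles.shape, label)  # (9, 75, 75, 3) and the permutation index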
Predicting Deep Zero-Shot Convolutional Neural Networks Using Textual Descriptions
TLDR
A new model is presented that can classify unseen categories from their textual description; it takes advantage of the architecture of CNNs and learns features at different layers, rather than just learning an embedding space for both modalities, as is common with existing approaches.
…