• Corpus ID: 239616434

Wav2CLIP: Learning Robust Audio Representations From CLIP

Ho-Hsiang Wu, Prem Seetharaman, Kundan Kumar, Juan Pablo Bello
We propose Wav2CLIP, a robust audio representation learning method that distills knowledge from Contrastive Language-Image Pre-training (CLIP). We systematically evaluate Wav2CLIP on a variety of audio tasks including classification, retrieval, and generation, and show that Wav2CLIP can outperform several publicly available pretrained audio representation algorithms. Wav2CLIP projects audio into a shared embedding space with images and text, which enables multimodal applications such as zero-shot…
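The shared embedding space described above is what makes zero-shot classification possible: an audio clip is labeled by the text prompt whose embedding lies closest to it. The sketch below illustrates that matching step only; the embeddings are random placeholders standing in for the actual Wav2CLIP audio encoder and CLIP text encoder, and the class names are illustrative.

```python
# Minimal sketch of zero-shot classification in a shared audio/text
# embedding space, as enabled by Wav2CLIP. The embeddings here are
# hypothetical stand-ins: a real pipeline would produce them with the
# Wav2CLIP audio encoder and the CLIP text encoder.
import numpy as np

def cosine_sim(a, b):
    """Row-wise cosine similarity between two sets of vectors."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

rng = np.random.default_rng(0)
EMB_DIM = 512  # CLIP's shared embedding dimensionality

# Hypothetical class-prompt embeddings (would come from the text encoder).
class_names = ["dog bark", "siren", "rain"]
text_emb = rng.normal(size=(len(class_names), EMB_DIM))

# Hypothetical audio-clip embedding, simulated here as lying close to
# the "siren" prompt.
audio_emb = text_emb[1] + 0.1 * rng.normal(size=EMB_DIM)

# Zero-shot prediction: nearest class prompt by cosine similarity.
scores = cosine_sim(audio_emb[None, :], text_emb)[0]
pred = class_names[int(np.argmax(scores))]
print(pred)  # "siren"
```

Because classification reduces to nearest-neighbor search over prompt embeddings, new classes can be added at inference time simply by embedding new text prompts, with no retraining.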


Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer
This work proposes vip-AnT, which induces audio-text alignment without using any parallel audio-text data, demonstrates state-of-the-art zero-shot performance on the ESC50 and US8K audio classification tasks, and even surpasses the supervised state of the art for Clotho caption retrieval.
Music2Video: Automatic Generation of Music Video with fusion of audio and text
  • Joel Jang, Sumin Shin, Yoonjeon Kim
  • Computer Science, Engineering
  • 2022
The proposed framework for generating music videos shows promising results at the application level, where users can interactively feed in a music source and a text source to create artistic music videos.


AudioCLIP: Extending CLIP to Image, Text and Audio
The proposed model incorporates the ESResNeXt audio-model into the CLIP framework using the AudioSet dataset, which enables the proposed model to perform bimodal and unimodal classification and querying, while keeping CLIP’s ability to generalize to unseen datasets in a zero-shot inference fashion.
Vggsound: A Large-Scale Audio-Visual Dataset
The goal is to collect a large-scale audio-visual dataset with low label noise from videos ‘in the wild’ using computer vision techniques; various Convolutional Neural Network architectures and aggregation approaches are investigated to establish audio recognition baselines for this new dataset.
wav2vec: Unsupervised Pre-training for Speech Recognition
Wav2vec is trained on large amounts of unlabeled audio data, and the resulting representations are then used to improve acoustic model training; it outperforms Deep Speech 2, the best reported character-based system in the literature, while using two orders of magnitude less labeled training data.
Contrastive Learning of Musical Representations
This work introduces SimCLR to the music domain and contributes a large chain of audio data augmentations, forming a simple framework for self-supervised learning on raw waveforms of music: CLMR. Its representations are shown to be transferable to out-of-domain datasets, indicating that they capture important musical knowledge.
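SimCLR-style frameworks such as CLMR train by pulling embeddings of two augmented views of the same clip together while pushing apart all other clips in the batch, via the NT-Xent (normalized temperature-scaled cross entropy) loss. A minimal NumPy sketch of that objective, with toy random embeddings in place of real encoder outputs:

```python
# Minimal sketch of the NT-Xent contrastive objective used by
# SimCLR-style frameworks such as CLMR. Inputs z1, z2 are embeddings
# of two augmented views of the same batch of clips; matching rows
# are positives, all other rows act as negatives.
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """Normalized temperature-scaled cross-entropy over a batch of view pairs."""
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit-normalize
    sim = z @ z.T / tau                               # pairwise cosine / temperature
    n = len(z1)
    np.fill_diagonal(sim, -np.inf)                    # a view is not its own negative
    # The positive for row i is its counterpart view at index i+n (mod 2n).
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logits = sim - sim.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), pos].mean())
```

Perfectly aligned view pairs yield a lower loss than unrelated pairs, which is exactly the pressure that makes the encoder invariant to the augmentation chain.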
Multi-Task Self-Supervised Pre-Training for Music Classification
  • Ho-Hsiang Wu, Chieh-Chi Kao, +4 authors Chao Wang
  • Computer Science, Engineering
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
This paper applies self-supervised and multi-task learning methods for pre-training music encoders, and explores various design choices including encoder architectures, weighting mechanisms to combine losses from multiple tasks, and worker selections of pretext tasks to investigate how these design choices interact with various downstream music classification tasks.
Look, Listen and Learn
This work identifies a valuable, but so far untapped, source of information contained in the video itself, the correspondence between the visual and the audio streams, and introduces a novel “Audio-Visual Correspondence” learning task that makes use of it.
Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings
This paper investigates how L3-Net design choices impact the performance of downstream audio classifiers trained with these embeddings, and shows that audio-informed choices of input representation are important, and that using sufficient data for training the embedding is key.
SoundNet: Learning Sound Representations from Unlabeled Video
This work proposes a student-teacher training procedure which transfers discriminative visual knowledge from well established visual recognition models into the sound modality using unlabeled video as a bridge, and suggests some high-level semantics automatically emerge in the sound network, even though it is trained without ground truth labels.
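The student-teacher transfer described for SoundNet (and underlying distillation approaches like Wav2CLIP's) trains the audio "student" to match the output distribution of a visual "teacher" on the same video, so no sound labels are needed. A minimal sketch of the KL-divergence transfer objective, with toy logits standing in for the two networks' outputs:

```python
# Minimal sketch of a student-teacher distillation objective of the kind
# SoundNet uses to transfer visual knowledge into an audio network.
# The logit arrays below are toy placeholders, not real network outputs.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_distillation_loss(teacher_logits, student_logits):
    """Mean KL(teacher || student): the student is pushed toward the
    teacher's class posterior on each (video, audio) pair."""
    p = softmax(teacher_logits)               # visual teacher posterior
    log_q = np.log(softmax(student_logits))   # audio student log-posterior
    return float(np.sum(p * (np.log(p) - log_q), axis=-1).mean())

teacher = np.array([[4.0, 0.5, 0.1]])      # confident visual prediction
matched = np.array([[3.5, 0.4, 0.2]])      # student roughly agrees: low loss
mismatched = np.array([[0.1, 0.2, 4.0]])   # student disagrees: high loss
```

Minimizing this loss over unlabeled videos lets the sound network inherit the semantics of a visual recognition model, which is why high-level concepts can emerge in the audio network without any ground-truth sound labels.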
Multi-Task Self-Supervised Learning for Robust Speech Recognition
PASE+ is proposed, an improved version of PASE that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks and learns transferable representations suitable for highly mismatched acoustic conditions.
Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds
This work presents AudioScope, a novel audio-visual sound separation framework that can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos, using a dataset of video clips extracted from open-domain YFCC100m video data.