Wav2CLIP: Learning Robust Audio Representations from CLIP

@article{Wu2021Wav2CLIPLR,
  title={Wav2CLIP: Learning Robust Audio Representations from CLIP},
  author={Ho-Hsiang Wu and Prem Seetharaman and Kundan Kumar and Juan Pablo Bello},
  journal={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021},
  pages={4563-4567}
}
  • Ho-Hsiang Wu, Prem Seetharaman, Kundan Kumar, Juan Pablo Bello
  • Published 21 October 2021
  • Computer Science
  • ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
We propose Wav2CLIP, a robust audio representation learning method by distilling from Contrastive Language-Image Pre-training (CLIP). We systematically evaluate Wav2CLIP on a variety of audio tasks including classification, retrieval, and generation, and show that Wav2CLIP can outperform several publicly available pre-trained audio representation algorithms. Wav2CLIP projects audio into a shared embedding space with images and text, which enables multimodal applications such as zero-shot… 
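
As a concrete illustration of the distillation idea in the abstract, the sketch below trains a small audio encoder to match frozen CLIP image embeddings of the corresponding video frames with a symmetric contrastive loss. It is a minimal sketch only: the encoder architecture, embedding size, temperature, and random stand-in data are placeholders, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy audio encoder: spectrogram-like input -> embedding the size of CLIP's (assumed 512-d).
class AudioEncoder(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, spec):                    # spec: (batch, 1, mel_bins, frames)
        return F.normalize(self.net(spec), dim=-1)

def symmetric_contrastive_loss(audio_emb, image_emb, temperature=0.07):
    # Cross-entropy in both directions over the audio-image similarity matrix.
    logits = audio_emb @ image_emb.t() / temperature             # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

audio_encoder = AudioEncoder()
optimizer = torch.optim.Adam(audio_encoder.parameters(), lr=1e-4)

spec = torch.randn(8, 1, 64, 256)               # batch of audio clips (random stand-in data)
with torch.no_grad():                           # frozen CLIP image embeddings of the matching frames
    image_emb = F.normalize(torch.randn(8, 512), dim=-1)        # stand-in for CLIP outputs

optimizer.zero_grad()
loss = symmetric_contrastive_loss(audio_encoder(spec), image_emb)
loss.backward()
optimizer.step()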

WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

This work introduces WavCaps, the first large-scale weakly-labelled audio captioning dataset, and proposes a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically.

New Audio Representations Image Gan Generation from BriVL

Experimental results show that this robust audio representation learning method, WavBriVL, can effectively generate appropriate images from audio, exploring a new way of image generation, that is, using audio to generate pictures.

Language-Guided Audio-Visual Source Separation via Trimodal Consistency

Off-the-shelf vision-language foundation models are adapted to provide pseudo-target supervision via two novel loss functions and encourage a stronger alignment between the audio, visual and natural language modalities to overcome a key challenge in this task.

Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation

A pipeline of contrastive language-audio pretraining to develop an audio representation by combining audio data with natural language descriptions is proposed, and it is demonstrated that the model achieves superior performance in the text-to-audio retrieval task.

I Hear Your True Colors: Image Guided Audio Generation

  • Roy Sheffer, Yossi Adi
  • Computer Science
    ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2023
Im2Wav, an image-guided open-domain audio generation system based on two Transformer language models that operate over a hierarchical discrete audio representation obtained from a VQ-VAE-based model, significantly outperforms the evaluated baselines on both fidelity and relevance evaluation metrics.
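
As a loose sketch of image-guided generation over discrete audio tokens, the toy model below conditions a small autoregressive Transformer on an image embedding used as a prefix and samples a sequence of codebook indices. The vocabulary size, start-token convention, single-level code stream, and conditioning scheme are assumptions for illustration, not Im2Wav's actual two-model hierarchical architecture.

import torch
import torch.nn as nn

class TokenLM(nn.Module):
    # Tiny causal Transformer over a discrete audio-token vocabulary, conditioned on an image embedding.
    def __init__(self, vocab_size=1024, d_model=256, image_dim=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.cond_proj = nn.Linear(image_dim, d_model)            # image embedding becomes a prefix token
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, image_emb):
        prefix = self.cond_proj(image_emb).unsqueeze(1)           # (B, 1, d_model)
        x = torch.cat([prefix, self.token_emb(tokens)], dim=1)    # prepend the conditioning token
        causal = torch.triu(torch.full((x.size(1), x.size(1)), float("-inf")), diagonal=1)
        h = self.transformer(x, mask=causal)
        return self.head(h)                   # logits at each position predict the next audio token

@torch.no_grad()
def generate(model, image_emb, length=32):
    tokens = torch.zeros(1, 1, dtype=torch.long)                  # assumed start-token id 0
    for _ in range(length):
        next_logits = model(tokens, image_emb)[:, -1]             # prediction for the next code
        next_tok = torch.multinomial(next_logits.softmax(dim=-1), 1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens       # discrete codes; a VQ-VAE decoder would turn these into a waveform

codes = generate(TokenLM(), image_emb=torch.randn(1, 512))
print(codes.shape)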

Pengi: An Audio Language Model for Audio Tasks

Pengi is introduced, a novel Audio Language Model that leverages Transfer Learning by framing all audio tasks as text-generation tasks, and shows that connecting language models with audio models is a major step towards general-purpose audio understanding.

CLAP: Learning Audio Concepts From Natural Language Supervision

Contrastive Language-Audio Pretraining (CLAP) learns to connect language and audio by using two encoders and contrastive learning to bring audio and text descriptions into a joint multimodal space, and it establishes state-of-the-art zero-shot performance.
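
As a rough illustration of how such a joint audio-text space enables zero-shot classification, the sketch below scores an audio embedding against text embeddings of candidate class names and picks the closest one. The embeddings and the prompt convention mentioned in the comments are stand-ins, not CLAP's actual models or API.

import torch
import torch.nn.functional as F

def zero_shot_classify(audio_emb, text_embs, class_names, temperature=0.07):
    # Pick the class whose text embedding is most similar to the audio embedding.
    audio_emb = F.normalize(audio_emb, dim=-1)                    # (1, D)
    text_embs = F.normalize(text_embs, dim=-1)                    # (C, D)
    probs = (audio_emb @ text_embs.t() / temperature).softmax(dim=-1).squeeze(0)
    return class_names[int(probs.argmax())], probs

# Stand-in embeddings: in practice these come from the audio and text encoders,
# with each class name wrapped in a prompt such as "This is a sound of {class}." (assumed template).
class_names = ["dog bark", "siren", "rain"]
text_embs = torch.randn(len(class_names), 512)
audio_emb = torch.randn(1, 512)

label, probs = zero_shot_classify(audio_emb, text_embs, class_names)
print(label, probs.tolist())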

BYOL for Audio: Exploring Pre-Trained General-Purpose Audio Representations

This study hypothesizes that representations effective for general audio tasks should provide multiple aspects of robust features of the input sound and proposes a self-supervised learning method, Bootstrap Your Own Latent for Audio (BYOL-A, pronounced “viola”), which makes the learned representations robust to the perturbations of sounds.
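
A condensed sketch of the BYOL-style bootstrapping referred to above: an online encoder plus predictor is trained to match a slowly moving target encoder on two perturbed views of the same clip, and the target is updated by an exponential moving average. The encoder, the augmentation, and all sizes are placeholder assumptions; BYOL-A itself composes audio-specific augmentations such as mixup and random resize cropping of log-mel spectrograms.

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 96, 512), nn.ReLU(), nn.Linear(512, 256))
predictor = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
target = copy.deepcopy(encoder)                 # target network, updated only by EMA
for p in target.parameters():
    p.requires_grad_(False)

def byol_loss(pred, targ):
    # Negative cosine similarity between online prediction and target projection.
    return 2 - 2 * F.cosine_similarity(pred, targ, dim=-1).mean()

def augment(x):
    # Placeholder perturbation standing in for BYOL-A's audio augmentations.
    return x + 0.1 * torch.randn_like(x)

opt = torch.optim.Adam(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4)

x = torch.randn(16, 1, 64, 96)                  # batch of log-mel patches (random stand-in data)
v1, v2 = augment(x), augment(x)                 # two views of the same clips
with torch.no_grad():
    t1, t2 = target(v1), target(v2)

opt.zero_grad()
loss = byol_loss(predictor(encoder(v1)), t2) + byol_loss(predictor(encoder(v2)), t1)
loss.backward()
opt.step()

with torch.no_grad():                           # exponential moving average update of the target
    for p_t, p_o in zip(target.parameters(), encoder.parameters()):
        p_t.mul_(0.99).add_(0.01 * p_o)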

BLAT: Bootstrapping Language-Audio Pre-training based on AudioSet Tag-guided Synthetic Data

Evaluation on a series of downstream tasks indicates that BLAT achieves SOTA zero-shot classification performance on most datasets and significant performance improvement when fine-tuned on downstream tasks, suggesting the effectiveness of the synthetic data.

Unsupervised Improvement of Audio-Text Cross-Modal Representations

This paper explores domain-unspecific and domain-specific curation methods to create audio-text pairs, obtaining significant improvements in zero-shot classification performance on downstream sound event classification and acoustic scene classification tasks.
...

AudioCLIP: Extending CLIP to Image, Text and Audio

An extension of the CLIP model that handles audio in addition to text and images; it achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task, outperforming others by reaching accuracies of 97.15% on ESC-50 and 90.07% on UrbanSound8K.

VGGSound: A Large-Scale Audio-Visual Dataset

The goal is to collect a large-scale audio-visual dataset with low label noise from videos ‘in the wild’ using computer vision techniques; various Convolutional Neural Network architectures and aggregation approaches are investigated to establish audio recognition baselines for this new dataset.

wav2vec: Unsupervised Pre-training for Speech Recognition

wav2vec is trained on large amounts of unlabeled audio data, and the resulting representations are then used to improve acoustic model training; it outperforms Deep Speech 2, the best reported character-based system in the literature, while using two orders of magnitude less labeled training data.

Contrastive Learning of Musical Representations

SimCLR is introduced to the music domain, and a large chain of audio data augmentations is contributed to form a simple framework for self-supervised, contrastive learning of musical representations: CLMR, which works on raw time-domain music data and requires no labels to learn useful representations.

PSLA: Improving Audio Tagging With Pretraining, Sampling, Labeling, and Aggregation

PSLA is presented, a collection of model-agnostic training techniques that can noticeably boost model accuracy, including ImageNet pretraining, balanced sampling, data augmentation, label enhancement, and model aggregation.

Multi-Task Self-Supervised Pre-Training for Music Classification

This paper applies self-supervised and multi-task learning methods for pre-training music encoders, and explores various design choices including encoder architectures, weighting mechanisms to combine losses from multiple tasks, and worker selections of pretext tasks to investigate how these design choices interact with various downstream music classification tasks.

UNITER: UNiversal Image-TExt Representation Learning

UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over four image-text datasets, is introduced; it can power heterogeneous downstream V+L tasks with joint multimodal embeddings.

A Simple Framework for Contrastive Learning of Visual Representations

It is shown that composition of data augmentations plays a critical role in defining effective predictive tasks, that introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and that contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.
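
The sketch below spells out that recipe: two augmented views per example, a learnable nonlinear projection head between the representation and the loss, and the NT-Xent contrastive objective over the batch. The backbone, augmentation, and sizes are placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512), nn.ReLU())       # stand-in backbone
projection_head = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))

def nt_xent(z1, z2, temperature=0.5):
    # Normalized temperature-scaled cross-entropy over 2N views; the positive of view i is view i+N.
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=-1)                      # (2N, D)
    sim = z @ z.t() / temperature                                     # (2N, 2N)
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float("-inf"))   # drop self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])    # index of each view's positive
    return F.cross_entropy(sim, targets)

def augment(x):
    # Placeholder; SimCLR composes random crop, color distortion, blur, and similar transforms.
    return x + 0.1 * torch.randn_like(x)

x = torch.randn(32, 3, 32, 32)                  # batch of images (random stand-in data)
z1 = projection_head(encoder(augment(x)))
z2 = projection_head(encoder(augment(x)))
loss = nt_xent(z1, z2)
loss.backward()
print(float(loss))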

Look, Listen and Learn

There is a valuable, but so far untapped, source of information contained in the video itself: the correspondence between the visual and the audio streams. A novel “Audio-Visual Correspondence” learning task is introduced that makes use of this.
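
A minimal sketch of that Audio-Visual Correspondence pretext task follows: a vision subnetwork and an audio subnetwork are trained to classify whether a frame and an audio clip come from the same video, with mismatched pairs serving as negatives. The subnetworks and input shapes are placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

vision_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128), nn.ReLU())
audio_net = nn.Sequential(nn.Flatten(), nn.Linear(1 * 64 * 100, 128), nn.ReLU())
fusion = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 2))   # correspond / mismatch

frames = torch.randn(8, 3, 64, 64)              # video frames (random stand-in data)
specs = torch.randn(8, 1, 64, 100)              # log-mel spectrograms of roughly 1 s of audio

# Positives are matching (frame, audio) pairs; negatives pair each frame with another clip's audio.
mismatched = torch.roll(specs, shifts=1, dims=0)
pairs = torch.cat([torch.cat([vision_net(frames), audio_net(specs)], dim=1),
                   torch.cat([vision_net(frames), audio_net(mismatched)], dim=1)])
labels = torch.cat([torch.ones(8, dtype=torch.long), torch.zeros(8, dtype=torch.long)])

loss = F.cross_entropy(fusion(pairs), labels)
loss.backward()
print(float(loss))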

Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings

This paper investigates how L3-Net design choices impact the performance of downstream audio classifiers trained with these embeddings, and shows that audio-informed choices of input representation are important, and that using sufficient data for training the embedding is key.