Wav2CLIP: Learning Robust Audio Representations from CLIP
@article{Wu2021Wav2CLIPLR,
  title={Wav2CLIP: Learning Robust Audio Representations from CLIP},
  author={Ho-Hsiang Wu and Prem Seetharaman and Kundan Kumar and Juan Pablo Bello},
  journal={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2022},
  pages={4563-4567}
}
We propose Wav2CLIP, a robust audio representation learning method obtained by distilling from Contrastive Language-Image Pre-training (CLIP). We systematically evaluate Wav2CLIP on a variety of audio tasks including classification, retrieval, and generation, and show that Wav2CLIP can outperform several publicly available pre-trained audio representation algorithms. Wav2CLIP projects audio into a shared embedding space with images and text, which enables multimodal applications such as zero-shot…
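To make the distillation idea concrete, here is a minimal sketch of training an audio encoder against a frozen CLIP image tower with a CLIP-style symmetric contrastive loss. The encoder, teacher, and data names are hypothetical placeholders, not the authors' released code.

```python
# Minimal sketch of CLIP-distillation for audio (PyTorch).
# Assumptions: `audio_encoder` and the paired frame/audio batches are
# hypothetical placeholders; only the overall setup follows the paper's
# description (frozen CLIP image tower as teacher).
import torch
import torch.nn.functional as F

def clip_style_loss(audio_emb, image_emb, temperature=0.07):
    """Symmetric cross-entropy over the audio-image similarity matrix."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = audio_emb @ image_emb.t() / temperature   # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def train_step(audio_encoder, clip_image_tower, audio, frames, optimizer):
    with torch.no_grad():                  # teacher stays frozen
        image_emb = clip_image_tower(frames)
    audio_emb = audio_encoder(audio)       # student maps audio to CLIP space
    loss = clip_style_loss(audio_emb, image_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```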
71 Citations
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research
- Computer Science · ArXiv
- 2023
This work introduces WavCaps, the first large-scale weakly-labelled audio captioning dataset, and proposes a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically.
New Audio Representations Image Gan Generation from BriVL
- Computer Science · ArXiv
- 2023
Experimental results show that WavBriVL, a robust audio representation learning method, can effectively generate appropriate images from audio, exploring a new way of image generation: using audio to generate pictures.
Language-Guided Audio-Visual Source Separation via Trimodal Consistency
- Computer Science
- 2023
Off-the-shelf vision-language foundation models are adapted to provide pseudo-target supervision via two novel loss functions that encourage stronger alignment between the audio, visual, and natural language modalities, overcoming a key challenge in this task.
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
- Computer Science · ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2023
A pipeline of contrastive language-audio pretraining is proposed to develop an audio representation by combining audio data with natural language descriptions, and the model is demonstrated to achieve superior performance on the text-to-audio retrieval task.
I Hear Your True Colors: Image Guided Audio Generation
- Computer Science · ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2023
Im2Wav, an image-guided open-domain audio generation system based on two Transformer language models that operate over a hierarchical discrete audio representation obtained from a VQ-VAE-based model, significantly outperforms the evaluated baselines on both fidelity and relevance evaluation metrics.
Pengi: An Audio Language Model for Audio Tasks
- Computer Science · ArXiv
- 2023
Pengi is introduced, a novel Audio Language Model that leverages Transfer Learning by framing all audio tasks as text-generation tasks, and shows that connecting language models with audio models is a major step towards general-purpose audio understanding.
CLAP: Learning Audio Concepts From Natural Language Supervision
- Computer Science · ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2023
Contrastive Language-Audio Pretraining (CLAP) learns to connect language and audio by using two encoders and contrastive learning to bring audio and text descriptions into a joint multimodal space, establishing state-of-the-art zero-shot performance.
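As a rough illustration of how such a joint audio-text space enables the zero-shot classification these CLAP-style entries describe, the sketch below embeds class labels as text prompts and picks the nearest one by cosine similarity. The prompt template and encoder names are assumptions, not the papers' code.

```python
# Hypothetical sketch of CLAP-style zero-shot audio classification:
# class labels are embedded as text prompts, and the prediction is the
# label whose text embedding is closest to the audio embedding.
# `audio_encoder` and `text_encoder` are placeholder trained encoders.
import torch
import torch.nn.functional as F

def zero_shot_classify(audio, class_names, audio_encoder, text_encoder):
    prompts = [f"This is a sound of {c}." for c in class_names]
    with torch.no_grad():
        a = F.normalize(audio_encoder(audio), dim=-1)   # (1, D)
        t = F.normalize(text_encoder(prompts), dim=-1)  # (C, D)
    scores = (a @ t.t()).squeeze(0)                     # (C,) similarities
    return class_names[int(scores.argmax())]
```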
BYOL for Audio: Exploring Pre-Trained General-Purpose Audio Representations
- Computer Science · IEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2023
This study hypothesizes that representations effective for general audio tasks should provide multiple aspects of robust features of the input sound, and proposes a self-supervised learning method, Bootstrap Your Own Latent for Audio (BYOL-A, pronounced “viola”), that makes the learned representations robust to perturbations of the sounds.
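For orientation, here is a minimal sketch of the BYOL-style objective that BYOL-A builds on: an online network predicts the output of a slowly-moving target network for a differently-augmented view of the same input. This is an assumed generic setup, not the paper's code.

```python
# Rough sketch of a BYOL-style objective (PyTorch). The target network
# starts as a frozen copy of the online network and is updated only by
# an exponential moving average of the online weights.
import copy
import torch
import torch.nn.functional as F

def byol_loss(p_online, z_target):
    """Negative cosine similarity between online prediction and target projection."""
    p = F.normalize(p_online, dim=-1)
    z = F.normalize(z_target, dim=-1)
    return 2 - 2 * (p * z).sum(dim=-1).mean()

def make_target(online_net):
    """Target starts as a frozen copy of the online network."""
    target = copy.deepcopy(online_net)
    for p in target.parameters():
        p.requires_grad_(False)
    return target

@torch.no_grad()
def ema_update(target_net, online_net, tau=0.99):
    """Target weights drift slowly toward the online weights."""
    for t, o in zip(target_net.parameters(), online_net.parameters()):
        t.mul_(tau).add_(o, alpha=1 - tau)
```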
BLAT: Bootstrapping Language-Audio Pre-training based on AudioSet Tag-guided Synthetic Data
- Computer Science · ArXiv
- 2023
Evaluation on a series of downstream tasks indicates that BLAT achieves state-of-the-art zero-shot classification performance on most datasets and significant performance improvement when fine-tuned, suggesting the effectiveness of the synthetic data.
Unsupervised Improvement of Audio-Text Cross-Modal Representations
- Computer Science · ArXiv
- 2023
This paper explores domain-unspecific and domain-specific curation methods to create audio-text pairs that yield significant improvements in zero-shot classification performance on downstream sound event and acoustic scene classification tasks.
35 References
Audioclip: Extending Clip to Image, Text and Audio
- Computer Science · ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2022
An extension of the CLIP model that handles audio in addition to text and images, achieving new state-of-the-art results on the Environmental Sound Classification (ESC) task and outperforming others with accuracies of 97.15% on ESC-50 and 90.07% on UrbanSound8K.
Vggsound: A Large-Scale Audio-Visual Dataset
- Computer Science · ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
The goal is to collect a large-scale audio-visual dataset with low label noise from videos ‘in the wild’ using computer vision techniques; various Convolutional Neural Network architectures and aggregation approaches are investigated to establish audio recognition baselines for this new dataset.
wav2vec: Unsupervised Pre-training for Speech Recognition
- Computer Science · INTERSPEECH
- 2019
Wav2vec is trained on large amounts of unlabeled audio data, and the resulting representations are then used to improve acoustic model training; it outperforms Deep Speech 2, the best reported character-based system in the literature, while using two orders of magnitude less labeled training data.
Contrastive Learning of Musical Representations
- Computer Science · ISMIR
- 2021
CLMR, a simple framework for self-supervised, contrastive learning of musical representations, is introduced to the music domain together with a large chain of audio data augmentations; it works on raw time-domain music data and requires no labels to learn useful representations.
PSLA: Improving Audio Tagging With Pretraining, Sampling, Labeling, and Aggregation
- Computer Science · IEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2021
PSLA is presented, a collection of model-agnostic training techniques that can noticeably boost model accuracy, including ImageNet pretraining, balanced sampling, data augmentation, label enhancement, and model aggregation.
Multi-Task Self-Supervised Pre-Training for Music Classification
- Computer Science · ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2021
This paper applies self-supervised and multi-task learning methods for pre-training music encoders, and explores various design choices including encoder architectures, weighting mechanisms to combine losses from multiple tasks, and worker selections of pretext tasks to investigate how these design choices interact with various downstream music classification tasks.
UNITER: UNiversal Image-TExt Representation Learning
- Computer Science · ECCV
- 2020
UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets is introduced, which can power heterogeneous downstream V+L tasks with joint multimodal embeddings.
A Simple Framework for Contrastive Learning of Visual Representations
- Computer Science · ICML
- 2020
It is shown that composition of data augmentations plays a critical role in defining effective predictive tasks, that introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and that contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.
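The contrastive objective behind this framework (SimCLR) is the NT-Xent loss; a minimal sketch follows, with assumed shapes rather than the paper's reference code.

```python
# Minimal NT-Xent (SimCLR) loss sketch in PyTorch: two augmented views
# per example, positives are the paired views, and all other samples in
# the batch act as negatives.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """z1, z2: (B, D) projections of two augmented views of the same batch."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)  # (2B, D)
    sim = z @ z.t() / temperature                        # (2B, 2B) similarities
    n = z1.size(0)
    sim.fill_diagonal_(float("-inf"))                    # mask self-similarity
    # The positive for sample i is its other view: i+n (mod 2n).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets.to(z.device))
```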
Look, Listen and Learn
- Computer Science · 2017 IEEE International Conference on Computer Vision (ICCV)
- 2017
There is a valuable, but so far untapped, source of information contained in the video itself: the correspondence between the visual and the audio streams. A novel “Audio-Visual Correspondence” learning task is introduced to make use of this.
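As a hedged sketch of that Audio-Visual Correspondence (AVC) pretext task: a binary classifier decides whether a frame and an audio clip come from the same video, with negatives formed by mismatching pairs within the batch. Encoder and fusion-head names are hypothetical placeholders.

```python
# Sketch of an AVC training step (assumed setup, not the paper's code):
# positives pair a frame with audio from the same video; negatives pair
# it with audio rolled from another video in the batch.
import torch
import torch.nn.functional as F

def avc_step(frame_enc, audio_enc, fusion_head, frames, audio, optimizer):
    b = frames.size(0)
    neg_audio = torch.roll(audio, shifts=1, dims=0)  # mismatched audio
    v = frame_enc(torch.cat([frames, frames]))       # (2B, D) visual features
    a = audio_enc(torch.cat([audio, neg_audio]))     # (2B, D) audio features
    logits = fusion_head(torch.cat([v, a], dim=-1)).squeeze(-1)
    labels = torch.cat([torch.ones(b), torch.zeros(b)]).to(logits.device)
    loss = F.binary_cross_entropy_with_logits(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```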
Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings
- Computer Science · ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
This paper investigates how L3-Net design choices impact the performance of downstream audio classifiers trained with these embeddings, and shows that audio-informed choices of input representation are important, and that using sufficient data for training the embedding is key.