AudioCLIP: Extending CLIP to Image, Text and Audio

@inproceedings{Guzhov2022AudioCLIPEC,
  title={AudioCLIP: Extending CLIP to Image, Text and Audio},
  author={Andrey Guzhov and Federico Raue and J{\"o}rn Hees and Andreas R. Dengel},
  booktitle={ICASSP},
  year={2022}
}
In the past, the rapidly evolving field of sound classification greatly benefited from the application of methods from other domains. Today, we observe a trend of fusing domain-specific tasks and approaches, which provides the community with new outstanding models. In this work, we present an extension of the CLIP model that handles audio in addition to text and images. Our proposed model incorporates the ESResNeXt audio model into the CLIP framework using the AudioSet dataset. Such a…
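
As a companion to the abstract above, here is a minimal sketch (not the authors' released code) of the tri-modal contrastive objective it describes: image, text, and audio embeddings are projected into one joint space and aligned with CLIP-style symmetric cross-entropy over all three modality pairs. The small linear projection heads below are placeholders standing in for CLIP's image/text towers and the ESResNeXt audio encoder; the dimensions and temperature initialization are illustrative assumptions.

import torch
import torch.nn.functional as F

def clip_pair_loss(a, b, logit_scale):
    """Symmetric InfoNCE loss between two batches of L2-normalized embeddings."""
    logits = logit_scale.exp() * a @ b.t()              # (B, B) cosine-similarity logits
    targets = torch.arange(a.size(0), device=a.device)  # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

class TriModalCLIP(torch.nn.Module):
    def __init__(self, dim_img=512, dim_txt=512, dim_aud=1024, dim_joint=256):
        super().__init__()
        # Placeholder projection heads; the real model uses full image/text/audio encoders.
        self.img_proj = torch.nn.Linear(dim_img, dim_joint)
        self.txt_proj = torch.nn.Linear(dim_txt, dim_joint)
        self.aud_proj = torch.nn.Linear(dim_aud, dim_joint)
        self.logit_scale = torch.nn.Parameter(torch.log(torch.tensor(1.0 / 0.07)))

    def forward(self, img_feat, txt_feat, aud_feat):
        zi = F.normalize(self.img_proj(img_feat), dim=-1)
        zt = F.normalize(self.txt_proj(txt_feat), dim=-1)
        za = F.normalize(self.aud_proj(aud_feat), dim=-1)
        # Sum of the three pairwise contrastive terms: image-text, image-audio, audio-text.
        return (clip_pair_loss(zi, zt, self.logit_scale)
                + clip_pair_loss(zi, za, self.logit_scale)
                + clip_pair_loss(za, zt, self.logit_scale))

if __name__ == "__main__":
    model = TriModalCLIP()
    loss = model(torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 1024))
    loss.backward()
    print(float(loss))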

Citations

Wav2CLIP: Learning Robust Audio Representations From CLIP
TLDR
Wav2CLIP, a robust audio representation learning method that distills knowledge from Contrastive Language-Image Pre-training (CLIP), is proposed; it is more efficient to pretrain than competing methods because it does not require learning a visual model in concert with an auditory model.
Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer
TLDR
This work proposes VIP-ANT, which induces Audio-Text alignment without using any parallel audio-text data, shows state-of-the-art zero-shot performance on the ESC50 and US8K audio classification tasks (a minimal zero-shot classification sketch follows this list), and even surpasses the supervised state of the art for Clotho caption retrieval by 2.2% R@1.
Interactive Audio-text Representation for Automated Audio Captioning with Contrastive Learning
TLDR
This work proposes CLIP-AAC, a novel automated audio captioning (AAC) system that learns an interactive cross-modality representation from both acoustic and textual information, and indicates that both the pre-trained model and contrastive learning contribute to the performance gain of the AAC model.
CLAP: Learning Audio Concepts From Natural Language Supervision
TLDR
Contrastive Language-Audio Pretraining (CLAP) learns to connect language and audio by using two encoders and contrastive learning to bring audio and text descriptions into a joint multimodal space, and generalizes to multiple downstream tasks.
Sound-Guided Semantic Video Generation
TLDR
This paper proposes a framework that generates realistic videos by leveraging a multimodal (sound-image-text) embedding space and provides a new high-resolution landscape video dataset (audio-visual pairs) for the sound-guided video generation task.
Sound-Guided Semantic Image Manipulation
TLDR
This work proposes a framework for sound-guided image manipulation that directly encodes sound into the multimodal (image-text) embedding space and manipulates an image within that space using a direct latent optimization method based on aligned embeddings.
Multimodal Knowledge Alignment with Reinforcement Learning
TLDR
This work proposes ESPER, a novel reinforcement learning approach that extends language-only zero-shot models to unseen multimodal tasks, such as image and audio captioning, and demonstrates that it outperforms baselines and prior work on a variety of zero-shot tasks.
Everything at Once - Multi-modal Fusion Transformer for Video Retrieval
TLDR
This work presents a multi-modal, modality-agnostic fusion transformer that learns to exchange information between multiple modalities, such as video, audio, and text, and to integrate them into a joint multi-modal representation, obtaining an embedding that aggregates multi-modal temporal information.
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
TLDR
MERLOT Reserve, a model that represents videos jointly over time through a new training objective that learns from audio, subtitles, and video frames, is introduced; it enables out-of-the-box prediction and reveals strong multimodal commonsense understanding.
MotionCLIP: Exposing Human Motion Generation to CLIP Space
TLDR
It is shown that although CLIP has never seen the motion domain, MotionCLIP offers unprecedented text-to-motion abilities, allowing out-of-domain actions, disentangled editing, and abstract language specification.
...
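
Several of the citing works above (for example VIP-ANT and CLAP) evaluate such joint audio-text spaces through zero-shot classification: class names are wrapped in text prompts and an audio clip is assigned to the most similar prompt embedding. The sketch below illustrates that evaluation pattern only; encode_text is a hypothetical callable standing in for a pretrained text encoder, not the API of any specific library.

from typing import Callable, Sequence
import torch
import torch.nn.functional as F

def zero_shot_classify(audio_emb: torch.Tensor,
                       class_names: Sequence[str],
                       encode_text: Callable[[Sequence[str]], torch.Tensor]) -> str:
    # Wrap each class name in a natural-language prompt, as in CLIP-style evaluation.
    prompts = [f"the sound of a {name}" for name in class_names]
    text_emb = encode_text(prompts)      # (C, D), assumed L2-normalized
    sims = audio_emb @ text_emb.t()      # cosine similarities, shape (1, C)
    return class_names[int(sims.argmax(dim=-1))]

if __name__ == "__main__":
    torch.manual_seed(0)
    # Stub encoder for demonstration only: deterministic random embeddings.
    def fake_encode(texts):
        return F.normalize(torch.randn(len(texts), 64), dim=-1)
    audio = F.normalize(torch.randn(1, 64), dim=-1)
    print(zero_shot_classify(audio, ["dog bark", "rain", "siren"], fake_encode))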

References

Showing 1-10 of 34 references
Multimodal Self-Supervised Learning of General Audio Representations
TLDR
This work demonstrates that its contrastive framework does not require high-resolution images to learn good audio features and is advantageous on a broad range of non-semantic audio tasks, including speaker identification, keyword spotting, language identification, and music instrument classification.
Self-Supervised MultiModal Versatile Networks
TLDR
This work learns representations using self-supervision by leveraging three modalities naturally present in videos (vision, audio, and language) and incorporates a novel deflation process so that the networks can be effortlessly applied to visual data in the form of either video or a static image.
AudioCaps: Generating Captions for Audios in The Wild
TLDR
A large-scale dataset of 46K audio clips paired with human-written captions, collected via crowdsourcing on the AudioSet dataset, is contributed, and two novel components that help improve audio captioning performance are proposed: a top-down multi-scale encoder and aligned semantic attention.
ESResNet: Environmental Sound Classification Based on Visual Domain Models
TLDR
This work presents a model based on simple log-power Short-Time Fourier Transform (STFT) spectrograms that is inherently compatible with mono and stereo sound inputs and outperforms all previously known approaches in a fair comparison (a minimal spectrogram sketch follows the reference list).
ERANNs: Efficient Residual Audio Neural Networks for Audio Pattern Recognition
TLDR
A new convolutional neural network architecture and a method for improving the inference speed of CNN-based systems for audio pattern recognition (APR) tasks are proposed; the resulting systems are named “Efficient Residual Audio Neural Networks”.
AST: Audio Spectrogram Transformer
TLDR
The Audio Spectrogram Transformer (AST) is introduced, the first convolution-free, purely attention-based model for audio classification, which achieves new state-of-the-art results on various audio classification benchmarks.
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
TLDR
The convolution-free VATT outperforms state-of-the-art ConvNet-based architectures on downstream tasks and shows the generalizability of the model despite the domain gap between videos and images.
Audio Set: An ontology and human-labeled dataset for audio events
TLDR
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
MDMMT: Multidomain Multimodal Transformer for Video Retrieval
TLDR
A new state of the art on the text-to-video retrieval task is achieved on the MSRVTT and LSMDC benchmarks, where the model outperforms all previous solutions by a large margin using a single model and without finetuning.
ESResNe(X)t-fbsp: Learning Robust Time-Frequency Transformation of Audio
TLDR
A new time-frequency transformation layer based on complex frequency B-spline (fbsp) wavelets is used with a high-performance audio classification model, providing an accuracy improvement over the previously used Short-Time Fourier Transform (STFT) on standard datasets.
...
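
The ESResNet and ESResNe(X)t-fbsp references above feed time-frequency representations of the raw waveform to visual-domain models; ESResNet in particular uses simple log-power STFT spectrograms. The sketch below shows such a front-end under illustrative settings (the window and hop sizes are assumptions, not the papers' values).

import torch

def log_power_stft(waveform: torch.Tensor, n_fft: int = 1024, hop_length: int = 256) -> torch.Tensor:
    """waveform: (channels, samples), mono or stereo -> (channels, freq_bins, frames) in dB."""
    window = torch.hann_window(n_fft, device=waveform.device)
    spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop_length,
                      window=window, return_complex=True)
    power = spec.abs() ** 2
    return 10.0 * torch.log10(power + 1e-10)  # log-power (dB); small epsilon avoids log(0)

if __name__ == "__main__":
    stereo = torch.randn(2, 44100)  # one second of synthetic 44.1 kHz stereo audio
    print(log_power_stft(stereo).shape)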