Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags

  • Xavier Favory, Konstantinos Drossos, Tuomas Virtanen, Xavier Serra
  • Published 27 October 2020
  • Computer Science
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Self-supervised audio representation learning offers an attractive alternative for obtaining generic audio embeddings that can be employed in various downstream tasks. Published approaches that consider both audio and the words/tags associated with it do not employ text processing models capable of generalizing to tags unseen during training. In this work we propose a method for learning audio representations using an audio autoencoder (AAE), a general word embeddings model (WEM…
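The cross-modal alignment idea in the abstract can be sketched with a contrastive objective over paired audio and tag embeddings. This is an illustrative NT-Xent-style sketch under stated assumptions, not the authors' implementation; the encoder outputs are stood in for by random vectors, and the temperature value is arbitrary:

```python
import numpy as np

def contrastive_alignment_loss(audio_emb, tag_emb, temperature=0.1):
    """NT-Xent-style loss: pull each audio embedding toward its paired
    tag embedding, push it away from the other pairs in the batch."""
    # L2-normalise so dot products become cosine similarities.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = tag_emb / np.linalg.norm(tag_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature  # (batch, batch) similarity matrix
    # Matching pairs sit on the diagonal: score alignment as a
    # batch-wise classification problem via row-wise log-softmax.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 8))                 # toy "audio encoder" outputs
tags = audio + 0.01 * rng.normal(size=(4, 8))   # near-aligned "tag" outputs
loss_aligned = contrastive_alignment_loss(audio, tags)
loss_random = contrastive_alignment_loss(audio, rng.normal(size=(4, 8)))
```

Well-aligned pairs yield a lower loss than unrelated embeddings, which is exactly the signal that drives the two encoders toward a shared space.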


Learning music audio representations via weak language supervision
This work designs a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks and confirms that MuLaP effectively leverages audio-caption pairs to learn representations that are competitive with audio-only and cross-modal self-supervised methods in the literature.
Enriched Music Representations With Multiple Cross-Modal Contrastive Learning
This paper aligns the latent representations obtained from playlist-track interactions, genre metadata, and the tracks’ audio by maximizing the agreement between these modality representations using a contrastive loss.
Codified audio language modeling learns useful representations for music information retrieval
The strength of Jukebox’s representations is interpreted as evidence that modeling audio instead of tags provides richer representations for MIR.
CT-SAT: Contextual Transformer for Sequential Audio Tagging
A contextual Transformer (cTransformer) with a bidirectional decoder that exploits the forward and backward information of event sequences is proposed. Experiments show that, compared to CTC-based methods, the cTransformer can effectively combine fine-grained acoustic representations from the encoder with coarse-grained audio event cues, exploiting contextual information to successfully recognize and predict audio event sequences.
Unsupervised Mismatch Localization in Cross-Modal Sequential Data
A hierarchical Bayesian deep learning model is proposed that decomposes the generative process of speech into hierarchically structured latent variables, capturing the relationship between content-mismatched cross-modal sequential data, especially speech-text sequences.
Beyond Just Vision: A Review on Self-Supervised Representation Learning on Multimodal and Temporal Data
This work aims to provide the first comprehensive review of multimodal self-supervised learning methods for temporal data, and introduces a generic pipeline by defining the key components of an SSRL framework.
Deep Unsupervised Contrastive Hashing for Large-Scale Cross-Modal Text-Image Retrieval in Remote Sensing
Experimental results show that the proposed DUCH outperforms state-of-the-art unsupervised cross-modal hashing methods on two multi-modal (image and text) benchmark archives in RS.
Multimodal Learning with Transformers: A Survey
A comprehensive survey of Transformer techniques oriented at multimodal data and a discussion of open problems and potential research directions for the community are presented.


COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations
The results are promising, sometimes on par with the state of the art in the considered tasks, and the embeddings produced with the method correlate well with some acoustic descriptors.
Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings
This paper investigates how L3-Net design choices impact the performance of downstream audio classifiers trained with these embeddings, and shows that audio-informed choices of input representation are important, and that using sufficient data for training the embedding is key.
Unsupervised Learning of Semantic Audio Representations
  • A. Jansen, M. Plakal, R. Saurous
  • Computer Science
  • 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
This work considers several class-agnostic semantic constraints that apply to unlabeled nonspeech audio and proposes low-dimensional embeddings of the input spectrograms that recover 41% and 84% of the performance of their fully-supervised counterparts when applied to downstream query-by-example sound retrieval and sound event classification tasks, respectively.
Semi-supervised Triplet Loss Based Learning of Ambient Audio Embeddings
This paper combines unsupervised and supervised triplet loss based learning into a semi-supervised representation learning approach, whereby the positive samples for those triplets whose anchors are unlabeled are obtained either by applying a transformation to the anchor, or by selecting the nearest sample in the training set.
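The positive-selection rule described above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the function names, the Euclidean distance, and the toy 2-D data are assumptions, and the "transformation of the anchor" alternative is only noted in a comment:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss pushing the anchor closer to the positive than the negative."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def choose_positive(anchor, label, train_X, train_y, rng):
    """Labeled anchor: sample a same-class training example.
    Unlabeled anchor: fall back to its nearest neighbour in the training
    set (applying an audio transformation to the anchor would be the
    other option described in the paper)."""
    if label is not None:
        same_class = train_X[train_y == label]
        return same_class[rng.integers(len(same_class))]
    dists = np.linalg.norm(train_X - anchor, axis=1)
    return train_X[np.argmin(dists)]

rng = np.random.default_rng(1)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
y = np.array([0, 0, 1, 1])
pos_unlabeled = choose_positive(np.array([0.1, 0.0]), None, X, y, rng)
pos_labeled = choose_positive(np.array([0.1, 0.0]), 0, X, y, rng)
```

The semi-supervised twist is entirely in `choose_positive`: labeled and unlabeled anchors feed the same triplet loss, differing only in where their positives come from.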
musicnn: Pre-trained convolutional neural networks for music audio tagging
The musicnn library contains a set of pre-trained musically motivated convolutional neural networks for music audio tagging, which can be used as out-of-the-box music audio taggers, as music feature extractors, or as pre-trained models for transfer learning.
Self-Supervised Learning by Cross-Modal Audio-Video Clustering
Cross-Modal Deep Clustering (XDC), a novel self-supervised method that leverages unsupervised clustering in one modality as a supervisory signal for the other modality, is proposed; it is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.
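A toy sketch of that clustering-as-supervision idea: cluster one modality's embeddings and use the cluster IDs as pseudo-labels for the other. This is not the XDC implementation; the plain k-means, the farthest-point initialisation (chosen to keep the toy example deterministic), and the synthetic blobs are all assumptions:

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Plain k-means with farthest-point initialisation; returns the
    cluster assignment of each row of X."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.stack(centers)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Two well-separated blobs stand in for audio embeddings of two activities.
rng = np.random.default_rng(2)
audio_emb = np.vstack([rng.normal(0.0, 0.1, size=(10, 4)),
                       rng.normal(5.0, 0.1, size=(10, 4))])
# Cluster one modality; the resulting IDs become classification targets
# (pseudo-labels) for training an encoder on the paired video clips.
pseudo_labels = kmeans(audio_emb, k=2)
```

No human labels are involved: the structure discovered in the audio space supervises the video encoder, and vice versa in the full method.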
Audio Set: An ontology and human-labeled dataset for audio events
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
SampleCNN: End-to-End Deep Convolutional Neural Networks Using Very Small Filters for Music Classification
A CNN architecture that learns representations using sample-level filters, beyond typical frame-level input representations, is proposed and extended with a multi-level and multi-scale feature aggregation technique; transfer learning is subsequently conducted for several music classification tasks.
Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders
A powerful new WaveNet-style autoencoder model is detailed that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform, and NSynth, a large-scale, high-quality dataset of musical notes an order of magnitude larger than comparable public datasets, is introduced.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.