Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags
@article{Favory2021LearningCT,
  title={Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags},
  author={Xavier Favory and Konstantinos Drossos and Tuomas Virtanen and Xavier Serra},
  journal={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021},
  pages={596-600}
}
Self-supervised audio representation learning offers an attractive alternative for obtaining generic audio embeddings that can be employed in various downstream tasks. Published approaches that consider both audio and the words/tags associated with it do not employ text processing models capable of generalizing to tags unknown during training. In this work we propose a method for learning audio representations using an audio autoencoder (AAE), a general word embeddings model (WEM…
8 Citations
Learning music audio representations via weak language supervision
- Computer Science, ICASSP
- 2022
This work designs a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks and confirms that MuLaP effectively leverages audio-caption pairs to learn representations that are competitive with audio-only and cross-modal self-supervised methods in the literature.
Enriched Music Representations With Multiple Cross-Modal Contrastive Learning
- Computer Science, IEEE Signal Processing Letters
- 2021
This paper aligns the latent representations obtained from playlist-track interactions, genre metadata, and the tracks' audio, by maximizing the agreement between these modality representations using a contrastive loss.
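The alignment objective summarized above, maximizing agreement between paired modality representations with a contrastive loss, can be sketched as an in-batch NT-Xent-style loss. This is a generic illustration, not the paper's implementation; the function and variable names are our own:

```python
import numpy as np

def contrastive_alignment_loss(z_a, z_b, temperature=0.1):
    """In-batch contrastive loss between two modality embeddings.

    Rows of z_a and z_b are paired (row i of each is a positive pair);
    all other rows in the batch act as negatives.
    """
    # L2-normalize so similarities are cosine similarities
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    # Pairwise similarity matrix, scaled by temperature
    logits = z_a @ z_b.T / temperature
    # Log-softmax over each row; positives sit on the diagonal
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy toward the matching index, averaged over the batch
    return -np.mean(np.diag(log_probs))
```

Minimizing this pulls matching modality pairs together while pushing apart mismatched pairs within the batch.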
Codified audio language modeling learns useful representations for music information retrieval
- Computer Science, ArXiv
- 2021
The strength of Jukebox's representations is interpreted as evidence that modeling audio instead of tags provides richer representations for MIR.
CT-SAT: Contextual Transformer for Sequential Audio Tagging
- Computer Science, ArXiv
- 2022
A contextual Transformer (cTransformer) with a bidirectional decoder that exploits the forward and backward information of event sequences is proposed; experiments show that, compared to CTC-based methods, the cTransformer effectively combines fine-grained acoustic representations from the encoder with coarse-grained audio event cues, exploiting contextual information to recognize and predict audio event sequences.
Unsupervised Mismatch Localization in Cross-Modal Sequential Data
- Computer Science, ArXiv
- 2022
A hierarchical Bayesian deep learning model is proposed that decomposes the generative process of speech into hierarchically structured latent variables, capturing the relationship between content-mismatched cross-modal sequential data, especially speech-text sequences.
Beyond Just Vision: A Review on Self-Supervised Representation Learning on Multimodal and Temporal Data
- Computer Science, ArXiv
- 2022
This work aims to provide the first comprehensive review of multimodal self-supervised learning methods for temporal data, and introduces a generic pipeline by defining the key components of an SSRL framework.
Deep Unsupervised Contrastive Hashing for Large-Scale Cross-Modal Text-Image Retrieval in Remote Sensing
- Computer Science, ArXiv
- 2022
Experimental results show that the proposed DUCH outperforms state-of-the-art unsupervised cross-modal hashing methods on two multi-modal (image and text) benchmark archives in RS.
Multimodal Learning with Transformers: A Survey
- Computer Science, ArXiv
- 2022
A comprehensive survey of Transformer techniques oriented at multimodal data and a discussion of open problems and potential research directions for the community are presented.
References
Showing 1-10 of 26 references
COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations
- Computer Science, ArXiv
- 2020
The results are promising, sometimes on par with the state of the art on the considered tasks, and the embeddings produced with the method correlate well with some acoustic descriptors.
Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings
- Computer Science, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
This paper investigates how L3-Net design choices impact the performance of downstream audio classifiers trained with these embeddings, and shows that audio-informed choices of input representation are important, and that using sufficient data for training the embedding is key.
Unsupervised Learning of Semantic Audio Representations
- Computer Science, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2018
This work considers several class-agnostic semantic constraints that apply to unlabeled nonspeech audio and proposes low-dimensional embeddings of the input spectrograms that recover 41% and 84% of the performance of their fully-supervised counterparts when applied to downstream query-by-example sound retrieval and sound event classification tasks, respectively.
Semi-supervised Triplet Loss Based Learning of Ambient Audio Embeddings
- Computer Science, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
This paper combines unsupervised and supervised triplet loss based learning into a semi-supervised representation learning approach, whereby the positive samples for those triplets whose anchors are unlabeled are obtained either by applying a transformation to the anchor, or by selecting the nearest sample in the training set.
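The semi-supervised triplet construction summarized above can be sketched as follows. This is a generic illustration under our own naming (`triplet_loss`, `positive_for_unlabeled` are hypothetical helpers), not the authors' code:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge triplet loss: pull anchors toward positives, push from negatives.

    Inputs are (batch, dim) arrays of embeddings.
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=1)  # squared distance to positive
    d_neg = np.sum((anchor - negative) ** 2, axis=1)  # squared distance to negative
    return np.mean(np.maximum(0.0, d_pos - d_neg + margin))

def positive_for_unlabeled(anchor, train_set, transform=None):
    """Positive selection for an unlabeled anchor, as described above:
    either apply a transformation to the anchor, or select the nearest
    sample in the training set."""
    if transform is not None:
        return transform(anchor)
    dists = np.linalg.norm(train_set - anchor, axis=1)
    return train_set[np.argmin(dists)]
```

Labeled anchors would instead draw positives from samples sharing their label, combining the supervised and unsupervised triplets in one objective.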
musicnn: Pre-trained convolutional neural networks for music audio tagging
- Computer Science, ArXiv
- 2019
The musicnn library contains a set of pre-trained musically motivated convolutional neural networks for music audio tagging, which can be used as out-of-the-box music audio taggers, as music feature extractors, or as pre-trained models for transfer learning.
Self-Supervised Learning by Cross-Modal Audio-Video Clustering
- Computer Science, NeurIPS
- 2020
Cross-Modal Deep Clustering (XDC), a novel self-supervised method that leverages unsupervised clustering in one modality as a supervisory signal for the other, is proposed; it is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.
Audio Set: An ontology and human-labeled dataset for audio events
- Computer Science, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2017
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
SampleCNN: End-to-End Deep Convolutional Neural Networks Using Very Small Filters for Music Classification
- Computer Science
- 2018
A CNN architecture is proposed that learns representations using sample-level filters beyond typical frame-level input representations; it is extended with a multi-level and multi-scale feature aggregation technique, and transfer learning is subsequently conducted for several music classification tasks.
Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders
- Computer Science, ICML
- 2017
A powerful new WaveNet-style autoencoder model is detailed that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform, and NSynth, a large-scale, high-quality dataset of musical notes an order of magnitude larger than comparable public datasets, is introduced.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Computer Science, NAACL
- 2019
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.