COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations

@article{Favory2020COALACA,
  title={COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations},
  author={Xavier Favory and Konstantinos Drossos and Tuomas Virtanen and Xavier Serra},
  journal={ArXiv},
  year={2020},
  volume={abs/2006.08386}
}
Audio representation learning based on deep neural networks (DNNs) has emerged as an alternative to hand-crafted features. To achieve high performance, DNNs often need large amounts of annotated data, which can be difficult and costly to obtain. In this paper, we propose a method for learning audio representations by aligning the learned latent representations of audio and associated tags. Alignment is done by maximizing the agreement between the latent representations of audio and tags, using…
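As a rough illustration of this agreement-maximization step, the sketch below implements a symmetric InfoNCE-style contrastive loss between the two latent spaces. This is a minimal sketch in PyTorch; the function name, the temperature value, and the way the term would be weighted against the autoencoders' reconstruction losses are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn.functional as F

def alignment_loss(z_audio: torch.Tensor,
                   z_tags: torch.Tensor,
                   temperature: float = 0.1) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of paired latent codes.

    z_audio, z_tags: (batch, dim) latent representations of audio clips
    and of their associated tags; row i of each tensor comes from the
    same clip and forms a positive pair, while the other rows in the
    batch act as negatives.
    """
    z_audio = F.normalize(z_audio, dim=-1)       # compare in cosine space
    z_tags = F.normalize(z_tags, dim=-1)
    logits = z_audio @ z_tags.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(z_audio.size(0), device=z_audio.device)
    # Cross-entropy in both directions keeps the alignment symmetric.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

In a co-aligned autoencoder setup, this term would be added to the two reconstruction objectives, e.g. loss = recon_audio + recon_tags + alignment_weight * alignment_loss(z_audio, z_tags), so that each latent space stays informative about its own modality while agreeing with the other.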

Citations

Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags
TLDR
The results show that employing self-attention with multiple attention heads in the tag-based network can induce better learned audio representations.
Learning music audio representations via weak language supervision
TLDR
This work designs a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks and confirms that MuLaP effectively leverages audio-caption pairs to learn representations that are competitive with audio-only and cross-modal self-supervised methods in the literature.
Enriched Music Representations With Multiple Cross-Modal Contrastive Learning
TLDR
This paper aligns the latent representations obtained from playlists-track interactions, genre metadata, and the tracks’ audio, by maximizing the agreement between these modality representations using a contrastive loss.
MusCaps: Generating Captions for Music Audio
TLDR
This work presents the first music audio captioning model, MusCaps, consisting of an encoder-decoder with temporal attention; it represents a shift away from classification-based music description and combines tasks requiring both auditory and linguistic understanding to bridge the semantic gap in music information retrieval.
Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning
TLDR
This paper utilizes the freely available Clotho dataset to compare four different pre-trained machine listening models, four word embedding models, and their combinations in many different settings and suggests that YAMNet combined with BERT embeddings produces the best captions.
Deep Learning-Based Music Instrument Recognition: Exploring Learned Feature Representations
TLDR
This work follows a state-of-the-art AIR approach that combines a deep convolutional neural network architecture with an attention mechanism that is conditioned on a learned input feature representation, which itself is extracted by another CNN model acting as a feature extractor.
Audio Self-supervised Learning: A Survey
TLDR
An overview of the SSL methods used for audio and speech processing applications, the empirical works that exploit the audio modality in multimodal SSL frameworks, and the existing suitable benchmarks to evaluate the power of SSL in the computer audition domain are summarized.
Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval
TLDR
Experimental results show that the proposed method succeeded in using a pre-trained language model for audio captioning, and the oracle performance of the pre-trained-model-based caption generator was clearly better than that of the conventional method trained from scratch.
Contrastive Representation Learning: A Framework and Review
TLDR
A general Contrastive Representation Learning framework is proposed that simplifies and unifies many different contrastive learning methods and a taxonomy for each of the components is provided in order to summarise and distinguish it from other forms of machine learning.
FSD50K: An Open Dataset of Human-Labeled Sound Events
TLDR
FSD50K is introduced, an open dataset containing over 51k audio clips totalling over 100 hours of audio, manually labeled using 200 classes drawn from the AudioSet Ontology, to provide an alternative benchmark dataset and thus foster sound event recognition (SER) research.

References

Showing 1-10 of 44 references
Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings
TLDR
This paper investigates how L3-Net design choices impact the performance of downstream audio classifiers trained with these embeddings, and shows that audio-informed choices of input representation are important, and that using sufficient data for training the embedding is key.
musicnn: Pre-trained convolutional neural networks for music audio tagging
TLDR
The musicnn library contains a set of pre-trained musically motivated convolutional neural networks for music audio tagging, which can be used as out-of-the-box music audio taggers, as music feature extractors, or as pre-trained models for transfer learning.
Sequence to Sequence Autoencoders for Unsupervised Representation Learning from Audio
TLDR
This paper proposes a system for this task using a recurrent sequence-to-sequence autoencoder for unsupervised representation learning from raw audio files, and trains a multilayer perceptron neural network on these feature vectors to predict the class labels.
SampleCNN: End-to-End Deep Convolutional Neural Networks Using Very Small Filters for Music Classification
TLDR
A CNN architecture that learns representations using sample-level filters, beyond typical frame-level input representations, is proposed, extended with a multi-level and multi-scale feature aggregation technique, and subsequently used for transfer learning on several music classification tasks.
Generalized Zero- and Few-Shot Learning via Aligned Variational Autoencoders
TLDR
This work proposes a model where a shared latent space of image features and class embeddings is learned by modality-specific aligned variational autoencoders, aligning the distributions learned from images and from side-information to construct latent features that contain the essential multi-modal information associated with unseen classes.
Transfer Learning by Supervised Pre-training for Audio-based Music Classification
TLDR
It is shown that features learned from MSD audio fragments in a supervised manner, using tag labels and user listening data, consistently outperform features learned in an unsupervised manner in this setting, provided that the learned feature extractor is of limited complexity.
End-to-end Learning for Music Audio Tagging at Scale
TLDR
Training models on datasets of variable size, this work finds that waveform-based models outperform spectrogram-based ones in large-scale data scenarios, and suggests that music-domain assumptions are relevant when not enough training data are available.
Representation Learning of Music Using Artist Labels
TLDR
This paper presents a feature learning approach that utilizes the artist labels attached to every music track as objective metadata, and trains a deep convolutional neural network to classify audio tracks into a large number of artists.
Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders
TLDR
A powerful new WaveNet-style autoencoder model is detailed that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform, and NSynth, a large-scale, high-quality dataset of musical notes an order of magnitude larger than comparable public datasets, is introduced.
Tensorflow Audio Models in Essentia
TLDR
This work presents a set of algorithms that employ TensorFlow in Essentia, allow predictions with pre-trained deep learning models, and are designed to offer flexibility of use, easy extensibility, and real-time inference.