MusCaps: Generating Captions for Music Audio

Ilaria Manco, Emmanouil Benetos, Elio Quinton, György Fazekas
2021 International Joint Conference on Neural Networks (IJCNN)
Content-based music information retrieval has seen rapid progress with the adoption of deep learning. Current approaches to high-level music description typically make use of classification models, such as in auto-tagging or genre and mood classification. In this work, we propose to address music description via audio captioning, defined as the task of generating a natural language description of music audio content in a human-like manner. To this end, we present the first music audio… 


Contrastive Audio-Language Learning for Music

This work proposes MusCALL, a framework for Music Contrastive Audio-Language Learning: a dual-encoder architecture that learns the alignment between pairs of music audio and descriptive sentences, producing multimodal embeddings that can be used for text-to-audio and audio-to-text retrieval out of the box.
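The dual-encoder contrastive objective behind this kind of audio-language alignment can be sketched in a few lines of NumPy. This is a generic symmetric InfoNCE formulation, not MusCALL's exact loss; the temperature value and batch construction are assumptions for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired audio/text embeddings.

    Matching pairs sit on the diagonal of the similarity matrix; all
    off-diagonal entries act as in-batch negatives.
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature            # (batch, batch) scaled cosine similarities
    labels = np.arange(len(logits))

    def cross_entropy(logits, labels):
        logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    # Average the audio-to-text and text-to-audio directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

# Toy check: perfectly aligned pairs score a lower loss than shuffled pairs.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
aligned = contrastive_loss(emb, emb)
shuffled = contrastive_loss(emb, emb[::-1])
```

Once trained, the same similarity matrix supports retrieval in both directions: rank columns for text-to-audio, rows for audio-to-text.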

Interpreting Song Lyrics with an Audio-Informed Pre-trained Language Model

Experimental results show that the additional audio information helps the BART-fusion model to understand words and music better, and to generate more precise interpretations.

Music Question Answering: Cognize and Perceive Music

The Music Question Answering task is put forward, which aims to provide accurate answers to questions about a given piece of music; an MQA dataset, containing seven basic question categories, is built on MagnaTagATune.

Textomics: A Dataset for Genomics Data Summary Generation

Inspired by the successful applications of k nearest neighbors in modeling genomics data, a kNN-Vec2Text model is proposed to address two novel tasks, generating a textual summary from a genomics data matrix and vice versa, and substantial improvement on this dataset is observed.
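The retrieval step behind such a kNN approach can be sketched as follows. The toy corpus, summaries, and Euclidean distance metric here are illustrative assumptions; the actual model conditions a text generator on the retrieved neighbors rather than returning them directly.

```python
import numpy as np

# Hypothetical toy corpus: each data vector is paired with a text summary.
corpus_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
corpus_texts = [
    "pathway A up-regulated",
    "pathway B up-regulated",
    "pathway A mildly up-regulated",
]

def knn_retrieve(query, k=2):
    """Return the summaries of the k nearest corpus vectors (Euclidean distance),
    which a downstream generator would then condition on."""
    dists = np.linalg.norm(corpus_vecs - query, axis=1)
    idx = np.argsort(dists)[:k]
    return [corpus_texts[i] for i in idx]

# A query close to the first cluster retrieves both "pathway A" summaries.
neighbors = knn_retrieve(np.array([0.95, 0.02]))
```

With brute-force distances this is O(corpus size) per query; at scale an approximate nearest-neighbor index would typically replace the `argsort`.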




Audio Captioning using Gated Recurrent Units

A novel deep network architecture with audio embeddings is presented to predict audio captions, and the experimental results show that the proposed BiGRU-based deep model outperforms state-of-the-art results.

COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations

The results are promising, sometimes on par with the state of the art on the considered tasks, and the embeddings produced with the method correlate well with some acoustic descriptors.

Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval

Experimental results show that the proposed method successfully uses a pre-trained language model for audio captioning, and that the oracle performance of the pre-trained caption generator is clearly better than that of the conventional method trained from scratch.

WaveTransformer: An Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information

This work presents a novel AAC method, explicitly focused on exploiting the temporal and time-frequency patterns in audio, using the widely adopted Transformer decoder and the freely available splits of the Clotho dataset.

End-to-end Learning for Music Audio Tagging at Scale

This work studies waveform-based and spectrogram-based models trained on datasets of variable size, showing that waveform-based models outperform spectrogram-based ones in large-scale data scenarios and suggesting that music-domain assumptions remain relevant when not enough training data are available.

Automated audio captioning with recurrent neural networks

Results from metrics show that the proposed method can predict words appearing in the original caption, but not always correctly ordered.

Transfer Learning by Supervised Pre-training for Audio-based Music Classification

It is shown that features learned from MSD audio fragments in a supervised manner, using tag labels and user listening data, consistently outperform features learned in an unsupervised manner in this setting, provided that the learned feature extractor is of limited complexity.

Evaluation of CNN-based Automatic Music Tagging Models

A consistent evaluation of different music tagging models is conducted on three datasets, reference results using common evaluation metrics are provided, and all models are evaluated on perturbed inputs to investigate their generalization with respect to time stretch, pitch shift, dynamic range compression, and added white noise.
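One of the perturbations listed, white-noise addition, can be sketched as follows. Scaling the noise to a target signal-to-noise ratio is a common convention for this kind of robustness probe, not necessarily the paper's exact setup.

```python
import numpy as np

def add_white_noise(signal, snr_db, rng=None):
    """Add Gaussian white noise so the result has the requested
    signal-to-noise ratio in dB relative to the clean signal."""
    if rng is None:
        rng = np.random.default_rng(0)
    sig_power = np.mean(signal ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    noise = rng.normal(scale=np.sqrt(noise_power), size=signal.shape)
    return signal + noise

# Toy check: a 440 Hz sine perturbed at 20 dB SNR; the measured SNR
# of the result should land close to the requested value.
t = np.linspace(0, 1, 16000, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)
noisy = add_white_noise(clean, snr_db=20)
measured_snr = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
```

Sweeping `snr_db` downward and re-running the tagger on each perturbed copy yields the kind of degradation curve such evaluations report.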

A Transformer-based Audio Captioning Model with Keyword Estimation

A Transformer-based audio-captioning model with keyword estimation, called TRACKE, is presented; it solves the word-selection indeterminacy problem of the main AAC task while executing the sub-task of acoustic event detection / acoustic scene classification (i.e., keyword estimation).

Video Understanding as Machine Translation

This work removes the need for negative sampling by taking a generative modeling approach that poses the objective as a translation problem between modalities, which allows for a wide variety of downstream video understanding tasks by means of a single unified framework.