MusCaps: Generating Captions for Music Audio

@article{Manco2021MusCaps,
  title={MusCaps: Generating Captions for Music Audio},
  author={Ilaria Manco and Emmanouil Benetos and Elio Quinton and Gy{\"o}rgy Fazekas},
  journal={2021 International Joint Conference on Neural Networks (IJCNN)},
  year={2021}
}
Content-based music information retrieval has seen rapid progress with the adoption of deep learning. Current approaches to high-level music description typically make use of classification models, such as in auto-tagging or genre and mood classification. In this work, we propose to address music description via audio captioning, defined as the task of generating a natural language description of music audio content in a human-like manner. To this end, we present the first music audio… 
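The captioning task defined above is typically cast as generating a description one word at a time, conditioned on the audio. A minimal pure-Python sketch of such a greedy decoding loop follows; all names (`score_next_word`, `greedy_caption`, `toy_vocab`) are hypothetical illustrations, not the paper's implementation, and the scoring function is a deterministic toy stand-in for a neural decoder.

```python
def score_next_word(audio_features, prefix, vocab):
    """Toy stand-in for a decoder: deterministically scores each candidate word.
    A real model would condition a neural language model on the audio features."""
    scores = {}
    for i, word in enumerate(vocab):
        # Fake score: prefer words not already used, breaking ties by vocab order.
        scores[word] = (0 if word in prefix else 1, -i)
    return scores

def greedy_caption(audio_features, vocab, max_len=5, end_token="<end>"):
    """Generate a caption word by word, always taking the highest-scoring word."""
    caption = []
    for _ in range(max_len):
        scores = score_next_word(audio_features, caption, vocab)
        best = max(scores, key=scores.get)
        if best == end_token:
            break
        caption.append(best)
    return " ".join(caption)

toy_vocab = ["a", "calm", "piano", "melody", "<end>"]
print(greedy_caption(audio_features=None, vocab=toy_vocab))
```

Beam search is a common drop-in replacement for the greedy step when single-word mistakes compound.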


Textomics: A Dataset for Genomics Data Summary Generation
Inspired by the successful application of k-nearest neighbors to modeling genomics data, a kNN-Vec2Text model is proposed to address two novel tasks: generating a textual summary from a genomics data matrix and vice versa. Substantial improvement on this dataset is observed.
Audio Captioning using Gated Recurrent Units
A novel deep network architecture with audio embeddings is presented to predict audio captions, and the experimental results show that the proposed BiGRU-based deep model outperforms state-of-the-art results.
Multi-task Regularization Based on Infrequent Classes for Audio Captioning
This paper proposes two methods to mitigate the class imbalance problem in an autoencoder setting for audio captioning, and defines a multi-label side task based on clip-level content word detection by training a separate decoder.
AudioCaps: Generating Captions for Audios in The Wild
A large-scale dataset of 46K audio clips paired with human-written text, collected via crowdsourcing on the AudioSet dataset, is contributed, and two novel components that help improve audio captioning performance are proposed: a top-down multi-scale encoder and aligned semantic attention.
Listen Carefully and Tell: An Audio Captioning System Based on Residual Learning and Gammatone Audio Representation
This work proposes an automatic audio captioning system based on residual learning in the encoder phase that surpasses the baseline system in the challenge results.
Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval
Experimental results show that the proposed method succeeded in using a pre-trained language model for audio captioning, and the oracle performance of the pre-trained-model-based caption generator was clearly better than that of the conventional method trained from scratch.
End-to-end Learning for Music Audio Tagging at Scale
This work studies how waveform-based models outperform spectrogram-based ones in large-scale data scenarios, using datasets of variable size for training, and suggests that music-domain assumptions are relevant when not enough training data are available.
Automated audio captioning with recurrent neural networks
Results from metrics show that the proposed method can predict words appearing in the original caption, but not always correctly ordered.
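The observation above, that metrics can reward captions containing the right words in the wrong order, follows from how n-gram overlap metrics work: unigram precision ignores word order entirely, while bigram precision penalizes it. A small illustrative sketch (not BLEU's exact clipped formulation, and `ngram_precision` is a hypothetical helper):

```python
def ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = [tuple(cand[i:i + n]) for i in range(len(cand) - n + 1)]
    ref_ngrams = [tuple(ref[i:i + n]) for i in range(len(ref) - n + 1)]
    if not cand_ngrams:
        return 0.0
    hits = sum(1 for g in cand_ngrams if g in ref_ngrams)
    return hits / len(cand_ngrams)

reference = "a man plays the piano"
shuffled = "piano the plays man a"
print(ngram_precision(shuffled, reference, 1))  # 1.0: every word appears in the reference
print(ngram_precision(shuffled, reference, 2))  # 0.0: no bigram survives the shuffle
```

This is why caption benchmarks report higher-order n-gram scores (or order-sensitive metrics such as METEOR and CIDEr) alongside unigram overlap.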
Transfer Learning by Supervised Pre-training for Audio-based Music Classification
It is shown that features learned from MSD audio fragments in a supervised manner, using tag labels and user listening data, consistently outperform features learned in an unsupervised manner in this setting, provided that the learned feature extractor is of limited complexity.
Audio Caption: Listen and Tell
  • Mengyue Wu, Heinrich Dinkel, Kai Yu
  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019
A manually annotated dataset for audio captioning is introduced to automatically generate natural sentences for audio scene description and to bridge the gap between machine perception of audio and of images.
Data-Driven Harmonic Filters for Audio Representation Learning
Experimental results show that a simple convolutional neural network back-end with the proposed front-end outperforms state-of-the-art baseline methods in automatic music tagging, keyword spotting, and sound event tagging tasks.