Clotho: an Audio Captioning Dataset

@inproceedings{Drossos2020ClothoAA,
  title={Clotho: an Audio Captioning Dataset},
  author={Konstantinos Drossos and Samuel Lipping and Tuomas Virtanen},
  booktitle={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2020},
  pages={736--740}
}
  • Konstantinos Drossos, Samuel Lipping, Tuomas Virtanen
  • Published 21 October 2019
  • Computer Science
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Audio captioning is the novel task of describing general audio content using free text. It is an intermodal translation task (not speech-to-text), in which a system accepts an audio signal as input and outputs a textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds in duration and 24,905 captions of eight to 20 words in length, and a baseline method to provide initial results…
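
Since each of the 4981 clips carries five crowdsourced captions (5 × 4981 = 24,905), pairing audio with captions amounts to a simple flattening step. The minimal Python sketch below illustrates this; it assumes the CSV layout (file_name, caption_1 … caption_5) and directory names used in the public Clotho distribution, so treat the paths as placeholders for a local copy.

import csv
from pathlib import Path

# Assumed locations of a local Clotho copy; adjust to your setup.
AUDIO_DIR = Path("clotho/development")
CAPTIONS_CSV = Path("clotho/clotho_captions_development.csv")

def load_clotho_pairs(captions_csv, audio_dir):
    # Yield one (audio_path, caption) pair per caption. Every 15-30 s
    # clip has five captions, so each audio file appears five times.
    with open(captions_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            audio_path = audio_dir / row["file_name"]
            for i in range(1, 6):
                yield audio_path, row[f"caption_{i}"]

pairs = list(load_clotho_pairs(CAPTIONS_CSV, AUDIO_DIR))
print(f"{len(pairs)} audio-caption pairs")  # expect 5x the number of clips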

Citations

Automated Audio Captioning with Weakly Supervised Pre-Training and Word Selection Methods
TLDR
This paper proposes an automated audio captioning solution based on weakly supervised pre-training and word selection methods that achieves the best SPIDEr score of 0.310 in DCASE 2021 Challenge Task 6.
Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning
TLDR
This work presents a sequence-to-sequence approach that explicitly exploits the difference in length between the audio and caption sequences by applying temporal sub-sampling to the audio input sequence.
WaveTransformer: An Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information
TLDR
This work presents a novel AAC method that explicitly exploits the temporal and time-frequency patterns in audio, using the widely adopted Transformer decoder and the freely available splits of the Clotho dataset.
Caption Feature Space Regularization for Audio Captioning
TLDR
A two-stage framework for audio captioning is proposed: the first stage constructs a proxy feature space that reduces the distances between captions correlated with the same audio, and the second stage uses the proxy feature space as additional supervision to optimize the model in a direction that benefits all the correlated captions.
Automated Audio Captioning using Audio Event Clues
TLDR
Results of extensive experiments show that using audio event labels alongside the acoustic features improves recognition performance, and the proposed method either outperforms or achieves competitive results against state-of-the-art models.
Listen Carefully and Tell: An Audio Captioning System Based on Residual Learning and Gammatone Audio Representation
TLDR
This work proposes an automated audio captioning system based on residual learning in the encoder phase that surpasses the baseline system in the challenge results.
Multi-task Regularization Based on Infrequent Classes for Audio Captioning
TLDR
This paper proposes two methods to mitigate the class imbalance problem in an autoencoder setting for audio captioning, and defines a multi-label side task based on clip-level content word detection by training a separate decoder.
Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning
TLDR
This paper utilizes the freely available Clotho dataset to compare four different pre-trained machine listening models, four word embedding models, and their combinations in many different settings and suggests that YAMNet combined with BERT embeddings produces the best captions.
Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering
TLDR
Clotho-AQA, a dataset for audio question answering consisting of 1991 audio files, each between 15 and 30 seconds in duration, selected from the Clotho dataset, is introduced.
Learning Audio-Video Modalities from Image Captions
TLDR
A new video mining pipeline is proposed that transfers captions from image captioning datasets to video clips with no additional manual effort, and it is shown that training a multimodal transformer-based model on this data achieves competitive performance on video retrieval and video captioning.
...

References

Showing 10 of 13 references.
Crowdsourcing a Dataset of Audio Captions
TLDR
A three-step framework for crowdsourcing an audio captioning dataset is presented, based on concepts and practices followed in the creation of widely used image captioning and machine translation datasets, and results show that the resulting dataset has fewer typographical errors than the initial captions.
AudioCaps: Generating Captions for Audios in The Wild
TLDR
A large-scale dataset of 46K audio clips paired with human-written text, collected via crowdsourcing on the AudioSet dataset, is contributed, and two novel components that help improve audio captioning performance are proposed: a top-down multi-scale encoder and aligned semantic attention.
Audio Caption: Listen and Tell
  • Mengyue Wu, Heinrich Dinkel, Kai Yu
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
TLDR
A manually annotated dataset for audio captioning is introduced to automatically generate natural sentences for audio scene description and to bridge the gap between machine perception of audio and of images.
Automated audio captioning with recurrent neural networks
TLDR
Evaluation metrics show that the proposed method can predict words appearing in the original caption, though not always in the correct order.
Audio Set: An ontology and human-labeled dataset for audio events
TLDR
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
Deep Visual-Semantic Alignments for Generating Image Descriptions
  • A. Karpathy, Li Fei-Fei
  • Computer Science
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 2017
TLDR
A model that generates natural language descriptions of images and their regions based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding is presented.
Microsoft COCO: Common Objects in Context
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding…
Freesound technical demo
TLDR
This demo introduces Freesound to the multimedia community and shows its potential as a research resource.
Adam: A Method for Stochastic Optimization
TLDR
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
TLDR
Qualitatively, the proposed RNN Encoder–Decoder model learns a semantically and syntactically meaningful representation of linguistic phrases.