• Corpus ID: 238634813

Improving the Performance of Automated Audio Captioning via Integrating the Acoustic and Semantic Information

  • Zhongjie Ye, Helin Wang, Dongchao Yang, Yuexian Zou
  • Workshop on Detection and Classification of Acoustic Scenes and Events
Automated audio captioning (AAC) has developed rapidly in recent years, involving acoustic signal processing and natural language processing to generate human-readable sentences for audio clips. The current models are generally based on the neural encoder-decoder architecture, and their decoder mainly uses acoustic information that is extracted from the CNN-based encoder. However, they have ignored semantic information that could help the AAC model to generate meaningful descriptions. This paper… 

iCNN-Transformer: An improved CNN-Transformer with Channel-spatial Attention and Keyword Prediction for Automated Audio Captioning

The results show that guiding caption generation with multi-level information extracted from the audio clip significantly improves the scores on various evaluation metrics and achieves state-of-the-art performance in the cross-entropy training stage.

FeatureCut: An Adaptive Data Augmentation for Automated Audio Captioning

An online data augmentation method (FeatureCut) is incorporated into the encoder-decoder framework to enable the language decoder to make full use of the acoustic features when generating captions. Kullback-Leibler divergence (K-L divergence) is applied between the original and augmented data to encourage AAC models to make similar predictions from different views of the data, in order to balance the learning capability of AAC models.
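
The K-L divergence term mentioned in this summary can be illustrated with a small sketch. This is not the FeatureCut implementation: the function names, the per-step averaging, the direction of the divergence, and the toy logits are all assumptions chosen for illustration. It only shows how a divergence between the word distributions predicted from an original view and an augmented view could be computed.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def consistency_loss(logits_original, logits_augmented):
    """Mean per-step KL between word distributions from two views (assumed form)."""
    terms = [
        kl_divergence(softmax(a), softmax(b))
        for a, b in zip(logits_original, logits_augmented)
    ]
    return sum(terms) / len(terms)

# Hypothetical decoder logits: 2 decoding steps, vocabulary of 3 words.
original = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
augmented = [[1.8, 0.7, 0.1], [0.1, 1.2, 0.6]]

loss = consistency_loss(original, augmented)      # positive: the views disagree
identical = consistency_loss(original, original)  # zero: the views agree
```

In a training setup of this kind, such a term would typically be added to the usual captioning loss with a weighting factor, so that predictions from the two views are pulled together; the weighting and the exact form used by the paper are not stated in this summary.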

Automated audio captioning: an overview of recent progress and new challenges

A comprehensive review of the published contributions in automated audio captioning, from a variety of existing approaches to evaluation metrics and datasets, is presented.

Diverse Audio Captioning Via Adversarial Training

This work proposes an adversarial training framework for audio captioning based on a conditional generative adversarial network (C-GAN), which aims at improving the naturalness and diversity of generated captions.

Language-Based Audio Retrieval with Textual Embeddings of Tag Names

This work proposes a first system based on large scale pretrained models to extract audio and text embeddings, using logits predicted over the set of 527 AudioSet tag categories, instead of the most commonly used 2-d feature maps extracted from earlier layers in a deep neural network.

Towards Generating Diverse Audio Captions via Adversarial Training

An adversarial training framework based on a conditional generative adversarial network (C-GAN) to improve diversity of audio captioning systems is proposed and the results show that the proposed model can generate captions with better diversity as compared to state-of-the-art methods.

A Comprehensive Survey of Automated Audio Captioning

This current paper situates itself as a comprehensive review covering the benchmark datasets, existing deep learning techniques and the evaluation metrics in automated audio captioning.

Automated Audio Captioning with Epochal Difficult Captions for curriculum learning

An algorithm, Epochal Difficult Captions, is proposed to supplement the training of any model for the Automated Audio Captioning task and consistently improves performance by up to 0.013 SPIDEr score.



Investigating Local and Global Information for Automated Audio Captioning with Transfer Learning

A topic model for audio descriptions is proposed to comprehensively analyze the hierarchical audio topics that are commonly covered. It is found that local information and abstract representation learning are more crucial to AAC than global information and temporal relationship learning.

An Encoder-Decoder Based Audio Captioning System with Transfer and Reinforcement Learning

The results show that the proposed techniques significantly improve the scores of the evaluation metrics; however, reinforcement learning may adversely affect the quality of the generated captions.


A sequence-to-sequence model is proposed that consists of a CNN encoder and a Transformer decoder and achieves a SPIDEr score of 0.227 on audio captioning.

A Transformer-based Audio Captioning Model with Keyword Estimation

A Transformer-based audio-captioning model with keyword estimation called TRACKE that simultaneously solves the word-selection indeterminacy problem with the main task of AAC while executing the sub-task of acoustic event detection/acoustic scene classification (i.e., keyword estimation).


This technical report describes the ADSPLAB team’s submission for Task 6 of the DCASE 2020 challenge (automated audio captioning) and shows that the system achieves a SPIDEr of 0.172 on the evaluation split of the Clotho dataset.

Audio Caption: Listen and Tell

  • Mengyue Wu, Heinrich Dinkel, Kai Yu
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
A manually-annotated dataset for audio caption is introduced to automatically generate natural sentences for audio scene description and to bridge the gap between machine perception of audio and image.

AudioCaps: Generating Captions for Audios in The Wild

A large-scale dataset of 46K audio clips with human-written text pairs collected via crowdsourcing on the AudioSet dataset is contributed and two novel components that help improve audio captioning performance are proposed: the top-down multi-scale encoder and aligned semantic attention.

Clotho: an Audio Captioning Dataset

Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds in duration and 24 905 captions of 8 to 20 words in length, is presented, together with a baseline method that provides initial results.

PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition

This paper proposes pretrained audio neural networks (PANNs) trained on the large-scale AudioSet dataset, and investigates the performance and computational complexity of PANNs modeled by a variety of convolutional neural networks.

Audio Set: An ontology and human-labeled dataset for audio events

The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.