Improving the Performance of Automated Audio Captioning via Integrating the Acoustic and Semantic Information
@inproceedings{Ye2021ImprovingTP,
  title={Improving the Performance of Automated Audio Captioning via Integrating the Acoustic and Semantic Information},
  author={Zhongjie Ye and Helin Wang and Dongchao Yang and Yuexian Zou},
  booktitle={Workshop on Detection and Classification of Acoustic Scenes and Events},
  year={2021}
}
Automated audio captioning (AAC), which spans acoustic signal processing and natural language processing, aims to generate human-readable sentences for audio clips and has developed rapidly in recent years. Current models are generally based on the neural encoder-decoder architecture, and their decoders mainly use acoustic information extracted from a CNN-based encoder. However, these models largely ignore semantic information that could help the AAC model generate more meaningful descriptions. This paper…
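The paper's full architecture is not reproduced on this page, but the core idea, letting the caption decoder attend to semantic keyword embeddings alongside CNN-extracted acoustic features, can be sketched in a few lines of PyTorch. Everything below (module sizes, fusion by sequence concatenation, keyword vocabulary size) is an illustrative assumption, not the authors' exact model.

```python
# Minimal sketch (not the authors' exact model) of fusing acoustic and
# semantic (keyword) information in an encoder-decoder captioner.
import torch
import torch.nn as nn

class KeywordFusionCaptioner(nn.Module):
    def __init__(self, vocab_size, n_keywords, d_model=256):
        super().__init__()
        # CNN encoder over log-mel spectrograms (batch, 1, time, mels)
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, d_model, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((None, 1)),  # pool away the mel axis
        )
        self.keyword_emb = nn.Embedding(n_keywords, d_model)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, mels, keyword_ids, tokens):
        # Acoustic memory: (batch, time', d_model)
        acoustic = self.cnn(mels).squeeze(-1).transpose(1, 2)
        # Semantic memory: one embedding per estimated keyword
        semantic = self.keyword_emb(keyword_ids)
        # Fuse by concatenating along the sequence axis so the decoder
        # can attend to both acoustic frames and keyword embeddings.
        memory = torch.cat([acoustic, semantic], dim=1)
        return self.out(self.decoder(self.token_emb(tokens), memory))

logits = KeywordFusionCaptioner(vocab_size=5000, n_keywords=300)(
    torch.randn(2, 1, 128, 64),        # batch of log-mel clips
    torch.randint(0, 300, (2, 5)),     # 5 estimated keywords per clip
    torch.randint(0, 5000, (2, 12)),   # shifted caption tokens
)
```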
8 Citations
iCNN-Transformer: An improved CNN-Transformer with Channel-spatial Attention and Keyword Prediction for Automated Audio Captioning
- Computer Science
- INTERSPEECH
- 2022
The results show that guiding caption generation with multi-level information extracted from the audio clip significantly improves the scores of various evaluation metrics and achieves state-of-the-art performance in the cross-entropy training stage.
FeatureCut: An Adaptive Data Augmentation for Automated Audio Captioning
- Computer Science
- 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
- 2022
An online data augmentation method (FeatureCut) is incorporated into the encoder-decoder framework to let the language decoder make full use of the acoustic features when generating captions; a Kullback-Leibler (KL) divergence between predictions on the original and augmented data encourages AAC models to make similar predictions from different views of the input, balancing the models' learning capability.
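The summary above compresses two ideas: an augmentation applied to the features, and a KL-divergence consistency term between predictions on the original and augmented views. A minimal PyTorch sketch of the consistency term follows; the random time-stripe mask is a stand-in assumption, not FeatureCut's actual masking policy.

```python
import torch
import torch.nn.functional as F

def feature_mask(feats, max_width=16):
    """Zero out a random stripe of time frames (stand-in augmentation)."""
    masked = feats.clone()
    t = feats.size(1)
    width = int(torch.randint(1, max_width + 1, (1,)))
    start = int(torch.randint(0, max(t - width, 1), (1,)))
    masked[:, start:start + width] = 0.0
    return masked

def consistency_loss(model, feats, tokens):
    """KL(original || augmented) between the two caption distributions."""
    p = F.log_softmax(model(feats, tokens), dim=-1)
    q = F.log_softmax(model(feature_mask(feats), tokens), dim=-1)
    # kl_div(input, target, log_target=True) computes KL(target || input)
    return F.kl_div(q, p, log_target=True, reduction="batchmean")

# Toy usage with a stand-in "model" mapping (features, tokens) -> logits.
toy = lambda feats, toks: torch.randn(feats.size(0), toks.size(1), 100)
loss = consistency_loss(toy, torch.randn(2, 64, 128),
                        torch.zeros(2, 10, dtype=torch.long))
```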
Automated audio captioning: an overview of recent progress and new challenges
- Computer Science
- EURASIP Journal on Audio, Speech, and Music Processing
- 2022
A comprehensive review of published contributions in automated audio captioning is presented, spanning existing approaches, evaluation metrics, and datasets.
Diverse Audio Captioning Via Adversarial Training
- Computer Science
- ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2022
This work proposes an adversarial training framework for audio captioning based on a conditional generative adversarial network (C-GAN), which aims at improving the naturalness and diversity of generated captions.
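As a rough illustration of the conditional-GAN setup described above: a discriminator scores (audio, caption) pairs, and the caption generator is trained so that its sampled captions score as real. The Gumbel-softmax relaxation, pooling scheme, and layer sizes below are assumptions for the sketch, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab = 128, 1000
D = nn.Sequential(nn.Linear(2 * d_model, 128), nn.ReLU(), nn.Linear(128, 1))
caption_proj = nn.Linear(vocab, d_model)  # embeds (soft) token one-hots

def caption_vec(token_probs):
    """Mean-pool per-token distributions into a single caption vector."""
    return caption_proj(token_probs).mean(dim=1)

def gan_losses(audio_emb, real_tokens, gen_logits):
    real = caption_vec(F.one_hot(real_tokens, vocab).float())
    fake = caption_vec(F.gumbel_softmax(gen_logits))  # differentiable sample
    bce = F.binary_cross_entropy_with_logits
    d_real = D(torch.cat([audio_emb, real], -1))
    d_fake = D(torch.cat([audio_emb, fake.detach()], -1))
    d_loss = bce(d_real, torch.ones_like(d_real)) + \
             bce(d_fake, torch.zeros_like(d_fake))
    g_score = D(torch.cat([audio_emb, fake], -1))    # no detach: grads to G
    g_loss = bce(g_score, torch.ones_like(g_score))  # G tries to look real
    return d_loss, g_loss

d_loss, g_loss = gan_losses(torch.randn(4, d_model),
                            torch.randint(0, vocab, (4, 12)),
                            torch.randn(4, 12, vocab, requires_grad=True))
```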
Language-Based Audio Retrieval with Textual Embeddings of Tag Names
- Computer Science
- DCASE
- 2022
This work proposes a system based on large-scale pretrained models to extract audio and text embeddings, using the logits predicted over the 527 AudioSet tag categories instead of the 2-D feature maps more commonly extracted from earlier layers of a deep neural network.
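A hedged sketch of that retrieval idea: represent each clip by its 527 AudioSet tag logits, project the text query into the same space, and rank clips by cosine similarity. The linear projection and the text embedding dimension below are illustrative assumptions, not the authors' exact setup.

```python
import torch
import torch.nn.functional as F

n_tags, text_dim = 527, 768
text_proj = torch.nn.Linear(text_dim, n_tags)  # maps text emb -> tag space

def rank_clips(audio_logits, text_emb):
    """Return clip indices sorted by similarity to the text query."""
    q = F.normalize(text_proj(text_emb), dim=-1)  # (1, 527)
    a = F.normalize(audio_logits, dim=-1)         # (n_clips, 527)
    return (a @ q.T).squeeze(-1).argsort(descending=True)

order = rank_clips(torch.randn(100, n_tags), torch.randn(1, text_dim))
```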
Towards Generating Diverse Audio Captions via Adversarial Training
- Computer Science
- ArXiv
- 2022
An adversarial training framework based on a conditional generative adversarial network (C-GAN) is proposed to improve the diversity of audio captioning systems; the results show that the proposed model generates captions with better diversity than state-of-the-art methods.
A Comprehensive Survey of Automated Audio Captioning
- Computer Science
- ArXiv
- 2022
This paper is a comprehensive review covering the benchmark datasets, existing deep learning techniques, and evaluation metrics in automated audio captioning.
Automated Audio Captioning with Epochal Difficult Captions for curriculum learning
- Computer Science
- 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
- 2022
An algorithm, Epochal Difficult Captions, is proposed to supplement the training of any model for the automated audio captioning task; it consistently improves performance by up to 0.013 SPIDEr.
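The paper's exact difficulty schedule is not described in the summary above, so the sketch below shows only the generic curriculum pattern it belongs to: rank captions from easy to hard and widen the training pool each epoch. Using caption length as the difficulty proxy is purely an assumption.

```python
# Generic curriculum-learning sketch (not the paper's exact algorithm).
def curriculum_subset(captions, epoch, total_epochs):
    ranked = sorted(captions, key=len)           # shorter = "easier" (assumed)
    frac = min(1.0, (epoch + 1) / total_epochs)  # grow the pool each epoch
    return ranked[: max(1, int(frac * len(ranked)))]

caps = ["a dog barks",
        "rain falls steadily on a tin roof while thunder rolls"]
for epoch in range(2):
    print(epoch, curriculum_subset(caps, epoch, total_epochs=2))
```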
References
Showing 1-10 of 25 references
Investigating Local and Global Information for Automated Audio Captioning with Transfer Learning
- Computer Science
- ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2021
A topic model for audio descriptions is proposed to comprehensively analyze the hierarchical audio topics that captions commonly cover; it is found that local information and abstract representation learning are more crucial to AAC than global information and temporal relationship learning.
An Encoder-Decoder Based Audio Captioning System with Transfer and Reinforcement Learning
- Computer Science
- DCASE
- 2021
The results show that the proposed techniques significantly improve the scores of the evaluation metrics; however, reinforcement learning may adversely affect the quality of the generated captions.
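A common instantiation of the reinforcement-learning step in captioning (whether this paper uses exactly this variant is not stated above) is self-critical sequence training: a REINFORCE update whose baseline is the reward of the greedy decode. A minimal sketch, with placeholder rewards standing in for actual caption metrics such as CIDEr or SPIDEr:

```python
import torch

def scst_loss(sample_logprobs, sample_reward, greedy_reward):
    """REINFORCE with the greedy decode's reward as the baseline."""
    advantage = sample_reward - greedy_reward  # (batch,)
    return -(advantage.unsqueeze(1) * sample_logprobs).mean()

loss = scst_loss(torch.randn(4, 12),  # log-probs of sampled caption tokens
                 torch.rand(4),       # metric score of sampled captions
                 torch.rand(4))       # metric score of greedy captions
```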
Audio Captioning Based on Transformer and Pre-training for 2020 DCASE Audio Captioning Challenge Technical Report
- Computer Science
- 2020
A sequence-to-sequence model consisting of a CNN encoder and a Transformer decoder is proposed, achieving a SPIDEr score of 0.227 on audio captioning.
A Transformer-based Audio Captioning Model with Keyword Estimation
- Economics, Education
- INTERSPEECH
- 2020
A Transformer-based audio captioning model with keyword estimation, called TRACKE, is proposed; it addresses the word-selection indeterminacy of the main AAC task while executing the sub-task of acoustic event detection / acoustic scene classification (i.e., keyword estimation).
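That multi-task structure, a captioning loss on the decoder plus a keyword-estimation (multi-label tagging) loss on the encoder side, can be sketched as a joint objective. The 0.5 weighting and the tensor shapes below are assumptions, not TRACKE's published configuration.

```python
import torch
import torch.nn.functional as F

def keyword_multitask_loss(caption_logits, caption_targets,
                           keyword_logits, keyword_targets):
    """Caption cross-entropy plus multi-label keyword BCE (weight assumed)."""
    ce = F.cross_entropy(caption_logits.transpose(1, 2), caption_targets)
    bce = F.binary_cross_entropy_with_logits(keyword_logits, keyword_targets)
    return ce + 0.5 * bce

loss = keyword_multitask_loss(torch.randn(2, 12, 5000),          # captions
                              torch.randint(0, 5000, (2, 12)),
                              torch.randn(2, 300),                # keywords
                              torch.rand(2, 300))
```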
Automated Audio Captioning with Temporal Attention Technical Report
- Computer Science
- 2020
This technical report describes the ADSPLAB team's submission for Task 6 of the DCASE 2020 challenge (automated audio captioning) and shows that the system achieves a SPIDEr of 0.172 on the evaluation split of the Clotho dataset.
Audio Caption: Listen and Tell
- Computer Science
- ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
A manually annotated audio captioning dataset is introduced to support automatically generating natural sentences for audio scene description and to bridge the gap between machine perception of audio and of images.
AudioCaps: Generating Captions for Audios in The Wild
- Computer Science
- NAACL
- 2019
A large-scale dataset of 46K audio clips paired with human-written text, collected via crowdsourcing on the AudioSet dataset, is contributed, and two novel components that improve audio captioning performance are proposed: a top-down multi-scale encoder and aligned semantic attention.
Clotho: an Audio Captioning Dataset
- Computer Science
- ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
Clotho, a dataset for audio captioning consisting of 4,981 audio samples of 15 to 30 seconds in duration and 24,905 captions of 8 to 20 words in length, is presented, along with a baseline method to provide initial results.
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition
- Computer Science
- IEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2020
This paper proposes pretrained audio neural networks (PANNs) trained on the large-scale AudioSet dataset, and investigates the performance and computational complexity of PANNs modeled by a variety of convolutional neural networks.
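For readers who want to reuse these models, the authors' companion panns_inference package exposes a small inference API; the usage below follows its README, but treat the exact signatures, shapes, and checkpoint behavior as assumptions to verify against the installed version.

```python
import numpy as np
from panns_inference import AudioTagging

# checkpoint_path=None downloads the pretrained CNN14 checkpoint.
at = AudioTagging(checkpoint_path=None, device='cpu')
audio = np.zeros((1, 32000 * 10), dtype=np.float32)  # 10 s of silence @ 32 kHz
clipwise_output, embedding = at.inference(audio)     # tag probs, clip embedding
print(clipwise_output.shape, embedding.shape)        # expected: (1, 527) (1, 2048)
```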
Audio Set: An ontology and human-labeled dataset for audio events
- Computer Science
- 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2017
The creation of Audio Set, a large-scale dataset of manually annotated audio events, is described; it endeavors to bridge the gap in data availability between image and audio research and to substantially stimulate the development of high-performance audio event recognizers.