Local Information Assisted Attention-Free Decoder for Audio Captioning

@article{Xiao2022LocalIA,
  title={Local Information Assisted Attention-Free Decoder for Audio Captioning},
  author={Feiyang Xiao and Jian Guan and Haiyan Lan and Qiaoxi Zhu and Wenwu Wang},
  journal={IEEE Signal Processing Letters},
  year={2022},
  volume={29},
  pages={1604--1608}
}
Automated audio captioning aims to describe audio data with natural-language captions. Existing methods often adopt an encoder-decoder structure, in which an attention-based decoder (e.g., the Transformer decoder) is widely used and achieves state-of-the-art performance. Although this approach effectively captures global information within audio data via the self-attention mechanism, it may ignore events of short duration, due to its limitation in capturing local information in an…
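The paper's actual attention-free decoder is not reproduced on this page. Purely as a rough illustration of the global-vs-local contrast the abstract draws, the sketch below compares a single-head self-attention step (every frame attends to every other frame, so a brief event influences all outputs) with a simple windowed average (each output depends only on neighbouring frames). The function names, window size, and toy data are all hypothetical, not taken from the paper.

```python
import numpy as np

def self_attention(x):
    """Global self-attention (single head, no learned projections, for brevity):
    every time step attends to every other step, capturing global context."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

def local_average(x, k=3):
    """Attention-free local operator (illustrative): a moving average over a
    window of k frames, so each output depends only on nearby frames --
    the kind of localized information the abstract says pure self-attention
    can under-represent for short-duration events."""
    T = x.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    return np.stack([xp[t:t + k].mean(axis=0) for t in range(T)])

# Toy "audio" features: 8 frames, 4 dims, with one short high-energy event at frame 3.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4)) * 0.1
x[3] += 5.0  # brief, high-energy event

glob = self_attention(x)   # every output mixes in the event frame
loc = local_average(x)     # only outputs within the window of frame 3 are affected
```

With the local operator, frames far from the event (e.g., frame 0) remain near zero, while frames 2-4 carry the event's energy, which is the locality property an attention-free decoder can exploit.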

References

Showing 1-10 of 33 references

Audio Captioning Transformer

TLDR
An Audio Captioning Transformer (ACT) is proposed, a full Transformer network based on an encoder-decoder architecture that is entirely convolution-free, giving it a better ability to model global information within an audio signal and to capture temporal relationships between audio events.

Leveraging Pre-trained BERT for Audio Captioning

TLDR
PANNs is applied as the encoder and the decoder is initialized from publicly available pre-trained BERT models for audio captioning; these models achieve results competitive with existing audio captioning methods on the AudioCaps dataset.

An Encoder-Decoder Based Audio Captioning System With Transfer and Reinforcement Learning

TLDR
The results show that the proposed techniques significantly improve the scores of the evaluation metrics; however, reinforcement learning may adversely impact the quality of the generated captions.

CL4AC: A Contrastive Loss for Audio Captioning

TLDR
In CL4AC, self-supervision signals derived from the original audio-text paired data are used to exploit the correspondences between audio and text by contrasting samples, which can improve the quality of the latent representation and the alignment between audio and text when trained with limited data.

Audio Captioning with Meshed-Memory Transformer (Technical Report)

TLDR
A sequence-to-sequence model is proposed that consists of a CNN-based encoder, a memory-augmented refiner, and a meshed decoder, refining a multi-level representation of the relationships between audio features by integrating learned a priori knowledge.

Diverse Audio Captioning Via Adversarial Training

TLDR
This work proposes an adversarial training framework for audio captioning based on a conditional generative adversarial network (C-GAN), which aims at improving the naturalness and diversity of generated captions.

Automated Audio Captioning: an Overview of Recent Progress and New Challenges

TLDR
A comprehensive review of the published contributions in automated audio captioning, covering existing approaches, evaluation metrics, and datasets, which discusses open challenges and envisages possible future research directions.

Leveraging State-of-the-art ASR Techniques to Audio Captioning

TLDR
Experimental results indicate that the trained models significantly outperform the baseline system of DCASE 2021 Challenge Task 6.

Automated audio captioning with recurrent neural networks

TLDR
Results from the metrics show that the proposed method can predict words appearing in the original caption, though not always in the correct order.

A Transformer-based Audio Captioning Model with Keyword Estimation

TLDR
A Transformer-based audio captioning model with keyword estimation, called TRACKE, that addresses the word-selection indeterminacy of the main AAC task while executing the sub-task of acoustic event detection/acoustic scene classification (i.e., keyword estimation).