Corpus ID: 238856787

Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning

@inproceedings{Weck2021EvaluatingOM,
  title={Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning},
  author={Benno Weck and Xavier Favory and Konstantinos Drossos and Xavier Serra},
  booktitle={Workshop on Detection and Classification of Acoustic Scenes and Events},
  year={2021}
}
Automated audio captioning (AAC) is the task of automatically generating textual descriptions for general audio signals. A captioning system has to identify various kinds of information in the input signal and express them in natural language. Existing works mainly focus on investigating new methods and on improving performance as measured on existing datasets. Having attracted attention only recently, very few works on AAC study the performance of existing pre-trained audio and natural language…

Citations

Leveraging Pre-trained BERT for Audio Captioning

This study applies PANNs as the encoder and initializes the decoder from publicly available pre-trained BERT models for audio captioning, achieving competitive results with existing audio captioning methods on the AudioCaps dataset.
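
As a rough illustration of this kind of pairing, the sketch below feeds placeholder frame-level audio embeddings (standing in for PANNs features) into a pre-trained BERT model used as a cross-attending caption decoder. Dimensions, the projection layer, and the example captions are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch: pretrained BERT as a cross-attending caption decoder
# over audio-encoder features (placeholder standing in for PANNs output).
import torch
import torch.nn as nn
from transformers import BertLMHeadModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
decoder = BertLMHeadModel.from_pretrained(
    "bert-base-uncased", is_decoder=True, add_cross_attention=True
)

# Placeholder for frame-level embeddings from a pretrained audio tagger
# (assumed 2048-dim features), projected to BERT's hidden size.
audio_feats = torch.randn(2, 31, 2048)          # (batch, frames, feat_dim)
project = nn.Linear(2048, decoder.config.hidden_size)
encoder_states = project(audio_feats)           # (batch, frames, 768)

captions = ["a dog barks while birds chirp", "rain falls on a metal roof"]
batch = tokenizer(captions, return_tensors="pt", padding=True)

out = decoder(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"],
    encoder_hidden_states=encoder_states,
    labels=batch["input_ids"],
)
out.loss.backward()  # train the projection and decoder end to end
```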

Automated Audio Captioning using Audio Event Clues

Results of extensive experiments show that using audio event labels alongside acoustic features improves recognition performance, and the proposed method either outperforms or achieves competitive results with state-of-the-art models.

Matching Text and Audio Embeddings: Exploring Transfer-Learning Strategies for Language-Based Audio Retrieval

An analysis of large-scale pretrained deep learning models used for cross-modal (text-to-audio) retrieval reveals that the proper choice of loss function and fine-tuning of the pretrained models are essential for training a competitive retrieval system.
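
A common way to set up such a retrieval system is to project both modalities into a shared space and train with a symmetric contrastive loss. The sketch below shows this general recipe, not the paper's exact one; the embedding dimension, temperature, and batch size are assumptions.

```python
# Sketch of a symmetric contrastive loss aligning audio and text embeddings
# in a shared space (illustrative dimensions; not the paper's exact setup).
import torch
import torch.nn.functional as F

def contrastive_retrieval_loss(audio_emb, text_emb, temperature=0.07):
    """audio_emb, text_emb: (batch, dim) embeddings of paired clips/captions."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(logits.size(0))            # matching pairs on the diagonal
    loss_a2t = F.cross_entropy(logits, targets)       # retrieve text from audio
    loss_t2a = F.cross_entropy(logits.t(), targets)   # retrieve audio from text
    return (loss_a2t + loss_t2a) / 2

audio_emb = torch.randn(8, 512, requires_grad=True)   # e.g. projected audio-encoder features
text_emb = torch.randn(8, 512, requires_grad=True)    # e.g. projected text-encoder features
loss = contrastive_retrieval_loss(audio_emb, text_emb)
loss.backward()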

Automated Audio Captioning and Language-Based Audio Retrieval

This project involved participation in the DCASE 2022 Challenge (Task 6), which had two subtasks: (1) Automated Audio Captioning and (2) Language-Based Audio Retrieval; the architecture for Automated Audio Captioning is close to the baseline performance, while the model for Language-Based Audio Retrieval surpassed its counterpart.

Automated audio captioning: an overview of recent progress and new challenges

A comprehensive review of published contributions in automated audio captioning is presented, covering existing approaches, evaluation metrics, and datasets.

Separate What You Describe: Language-Queried Audio Source Separation

This paper proposes LASS-Net, an end-to-end neural network that is learned to jointly process acoustic and linguistic information, and separate the target source that is consistent with the language query from an audio mixture.
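
To picture the idea of conditioning a separator on a language query, the sketch below predicts a magnitude-spectrogram mask modulated by a text embedding. The FiLM-style conditioning, layer sizes, and module names are assumptions for illustration, not necessarily LASS-Net's actual design.

```python
# Rough sketch: text-conditioned mask estimation for one target source.
# FiLM-style conditioning is an assumption, not necessarily LASS-Net's design.
import torch
import torch.nn as nn

class QueriedMasker(nn.Module):
    def __init__(self, n_freq=257, text_dim=768, hidden=256):
        super().__init__()
        self.audio_net = nn.Sequential(nn.Linear(n_freq, hidden), nn.ReLU())
        self.film = nn.Linear(text_dim, 2 * hidden)      # per-channel scale and shift
        self.mask_head = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, mix_spec, query_emb):
        # mix_spec: (batch, time, freq) magnitude spectrogram of the mixture
        # query_emb: (batch, text_dim) embedding of the language query
        h = self.audio_net(mix_spec)
        scale, shift = self.film(query_emb).chunk(2, dim=-1)
        h = h * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        mask = self.mask_head(h)                          # values in [0, 1]
        return mask * mix_spec                            # estimated target spectrogram

model = QueriedMasker()
est = model(torch.randn(2, 100, 257).abs(), torch.randn(2, 768))
```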

References


Audio Captioning Based on Combined Audio and Semantic Embeddings

Experimental results show that the proposed BiGRU-based deep model significantly outperforms state-of-the-art results across different evaluation metrics and that the inclusion of semantic information enhances captioning performance.

Multi-task Regularization Based on Infrequent Classes for Audio Captioning

This paper proposes two methods to mitigate the class imbalance problem in an autoencoder setting for audio captioning, and defines a multi-label side task based on clip-level content word detection by training a separate decoder.
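
The multi-task idea can be illustrated as a weighted sum of the usual captioning cross-entropy and a clip-level multi-label content-word detection loss from a separate head. The weighting, vocabulary size, and keyword count below are assumptions, not the paper's settings.

```python
# Sketch of the multi-task objective: caption cross-entropy plus a clip-level
# multi-label content-word detection loss (weights/shapes are illustrative).
import torch
import torch.nn.functional as F

def multitask_loss(caption_logits, caption_targets, keyword_logits, keyword_targets,
                   keyword_weight=0.5, pad_id=0):
    # caption_logits: (batch, steps, vocab); caption_targets: (batch, steps)
    caption_loss = F.cross_entropy(
        caption_logits.flatten(0, 1), caption_targets.flatten(), ignore_index=pad_id
    )
    # keyword_logits/targets: (batch, n_keywords) multi-label clip annotations
    keyword_loss = F.binary_cross_entropy_with_logits(keyword_logits, keyword_targets)
    return caption_loss + keyword_weight * keyword_loss

loss = multitask_loss(
    torch.randn(4, 20, 1000), torch.randint(1, 1000, (4, 20)),
    torch.randn(4, 300), torch.randint(0, 2, (4, 300)).float(),
)
```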

Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval

Experimental results show that the proposed method succeeded in using a pre-trained language model for audio captioning, and the oracle performance of the pre-trained-model-based caption generator was clearly better than that of the conventional method trained from scratch.
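
The guidance step can be pictured as retrieving captions of acoustically similar training clips and handing them to the pre-trained language model as context. The cosine-similarity retrieval below is an assumption about the general mechanism, not the paper's exact pipeline, and all names and sizes are illustrative.

```python
# Sketch: retrieve captions of acoustically similar training clips to guide
# a pretrained language model (cosine similarity; illustrative, not exact).
import torch
import torch.nn.functional as F

def retrieve_similar_captions(query_emb, train_embs, train_captions, k=3):
    """query_emb: (dim,); train_embs: (n_clips, dim); returns k nearest captions."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), train_embs, dim=-1)
    top = sims.topk(k).indices.tolist()
    return [train_captions[i] for i in top]

train_embs = torch.randn(1000, 512)                     # precomputed clip embeddings
train_captions = [f"caption {i}" for i in range(1000)]  # their reference captions
similar = retrieve_similar_captions(torch.randn(512), train_embs, train_captions)
prompt = " ".join(similar)   # used as context/prefix for the pre-trained LM decoder
```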

THE DCASE 2021 CHALLENGE TASK 6 SYSTEM: AUTOMATED AUDIO CAPTIONING WITH WEAKLY SUPERVISED PRE-TRAINING AND WORD SELECTION METHODS Technical Report

This technical report describes the system participating in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Challenge, Task 6: automated audio captioning.

Audio Caption: Listen and Tell

  • Mengyue Wu, Heinrich Dinkel, Kai Yu
  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
A manually-annotated dataset for audio caption is introduced to automatically generate natural sentences for audio scene description and to bridge the gap between machine perception of audio and image.

Audio Captioning using Gated Recurrent Units

A novel deep network architecture with audio embeddings is presented to predict audio captions, and the experimental results show that the proposed BiGRU-based deep model outperforms state-of-the-art results.

AUDIO CAPTIONING BASED ON TRANSFORMER AND PRE-TRAINING FOR 2020 DCASE AUDIO CAPTIONING CHALLENGE Technical Report

A sequence-to-sequence model is proposed which consists of a CNN encoder and a Transformer decoder and achieves a SPIDEr score of 0.227 for audio captioning.
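
A compact sketch of such a CNN-encoder/Transformer-decoder pairing over log-mel input is given below; all layer sizes, vocabulary size, and pooling factors are assumptions for illustration.

```python
# Compact sketch of a CNN encoder + Transformer decoder for captioning
# over log-mel spectrogram input (layer sizes are illustrative).
import torch
import torch.nn as nn

class CnnTransformerCaptioner(nn.Module):
    def __init__(self, n_mels=64, d_model=256, vocab=5000):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 2)),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 2)),
        )
        self.to_model = nn.Linear(64 * (n_mels // 4), d_model)
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, mel, tokens):
        # mel: (batch, time, n_mels); tokens: (batch, steps) caption prefix
        h = self.cnn(mel.unsqueeze(1))                  # (batch, 64, time/4, n_mels/4)
        h = h.permute(0, 2, 1, 3).flatten(2)            # (batch, time/4, 64*n_mels/4)
        memory = self.to_model(h)
        n = tokens.size(1)                              # causal mask over caption steps
        causal = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        dec = self.decoder(self.embed(tokens), memory, tgt_mask=causal)
        return self.out(dec)                            # (batch, steps, vocab)

model = CnnTransformerCaptioner()
logits = model(torch.randn(2, 200, 64), torch.randint(0, 5000, (2, 12)))
```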

Investigating Local and Global Information for Automated Audio Captioning with Transfer Learning

A topic model for audio descriptions is proposed, comprehensively analyzing the hierarchical audio topics that are commonly covered and it is discovered that local information and abstract representation learning are more crucial to AAC than global information and temporal relationship learning.

Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning

This work presents an approach that explicitly takes advantage of the difference in length between the input and output sequences by applying temporal sub-sampling to the audio input sequence within a sequence-to-sequence method.
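
The sub-sampling step itself can be illustrated as simple average pooling along the time axis of the audio feature sequence; the pooling factor and feature dimensions below are assumptions, not the paper's settings.

```python
# Sketch of temporal sub-sampling of an audio feature sequence
# (average pooling over time; factor and shapes are illustrative).
import torch
import torch.nn as nn

def subsample_time(features, factor=2):
    """features: (batch, time, dim) -> (batch, ceil(time/factor), dim)."""
    return nn.functional.avg_pool1d(
        features.transpose(1, 2), kernel_size=factor, stride=factor, ceil_mode=True
    ).transpose(1, 2)

x = torch.randn(4, 431, 64)        # e.g. ~10 s of log-mel frames
print(subsample_time(x).shape)     # torch.Size([4, 216, 64])
```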

COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations

The results are promising, sometimes on par with the state of the art in the considered tasks, and the embeddings produced with the method are well correlated with some acoustic descriptors.
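
The co-alignment idea can be sketched as two autoencoders, one over spectrogram patches and one over tag vectors, trained with reconstruction losses plus a term pulling the paired latent codes together. The simple MSE alignment term below is a simplification (COALA itself uses a contrastive formulation), and all dimensions are assumptions.

```python
# Sketch of co-aligned autoencoders: reconstruct each modality and pull the
# paired latent codes together (MSE alignment here; COALA itself uses a
# contrastive formulation, so this is a simplified assumption).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AE(nn.Module):
    def __init__(self, in_dim, latent=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(), nn.Linear(256, in_dim))

    def forward(self, x):
        z = self.enc(x)
        return z, self.dec(z)

audio_ae, tag_ae = AE(in_dim=4096), AE(in_dim=1000)     # spectrogram patch / tag vector
spec, tags = torch.randn(8, 4096), torch.rand(8, 1000)  # paired training examples

z_a, spec_rec = audio_ae(spec)
z_t, tags_rec = tag_ae(tags)
loss = (F.mse_loss(spec_rec, spec) + F.mse_loss(tags_rec, tags)
        + F.mse_loss(z_a, z_t))                          # latent alignment term
loss.backward()
```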