A Transformer-based Audio Captioning Model with Keyword Estimation

@article{Koizumi2020ATA,
  title={A Transformer-based Audio Captioning Model with Keyword Estimation},
  author={Yuma Koizumi and Ryo Masumura and Kyosuke Nishida and Masahiro Yasuda and Shoichiro Saito},
  journal={ArXiv},
  year={2020},
  volume={abs/2007.00222}
}
One of the problems with automated audio captioning (AAC) is the indeterminacy of word selection for a given audio event/scene. Since a single acoustic event/scene can be described with several different words, the result is a combinatorial explosion of possible captions, which makes training difficult. To solve this problem, we propose a Transformer-based audio-captioning model with keyword estimation, called TRACKE. It simultaneously solves the word-selection indeterminacy problem with the main task of… 
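To make the idea concrete, below is a rough PyTorch sketch of such a model: a Transformer encoder-decoder captioner whose encoder output also feeds a multi-label keyword head, with the embeddings of the top-scoring keywords appended to the memory that the caption decoder attends to. All module names, dimensions, and the exact way the estimated keywords condition the decoder are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class TRACKESketch(nn.Module):
        # Hypothetical sketch: Transformer captioner with a keyword-estimation branch.
        def __init__(self, n_mels=64, d_model=256, vocab_size=5000, n_keywords=500):
            super().__init__()
            self.audio_proj = nn.Linear(n_mels, d_model)           # frame-wise audio features -> model dim
            self.encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
            self.keyword_head = nn.Linear(d_model, n_keywords)     # multi-label keyword logits (BCE-trained)
            self.keyword_emb = nn.Embedding(n_keywords, d_model)   # embeddings of the estimated keywords
            self.word_emb = nn.Embedding(vocab_size, d_model)
            self.decoder = nn.TransformerDecoder(
                nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
            self.out = nn.Linear(d_model, vocab_size)

        def forward(self, audio, caption_in, k=5):
            # audio: (B, T, n_mels); caption_in: (B, L) token ids; causal mask omitted for brevity
            h = self.encoder(self.audio_proj(audio))                 # (B, T, d_model)
            kw_logits = self.keyword_head(h.mean(dim=1))             # clip-level keyword logits (B, n_keywords)
            topk = kw_logits.topk(k, dim=-1).indices                 # indices of the k most likely keywords
            memory = torch.cat([h, self.keyword_emb(topk)], dim=1)   # decoder attends to audio frames + keywords
            y = self.decoder(self.word_emb(caption_in), memory)
            return self.out(y), kw_logits                            # caption logits and keyword logits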

Citations

A Comprehensive Survey of Automated Audio Captioning

The paper positions itself as a comprehensive survey of automated audio captioning, covering benchmark datasets, existing deep learning techniques, and evaluation metrics.

Automated audio captioning: an overview of recent progress and new challenges

A comprehensive review of published contributions in automated audio captioning is presented, ranging from existing approaches to evaluation metrics and datasets.

Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning

This paper utilizes the freely available Clotho dataset to compare four different pre-trained machine listening models, four word embedding models, and their combinations in many different settings and suggests that YAMNet combined with BERT embeddings produces the best captions.

The NTT DCASE2020 Challenge Task 6 system: Automated Audio Captioning with Keywords and Sentence Length Estimation

This technical report describes the system participating in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge, Task 6: automated audio captioning, and tests a simplified model of the system on the development-testing dataset.

Investigations in Audio Captioning: Addressing Vocabulary Imbalance and Evaluating Suitability of Language-Centric Performance Metrics

This work identifies and addresses three main challenges in automated audio captioning: i) data scarcity, ii) imbalance or limitations in the vocabulary of audio captions, and iii) choosing a performance evaluation metric that best captures both auditory and semantic characteristics.

Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

This work proposes visually-aware audio captioning, which makes use of visual information to help recognize ambiguous sounding objects, and proposes an audio-visual attention mechanism that integrates audio and visual information adaptively according to their confidence levels.
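As a rough illustration of how such adaptive fusion can look, the snippet below gates visual features by a learned confidence score before mixing them with audio features; the gating network and feature shapes are assumptions for illustration, not the paper's actual mechanism.

    import torch
    import torch.nn as nn

    class GatedAVFusion(nn.Module):
        # Illustrative confidence-gated fusion of audio and visual features (assumed form).
        def __init__(self, d=256):
            super().__init__()
            self.gate = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 1), nn.Sigmoid())

        def forward(self, audio_feat, visual_feat):
            # audio_feat, visual_feat: (B, d); the gate acts as a learned "confidence" in the visual stream
            g = self.gate(torch.cat([audio_feat, visual_feat], dim=-1))   # (B, 1), in [0, 1]
            return g * visual_feat + (1.0 - g) * audio_feat               # adaptive mix of the two modalities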

GCT: Gated Contextual Transformer for Sequential Audio Tagging

A gated contextual Transformer (GCT) with forward-backward inference (FBI) block is proposed in GCT to improve the performance of cTransformer structurally and to promote research on SAT, the manually annotated sequential labels for the two datasets are released.

Automated Audio Captioning via Fusion of Low- and High-Dimensional Features

A probabilistic fusion approach is proposed that improves the overall performance of the system by exploiting the respective advantages of the two Transformer decoders, achieving the best performance on the Clotho and AudioCaps datasets.
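A minimal sketch of probability-level fusion of two caption decoders is shown below; the fixed mixing weight and tensor shapes are assumptions for illustration only, not the paper's exact scheme.

    import torch

    def fuse_decoder_probs(logits_a, logits_b, w=0.5):
        # logits_a, logits_b: (B, L, V) per-step vocabulary logits from the two decoders.
        # The mixing weight w would normally be tuned on validation data.
        p_a = torch.softmax(logits_a, dim=-1)
        p_b = torch.softmax(logits_b, dim=-1)
        return w * p_a + (1.0 - w) * p_b   # convex combination remains a valid distribution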

iCNN-Transformer: An improved CNN-Transformer with Channel-spatial Attention and Keyword Prediction for Automated Audio Captioning

The results show that guiding caption generation with multi-level information extracted from the audio clip can significantly improve the scores of various evaluation metrics and achieve state-of-the-art performance in the cross-entropy training stage.

Event-related data conditioning for acoustic event classification

It is shown that self-attention may over-enhance certain segments of audio representations and smooth out the boundaries between event representations and background noise.

References

Showing 1-10 of 38 references

Attention is All you Need

A new, simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by its successful application to English constituency parsing with both large and limited training data.
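The core operation introduced by that paper is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal implementation is sketched below; tensor shapes and the optional mask argument follow common convention rather than any particular library's API.

    import math
    import torch

    def scaled_dot_product_attention(q, k, v, mask=None):
        # q: (..., L_q, d_k), k: (..., L_k, d_k), v: (..., L_k, d_v)
        d_k = q.size(-1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)           # (..., L_q, L_k) similarity scores
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))   # block disallowed positions
        return torch.softmax(scores, dim=-1) @ v                    # attention-weighted sum of values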

Clotho: an Audio Captioning Dataset

Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds in duration and 24,905 captions of eight to 20 words in length, is presented, together with a baseline method that provides initial results.

AudioCaps: Generating Captions for Audios in The Wild

A large-scale dataset of 46K audio clips paired with human-written captions, collected via crowdsourcing on the AudioSet dataset, is contributed, and two novel components that help improve audio captioning performance are proposed: a top-down multi-scale encoder and aligned semantic attention.

Metrics for Polyphonic Sound Event Detection

This paper presents and discusses various metrics proposed for the evaluation of polyphonic sound event detection systems used in realistic situations, where there are typically multiple sound sources.
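A simplified sketch of the widely used segment-based error rate and F-score from this line of work is given below; it operates on per-segment binary activity matrices and omits the details of the full evaluation protocol (segment lengths, class-wise averaging, etc.).

    import numpy as np

    def segment_based_metrics(ref, est):
        # ref, est: binary arrays of shape (n_segments, n_event_classes) marking active events.
        tp = np.logical_and(ref == 1, est == 1).sum(axis=1)   # correct detections per segment
        fp = np.logical_and(ref == 0, est == 1).sum(axis=1)   # false alarms per segment
        fn = np.logical_and(ref == 1, est == 0).sum(axis=1)   # missed events per segment
        subs = np.minimum(fn, fp)                             # substitutions: a miss paired with a false alarm
        dels = np.maximum(0, fn - fp)                         # remaining misses
        ins = np.maximum(0, fp - fn)                          # remaining false alarms
        n_ref = ref.sum(axis=1)                               # active reference events per segment
        error_rate = (subs.sum() + dels.sum() + ins.sum()) / max(n_ref.sum(), 1)
        f_score = 2 * tp.sum() / max(2 * tp.sum() + fp.sum() + fn.sum(), 1)
        return error_rate, f_score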

Polyphonic sound event detection using multi label deep neural networks

Frame-wise spectral-domain features are used as inputs to train a deep neural network for multi-label classification, and the proposed method improves the accuracy by 19 percentage points overall.
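The frame-wise multi-label setup can be pictured as in the sketch below: a feed-forward network with one independent (sigmoid) output per event class, trained with binary cross-entropy so that several events can be active in the same frame. Layer sizes and feature dimensions here are assumptions, not the paper's configuration.

    import torch
    import torch.nn as nn

    n_features, n_events = 40, 10        # e.g. 40 spectral features per frame, 10 event classes (assumed)
    model = nn.Sequential(
        nn.Linear(n_features, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, n_events),        # raw logits; the sigmoid is applied inside the loss
    )
    criterion = nn.BCEWithLogitsLoss()   # multi-label objective over per-frame event targets

    frames = torch.randn(32, n_features)                       # a batch of 32 feature frames
    targets = torch.randint(0, 2, (32, n_events)).float()      # binary event activity per frame
    loss = criterion(model(frames), targets)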

Acoustic event detection in real life recordings

A system for acoustic event detection in recordings from real-life environments using a network of hidden Markov models is presented; it is capable of recognizing almost one third of the events, although the temporal positioning of the events is incorrect 84% of the time.

Acoustic Scene Classification: Classifying environments from the sounds they produce

An account of the state of the art in acoustic scene classification (ASC), the task of classifying environments from the sounds they produce, is given, together with a range of different algorithms submitted to a data challenge that provides a general and fair benchmark for ASC techniques.

CNN Architectures for Large-Scale Audio Classification

  • Proc. of Int'l Conf. on Acoust., Speech, and Signal Process. (ICASSP), 2017.

Sound Event Detection by Multitask Learning of Sound Events and Scenes with Soft Scene Labels

A new method for sound event detection (SED) based on multitask learning (MTL) of SED and acoustic scene classification (ASC) using soft labels of acoustic scenes is proposed; the soft labels make it possible to model the extent to which sound events and scenes are related.
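A minimal sketch of such a multitask objective is given below, assuming a multi-label binary cross-entropy term for sound events plus a cross-entropy term against soft scene-label distributions; the weighting factor and the form of the heads are assumptions, not the paper's exact formulation.

    import torch.nn.functional as F

    def multitask_loss(event_logits, event_targets, scene_logits, soft_scene_targets, alpha=0.5):
        # event_logits/targets: (B, n_events); scene_logits/soft targets: (B, n_scenes)
        sed_loss = F.binary_cross_entropy_with_logits(event_logits, event_targets)
        # Cross-entropy against a *soft* target distribution over scene classes
        asc_loss = -(soft_scene_targets * F.log_softmax(scene_logits, dim=-1)).sum(dim=-1).mean()
        return sed_loss + alpha * asc_loss   # alpha balances the two tasks (assumed hyperparameter)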

The NTT DCASE2020 Challenge Task 6 system: Automated Audio Captioning with Keywords and Sentence Length Estimation

This technical report describes the system participating in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge, Task 6: automated audio captioning, and tests a simplified model of the system on the development-testing dataset.