A Transformer-based Audio Captioning Model with Keyword Estimation

@article{Koizumi2020ATA,
  title={A Transformer-based Audio Captioning Model with Keyword Estimation},
  author={Yuma Koizumi and Ryo Masumura and Kyosuke Nishida and Masahiro Yasuda and Shoichiro Saito},
  journal={ArXiv},
  year={2020},
  volume={abs/2007.00222}
}
One of the problems in automated audio captioning (AAC) is the indeterminacy of word selection for a given audio event/scene: since one acoustic event/scene can be described with several different words, the set of possible captions grows combinatorially, which makes training difficult. To solve this problem, we propose a Transformer-based audio captioning model with keyword estimation called TRACKE. It simultaneously solves the word-selection indeterminacy problem with the main task of…
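
The abstract is truncated here, but the core idea (caption generation as the main task, keyword estimation as an auxiliary subtask) can be made concrete. The following is a minimal, hypothetical PyTorch sketch of that general architecture, not the authors' TRACKE implementation; the class name, dimensions, mean-pooling over time, and the omission of positional encodings are all illustrative assumptions.

import torch
import torch.nn as nn

class KeywordAssistedCaptioner(nn.Module):
    """Hypothetical sketch of a Transformer captioner with a keyword head."""
    def __init__(self, n_mels=64, d_model=128, n_heads=4, n_layers=2,
                 vocab_size=5000, n_keywords=300):
        super().__init__()
        self.audio_proj = nn.Linear(n_mels, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers)
        # Auxiliary subtask: multi-label keyword logits from time-pooled features.
        self.keyword_head = nn.Linear(d_model, n_keywords)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers)
        self.word_head = nn.Linear(d_model, vocab_size)

    def forward(self, mel, tokens):
        # mel: (B, T, n_mels) log-mel frames; tokens: (B, L) caption token ids.
        h = self.encoder(self.audio_proj(mel))              # (B, T, d_model)
        kw_logits = self.keyword_head(h.mean(dim=1))        # (B, n_keywords)
        L = tokens.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        y = self.decoder(self.token_emb(tokens), h, tgt_mask=causal)
        return self.word_head(y), kw_logits

# Joint training: cross-entropy for next-word prediction plus binary
# cross-entropy for the keyword subtask (dummy keyword targets here).
model = KeywordAssistedCaptioner()
mel, tokens = torch.randn(2, 500, 64), torch.randint(0, 5000, (2, 12))
word_logits, kw_logits = model(mel, tokens)
loss = (nn.functional.cross_entropy(word_logits[:, :-1].reshape(-1, 5000),
                                    tokens[:, 1:].reshape(-1))
        + nn.functional.binary_cross_entropy_with_logits(
            kw_logits, torch.zeros_like(kw_logits)))
loss.backward()

The design intuition is that the keyword head gives the encoder a lower-entropy auxiliary target (which content words are relevant) than full caption generation, which is how keyword estimation can ease the word-selection indeterminacy described above.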

Citations

A Comprehensive Survey of Automated Audio Captioning
TLDR
This paper provides a comprehensive review covering the benchmark datasets, existing deep learning techniques, and evaluation metrics in automated audio captioning.
Automated Audio Captioning: an Overview of Recent Progress and New Challenges
TLDR
This paper presents a comprehensive review of the published contributions in automated audio captioning, from a variety of existing approaches to evaluation metrics and datasets, and discusses open challenges and envisages possible future research directions.
Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning
TLDR
This paper utilizes the freely available Clotho dataset to compare four different pre-trained machine listening models, four word embedding models, and their combinations in many different settings and suggests that YAMNet combined with BERT embeddings produces the best captions.
The NTT DCASE2020 Challenge Task 6 system: Automated Audio Captioning with Keywords and Sentence Length Estimation
TLDR
This technical report describes the system participating in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge, Task 6: automated audio captioning, and tests a simplified model of the system using the development-testing dataset.
Automated Audio Captioning using Audio Event Clues
TLDR
Results of the extensive experiments show that using audio event labels with the acoustic features improves the recognition performance, and the proposed method either outperforms or achieves competitive results with the state-of-the-art models.
Automatic Audio Captioning using Attention weighted Event based Embeddings
TLDR
An encoder-decoder architecture with lightweight Bi-LSTM recurrent layers for AAC, together with evidence that the non-uniform attention-weighted encoding generated as part of this model helps the decoder glance over specific sections of the audio while generating each token.
BYOL for Audio: Exploring Pre-trained General-purpose Audio Representations
TLDR
It is hypothesized that representations effective for general audio tasks should provide multiple aspects of robust features of the input sound, and a self-supervised learning method is proposed: Bootstrap Your Own Latent for Audio (BYOL-A, pronounced "viola"); a minimal sketch of the underlying BYOL objective appears after this list.
Caption Feature Space Regularization for Audio Captioning
TLDR
A two-stage framework for audio captioning is proposed: the first stage constructs a proxy feature space to reduce the distances between captions correlated with the same audio, and the second stage uses the proxy feature space as additional supervision to encourage the model to be optimized in a direction that benefits all the correlated captions.
Leveraging Pre-trained BERT for Audio Captioning
TLDR
PANNs are applied as the encoder and the decoder is initialized from the publicly available pre-trained BERT models for audio captioning; these models achieve results competitive with existing audio captioning methods on the AudioCaps dataset.
Local Information Assisted Attention-free Decoder for Audio Captioning
TLDR
This paper presents an AAC method with an attention-free decoder, where an encoder based on PANNs is employed for audio feature extraction, and the attention-free decoder is designed to introduce local information.
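
As a concrete point of reference for the BYOL-A entry above, here is a minimal sketch of the BYOL objective that BYOL-A adapts to audio. This is a generic illustration of BYOL's negative-cosine loss, not code from any of the papers listed; the function name and tensor shapes are assumptions.

import torch
import torch.nn.functional as F

def byol_loss(online_pred, target_proj):
    # BYOL minimizes 2 - 2*cos(p, z): pull the online network's prediction p
    # toward the target network's projection z, with a stop-gradient on z.
    # (In BYOL/BYOL-A, the target network's weights are an exponential moving
    # average of the online network's weights.)
    p = F.normalize(online_pred, dim=-1)
    z = F.normalize(target_proj.detach(), dim=-1)
    return (2.0 - 2.0 * (p * z).sum(dim=-1)).mean()

# Example: a batch of 8 embeddings (dim 256) from two augmented views of audio.
loss = byol_loss(torch.randn(8, 256), torch.randn(8, 256))

Because the loss uses only positive pairs (two augmented views of the same clip) plus the stop-gradient/moving-average asymmetry, no negative samples are required.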
