Neural Audio Captioning Based on Conditional Sequence-to-Sequence Model

@inproceedings{Ikawa2019NeuralAC,
  title={Neural Audio Captioning Based on Conditional Sequence-to-Sequence Model},
  author={Shota Ikawa and Kunio Kashino},
  booktitle={DCASE},
  year={2019}
}
We propose an audio captioning system that describes non-speech audio signals in the form of natural language. Unlike existing systems, this system can generate a sentence describing sounds, rather than an object label or onomatopoeia. This allows the description to include more information, such as how the sound is heard and how its tone or volume changes over time, and it can also accommodate unknown sounds. A major problem in realizing this capability is that the validity of the description depends… 
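The abstract is truncated above, but the architecture it describes is a sequence-to-sequence model whose decoder is steered by an additional conditioning input. Below is a minimal sketch of such a conditional captioner, assuming log-mel spectrogram frames as input and a single scalar conditioning value (e.g., a desired level of caption detail); both the feature choice and the form of the conditioning signal are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn

class ConditionalCaptioner(nn.Module):
    """Encoder-decoder captioner whose decoder also sees a conditioning value."""
    def __init__(self, n_mels=64, vocab_size=5000, hidden=256, emb=128):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, batch_first=True, bidirectional=True)
        self.bridge = nn.Linear(2 * hidden, hidden)      # merge the two encoder directions
        self.embed = nn.Embedding(vocab_size, emb)
        self.decoder = nn.LSTM(emb + 1, hidden, batch_first=True)  # word embedding + condition
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, mel, tokens, cond):
        # mel: (B, T, n_mels) log-mel frames; tokens: (B, L) teacher-forced caption;
        # cond: (B, 1) conditioning value (placeholder for whatever signal the model is conditioned on)
        _, (h, _) = self.encoder(mel)
        h0 = torch.tanh(self.bridge(torch.cat([h[0], h[1]], dim=-1))).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        step_cond = cond.unsqueeze(1).expand(-1, tokens.size(1), -1)  # repeat per decoding step
        dec_in = torch.cat([self.embed(tokens), step_cond], dim=-1)
        dec_out, _ = self.decoder(dec_in, (h0, c0))
        return self.out(dec_out)                         # (B, L, vocab_size) logits

# Example: 8 clips of 200 frames, 12-token captions, one conditioning value per clip.
model = ConditionalCaptioner()
logits = model(torch.randn(8, 200, 64), torch.randint(0, 5000, (8, 12)), torch.rand(8, 1))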

Citations

AUDIO CAPTIONING BASED ON TRANSFORMER AND PRE-TRAINING FOR 2020 DCASE AUDIO CAPTIONING CHALLENGE Technical Report
TLDR
A sequence-to-sequence model is proposed which consists of a CNN encoder and a Transformer decoder and can achieve a SPIDEr score of 0.227 on audio captioning performance (a generic sketch of this encoder-decoder pattern appears at the end of this list).
A Transformer-based Audio Captioning Model with Keyword Estimation
TLDR
A Transformer-based audio-captioning model with keyword estimation, called TRACKE, which simultaneously solves the word-selection indeterminacy problem in the main task of AAC while executing the sub-task of acoustic event detection/acoustic scene classification (i.e., keyword estimation).
Audio Captioning Based on Transformer and Pre-Trained CNN
TLDR
This paper proposes a solution for automated audio captioning based on a combination of pre-trained CNN layers and a Transformer-based sequence-to-sequence architecture, which achieves a SPIDEr score of 0.227 on DCASE 2020 Challenge Task 6 with data augmentation and label smoothing applied.
Audio Captioning Transformer
TLDR
An Audio Captioning Transformer (ACT) is proposed: a full Transformer network based on an encoder-decoder architecture that is totally convolution-free, giving it a better ability to model the global information within an audio signal and to capture temporal relationships between audio events.
A Comprehensive Survey of Automated Audio Captioning
TLDR
This paper provides a comprehensive review covering the benchmark datasets, existing deep learning techniques, and evaluation metrics in automated audio captioning.
Automated Audio Captioning: an Overview of Recent Progress and New Challenges
TLDR
A comprehensive review of the published contributions in automated audio captioning, covering existing approaches, evaluation metrics, and datasets, that also discusses open challenges and envisages possible future research directions.
Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval
TLDR
Experimental results show that the proposed method succeeded in using a pre-trained language model for audio captioning, and that the oracle performance of the pre-trained-model-based caption generator was clearly better than that of the conventional method trained from scratch.
MusCaps: Generating Captions for Music Audio
TLDR
This work presents the first music audio captioning model, MusCaps, consisting of an encoder-decoder with temporal attention; it represents a shift away from classification-based music description and combines tasks requiring both auditory and linguistic understanding to bridge the semantic gap in music information retrieval.
The NTT DCASE2020 Challenge Task 6 system: Automated Audio Captioning with Keywords and Sentence Length Estimation
TLDR
This technical report describes the system participating in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge, Task 6: automated audio captioning, and tests a simplified model of the system using the development-testing dataset.
Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning
TLDR
This paper utilizes the freely available Clotho dataset to compare four different pre-trained machine listening models, four word embedding models, and their combinations in many different settings and suggests that YAMNet combined with BERT embeddings produces the best captions.
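Several of the systems listed above describe the same overall design: a convolutional encoder over the spectrogram feeding a Transformer decoder that generates the caption. The sketch below illustrates that generic pattern only; the layer sizes, pooling scheme, and omission of positional encodings are simplifying assumptions and do not reproduce any particular system above.

import torch
import torch.nn as nn

class CnnTransformerCaptioner(nn.Module):
    """Generic CNN-encoder / Transformer-decoder captioner (positional encodings omitted)."""
    def __init__(self, n_mels=64, vocab_size=5000, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        # Pool only along frequency so the time axis survives as the decoder's memory sequence.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 2)),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 2)),
        )
        self.proj = nn.Linear(64 * (n_mels // 4), d_model)
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, mel, tokens):
        # mel: (B, T, n_mels) spectrogram; tokens: (B, L) teacher-forced caption prefix
        feat = self.cnn(mel.unsqueeze(1))                      # (B, C, T, n_mels // 4)
        b, c, t, f = feat.shape
        memory = self.proj(feat.permute(0, 2, 1, 3).reshape(b, t, c * f))
        causal = torch.triu(torch.full((tokens.size(1),) * 2, float("-inf")), diagonal=1)
        dec = self.decoder(self.embed(tokens), memory, tgt_mask=causal)
        return self.out(dec)                                   # (B, L, vocab_size) logits

model = CnnTransformerCaptioner()
logits = model(torch.randn(4, 200, 64), torch.randint(0, 5000, (4, 15)))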

References

Showing 1-10 of 25 references
Generating Sound Words from Audio Signals of Acoustic Events with Sequence-to-Sequence Model
  • Shota Ikawa, K. Kashino
  • 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018
TLDR
The method is based on an end-to-end, sequence-to-sequence framework that solves the audio segmentation problem, i.e., finding the segment of the audio signal along time that corresponds to a sequence of phonemes, and the ambiguity problem, where multiple words may correspond to the same sound depending on the situation or listener.
Show and tell: A neural image caption generator
TLDR
This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.
Sequence to Sequence Learning with Neural Networks
TLDR
This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions about the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short-term dependencies between the source and the target sentence, which made the optimization problem easier.
Acoustic event search with an onomatopoeic query: measuring distance between onomatopoeic words and sounds
TLDR
Both audio signals and onomatopoeias are mapped into a common space, allowing the distance between them to be measured directly; experiments confirm that users preferred the audio signals obtained with this approach to those obtained with a text-based similarity search.
Semantically Conditioned LSTM-based Natural Language Generation for Spoken Dialogue Systems
TLDR
A statistical language generator based on a semantically controlled Long Short-Term Memory (LSTM) structure that can learn from unaligned data by jointly optimising sentence planning and surface realisation with a simple cross-entropy training criterion; language variation can be achieved simply by sampling from the output candidates.
Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019)
TLDR
The proposed SED system is compared against the state-of-the-art mono-channel method on the development subset of the TUT Sound Events Detection 2016 database, and the use of spatial and harmonic features is shown to improve SED performance.
A Persona-Based Neural Conversation Model
TLDR
This work presents persona-based models for handling the issue of speaker consistency in neural response generation that yield qualitative performance improvements in both perplexity and BLEU scores over baseline sequence-to-sequence models.
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
TLDR
Qualitatively, the proposed RNN Encoder-Decoder model learns a semantically and syntactically meaningful representation of linguistic phrases.
Polyphonic sound event detection using multi label deep neural networks
TLDR
Frame-wise spectral-domain features are used as inputs to train a deep neural network for multi-label classification in this work, and the proposed method improves the accuracy by 19 percentage points overall.
Recurrent neural network based language model
TLDR
Results indicate that it is possible to obtain around a 50% reduction in perplexity by using a mixture of several RNN LMs, compared to a state-of-the-art backoff language model.
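The last entry reports perplexity reductions obtained by mixing several language models. As a toy illustration of how such a comparison is scored (the probabilities below are made up, not figures from the paper), one can interpolate per-token probabilities from two models and compare the resulting perplexities:

import math

def perplexity(token_probs):
    # Perplexity is the exponential of the average negative log-likelihood per token.
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

# Hypothetical per-token probabilities assigned to the same sentence by three models.
backoff_lm = [0.05, 0.10, 0.02, 0.08]   # baseline n-gram model
rnn_lm_a = [0.12, 0.20, 0.05, 0.15]
rnn_lm_b = [0.10, 0.25, 0.04, 0.18]

# Equal-weight linear interpolation of the two RNN LMs stands in for the mixture.
mixture = [0.5 * a + 0.5 * b for a, b in zip(rnn_lm_a, rnn_lm_b)]

print(perplexity(backoff_lm), perplexity(mixture))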