Automated audio captioning with recurrent neural networks

@article{Drossos2017AutomatedAC,
  title={Automated audio captioning with recurrent neural networks},
  author={Konstantinos Drossos and Sharath Adavanne and Tuomas Virtanen},
  journal={2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
  year={2017},
  pages={374--378}
}
We present the first approach to automated audio captioning. We employ an encoder-decoder scheme with an alignment model in between. The input to the encoder is a sequence of log mel-band energies calculated from an audio file, while the output is a sequence of words, i.e. a caption. The encoder is a multi-layered, bi-directional gated recurrent unit (GRU) and the decoder a multi-layered GRU with a classification layer connected to the last GRU of the decoder. The classification layer and the… 
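The encoder and decoder described above are built from GRU layers. As a rough illustration of the recurrence involved (not the authors' implementation; all weight shapes, initializations, and toy sizes below are invented for this sketch), a single GRU time step over mel-band-like frames can be written in NumPy:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, params):
    """One GRU time step (Cho et al., 2014): update gate z, reset gate r."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z = sigmoid(Wz @ x + Uz @ h + bz)               # update gate
    r = sigmoid(Wr @ x + Ur @ h + br)               # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h) + bh)   # candidate state
    return (1.0 - z) * h + z * h_tilde              # interpolated new state

rng = np.random.default_rng(0)
n_in, n_hid = 40, 8   # e.g. 40 log mel bands -> 8 hidden units (toy sizes)
# parameter order: Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh
params = [rng.standard_normal(s) * 0.1
          for s in [(n_hid, n_in), (n_hid, n_hid), (n_hid,)] * 3]
h = np.zeros(n_hid)
for x in rng.standard_normal((5, n_in)):  # 5 audio frames
    h = gru_step(x, h, params)
```

A bi-directional encoder, as in the paper, would run a second such recurrence over the reversed frame sequence and combine the two hidden-state streams.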


AUTOMATED AUDIO CAPTIONING
TLDR: A method based on a modified encoder-decoder architecture for the automated audio captioning task; the impact of augmentations (MixUp, Reverb, Pitch, Over-drive, Speed) on method performance is examined.
Audio Captioning using Gated Recurrent Units
TLDR: A novel deep network architecture with audio embeddings is presented to predict audio captions; experimental results show that the proposed BiGRU-based deep model outperforms state-of-the-art results.
Audio Captioning Transformer
TLDR: An Audio Captioning Transformer (ACT) is proposed: a fully convolution-free Transformer network based on an encoder-decoder architecture, which better models the global information within an audio signal and captures temporal relationships between audio events.
AUTOMATIC AUDIO CAPTIONING SYSTEM BASED ON CONVOLUTIONAL NEURAL NETWORK Technical Report
TLDR: A neural network with a CNN encoder and a GRU decoder demonstrates that CNNs are a viable choice for automated audio captioning and significantly decrease training time.
Automatic Audio Captioning using Attention weighted Event based Embeddings
TLDR: An encoder-decoder architecture with lightweight Bi-LSTM recurrent layers for AAC is presented, along with evidence that its non-uniform attention-weighted encoding lets the decoder glance over specific sections of the audio while generating each token.
Listen Carefully and Tell: An Audio Captioning System Based on Residual Learning and Gammatone Audio Representation
TLDR: An automated audio captioning system based on residual learning in the encoder phase is proposed, which surpasses the baseline system in challenge results.
Audio Captioning Based on Transformer and Pre-Trained CNN
TLDR: A solution for automated audio captioning that combines pre-trained CNN layers with a Transformer-based sequence-to-sequence architecture, achieving a SPIDEr score of 0.227 on DCASE 2020 Challenge Task 6 with data augmentation and label smoothing applied.
Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning
TLDR: An approach that explicitly exploits the difference in length between the input and output sequences by applying temporal sub-sampling to the audio input sequence within a sequence-to-sequence method.
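The temporal sub-sampling summarized above shortens the audio feature sequence before the decoder must attend to it. A toy illustration of the core idea (the stride value and array sizes are invented; the paper applies this between recurrent layers of a learned model):

```python
import numpy as np

def temporal_subsample(H, k=2):
    """Keep every k-th time step of a (T, d) feature sequence,
    shortening the sequence that later layers must process."""
    return H[::k]

H = np.arange(24, dtype=float).reshape(8, 3)  # 8 frames, 3 features each
H2 = temporal_subsample(H, k=2)               # 4 frames remain
```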
Leveraging Pre-trained BERT for Audio Captioning
TLDR: PANNs are applied as the encoder and the decoder is initialized from publicly available pre-trained BERT models for audio captioning; these models achieve results competitive with existing audio captioning methods on the AudioCaps dataset.
An Encoder-Decoder Based Audio Captioning System With Transfer and Reinforcement Learning
TLDR: The results show that the proposed techniques significantly improve the evaluation-metric scores; however, reinforcement learning may adversely affect the quality of the generated captions.

References

SHOWING 1-10 OF 23 REFERENCES
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
TLDR: Qualitatively, the proposed RNN Encoder-Decoder model learns a semantically and syntactically meaningful representation of linguistic phrases.
Show and tell: A neural image caption generator
TLDR: This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.
Translating Videos to Natural Language Using Deep Recurrent Neural Networks
TLDR: This paper proposes to translate videos directly to sentences using a unified deep neural network with both convolutional and recurrent structure, to create sentence descriptions of open-domain videos with large vocabularies.
Beyond caption to narrative: Video captioning with multiple sentences
TLDR: This work attempts to generate video captions that convey richer contents by temporally segmenting the video with action localization, generating multiple captions from multiple frames, and connecting them with natural language processing techniques in order to generate a story-like caption.
Neural Machine Translation by Jointly Learning to Align and Translate
TLDR: It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend it by allowing the model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
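The soft-search described here is the additive (Bahdanau-style) alignment model, the same kind of component the captioning paper places between its encoder and decoder. A minimal NumPy sketch, with all dimensions and weight names invented for illustration:

```python
import numpy as np

def additive_attention(s_prev, H, W, U, v):
    """Bahdanau-style soft alignment: score each encoder state h_j
    against the previous decoder state s_prev, then return the
    softmax-weighted sum of encoder states as the context vector."""
    e = np.tanh(s_prev @ W + H @ U) @ v        # (T,) alignment energies
    a = np.exp(e - e.max()); a /= a.sum()      # softmax attention weights
    return a @ H, a                            # context vector, weights

rng = np.random.default_rng(1)
T, d_enc, d_dec, d_att = 6, 4, 3, 5            # toy sizes
H = rng.standard_normal((T, d_enc))            # encoder states h_1..h_T
s = rng.standard_normal(d_dec)                 # previous decoder state
W = rng.standard_normal((d_dec, d_att))
U = rng.standard_normal((d_enc, d_att))
v = rng.standard_normal(d_att)
ctx, weights = additive_attention(s, H, W, U, v)
```

Because the weights sum to one, the context vector is a convex combination of encoder states, recomputed at every decoding step.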
Image Captioning with Semantic Attention
TLDR: This paper proposes a new algorithm that combines top-down and bottom-up approaches to natural language description through a model of semantic attention, and significantly outperforms the state-of-the-art approaches consistently across different evaluation metrics.
Automatic audio tagging using covariate shift adaptation
TLDR: This work uses a specially designed audio similarity measure as input to a set of weighted logistic regressors, which attempt to alleviate the influence of covariate shift in the acoustic feature space.
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
TLDR: GNMT, Google's Neural Machine Translation system, is presented, which attempts to address many of the weaknesses of conventional phrase-based translation systems and provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models.
Sound event detection using spatial features and convolutional recurrent neural network
TLDR: This paper proposes using low-level spatial features extracted from multichannel audio for sound event detection, and shows that instead of concatenating the features of each channel into a single feature vector, the network learns sound events in multichannel audio better when the channels are presented as separate layers of a volume.
Adam: A Method for Stochastic Optimization
TLDR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
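The adaptive moment estimates mentioned here follow directly from the Adam update rule. A minimal sketch of one update step (the hyperparameter defaults match the paper; the toy quadratic objective below is ours, not from the source):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: biased first/second moment estimates,
    bias correction, then a per-parameter scaled gradient step."""
    m = b1 * m + (1 - b1) * grad            # first moment (mean of grads)
    v = b2 * v + (1 - b2) * grad**2         # second moment (uncentered var)
    m_hat = m / (1 - b1**t)                 # bias-corrected first moment
    v_hat = v / (1 - b2**t)                 # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# toy usage: minimize f(x) = x^2 (gradient 2x) starting from x = 1.0
theta = np.array([1.0]); m = np.zeros(1); v = np.zeros(1)
for t in range(1, 501):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.05)
```

The bias-correction terms matter early in training, when the exponential moving averages are still close to their zero initialization.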