Audio Caption: Listen and Tell

  • Mengyue Wu, Heinrich Dinkel, Kai Yu
  • Published 25 February 2019
  • Computer Science
  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
An increasing amount of research has shed light on machine perception of audio events, most of which concerns detection and classification tasks. […] A baseline encoder-decoder model is provided for both English and Mandarin. Similar BLEU scores are obtained for both languages: the model can generate understandable, data-related captions based on the dataset.
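Since the abstract reports BLEU as its headline metric, a minimal sketch of sentence-level BLEU may help clarify what is being compared. This is an illustrative implementation, not the authors' evaluation code; the epsilon floor used for smoothing is an assumption.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(reference, hypothesis, max_n=4):
    """Minimal BLEU: geometric mean of modified n-gram precisions
    times a brevity penalty. Zero precisions are floored at a tiny
    epsilon so short hypotheses don't collapse the score to 0."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp = ngrams(hypothesis, n)
        ref = ngrams(reference, n)
        overlap = sum((hyp & ref).values())      # clipped n-gram matches
        total = max(sum(hyp.values()), 1)
        precisions.append(max(overlap / total, 1e-9))
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # brevity penalty: punish hypotheses shorter than the reference
    bp = min(1.0, math.exp(1 - len(reference) / max(len(hypothesis), 1)))
    return bp * geo_mean
```

In practice, captioning papers use NLTK's `sentence_bleu` or the COCO caption toolkit; this version only shows the two ingredients, modified n-gram precision and the brevity penalty.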


Audio Caption in a Car Setting with a Sentence-Level Loss

To improve audio captioning within a car scene, a sentence-level loss is proposed for use in tandem with a GRU encoder-decoder model, generating captions with higher semantic similarity to human annotations.

Automated Audio Captioning using Audio Event Clues

Results of the extensive experiments show that using audio event labels together with the acoustic features improves recognition performance, and the proposed method either outperforms or achieves competitive results with state-of-the-art models.

What does a Car-ssette tape tell?

This paper contributes a manually annotated dataset on a car scene, extending a previously published hospital audio captioning dataset, and makes an effort to provide a better objective evaluation metric, namely the BERT similarity score.

Text-to-Audio Grounding: Building Correspondence Between Captions and Sound Events

An Audio-Grounding dataset is contributed, which provides the correspondence between sound events and the captions provided in AudioCaps, along with the location (timestamps) of each present sound event.

A Comprehensive Survey of Automated Audio Captioning

This paper is a comprehensive review covering the benchmark datasets, existing deep learning techniques, and the evaluation metrics in automated audio captioning.

Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning

This paper utilizes the freely available Clotho dataset to compare four different pre-trained machine listening models, four word embedding models, and their combinations in many different settings and suggests that YAMNet combined with BERT embeddings produces the best captions.

Audio Captioning using Gated Recurrent Units

A novel deep network architecture with audio embeddings is presented to predict audio captions, and the experimental results show that the proposed BiGRU-based deep model outperforms state-of-the-art results.

Multi-task Regularization Based on Infrequent Classes for Audio Captioning

This paper proposes two methods to mitigate the class imbalance problem in an autoencoder setting for audio captioning, and defines a multi-label side task based on clip-level content word detection by training a separate decoder.


A method based on a modified encoder-decoder architecture for the automated audio captioning task is presented, and the impact of augmentations (MixUp, Reverb, Pitch, Overdrive, Speed) on its performance is examined.

Audio Captioning Transformer

An Audio Captioning Transformer (ACT) is proposed: a full Transformer network based on an encoder-decoder architecture that is entirely convolution-free, giving it a better ability to model global information within an audio signal and to capture temporal relationships between audio events.



Audio Set: An ontology and human-labeled dataset for audio events

The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.

Large-Scale Weakly Supervised Audio Classification Using Gated Convolutional Neural Network

In this paper, we present a gated convolutional neural network and a temporal attention-based localization method for audio classification, which won 1st place in a large-scale weakly supervised sound event detection challenge.

Automated audio captioning with recurrent neural networks

Results from metrics show that the proposed method can predict words appearing in the original caption, but not always correctly ordered.

Show and tell: A neural image caption generator

This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.

Weakly Supervised Dense Video Captioning

This paper focuses on a novel and challenging vision task, dense video captioning, which aims to automatically describe a video clip with multiple informative and diverse caption sentences, and proposes lexical fully convolutional neural networks with weakly supervised multi-instance multi-label learning to weakly link video regions with lexical labels.

Adaptive Pooling Operators for Weakly Labeled Sound Event Detection

This paper treats SED as a multiple instance learning (MIL) problem, where training labels are static over a short excerpt, indicating the presence or absence of sound sources but not their temporal locality, and develops a family of adaptive pooling operators—referred to as autopool—that smoothly interpolate between common pooling operators and automatically adapt to the characteristics of the sound sources in question.
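The autopool operator summarized above fits in a few lines: under the published formulation, the pooled value is a softmax-weighted average, p = sum_i x_i * softmax(alpha * x)_i, where alpha is a learnable scalar per class. The function below is an illustrative sketch of that formula, not the authors' code.

```python
import math

def autopool(x, alpha):
    """Softmax(alpha * x)-weighted mean of frame-level predictions x.
    alpha = 0       -> unweighted mean pooling
    alpha -> +inf   -> max pooling
    alpha -> -inf   -> min pooling"""
    m = max(x)                                     # subtract max for stability
    w = [math.exp(alpha * (v - m)) for v in x]
    s = sum(w)
    return sum(wi * v for wi, v in zip(w, x)) / s
```

The interpolation is the point of the design: rather than committing to mean or max pooling up front, alpha is learned jointly with the model, so each class can choose how peaky its temporal aggregation should be.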

Multi-Task Video Captioning with Video and Entailment Generation

This work improves video captioning by sharing knowledge with two related directed-generation tasks: a temporally-directed unsupervised video prediction task to learn richer context-aware video encoder representations, and a logically-directed language entailment generation task to learn better video-entailing caption decoder representations.

Adding Chinese Captions to Images

The study reveals to some extent that a computer can master two distinct languages, English and Chinese, at a similar level for describing the visual world.

Jointly Modeling Embedding and Translation to Bridge Video and Language

A novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which can simultaneously explore the learning of LSTM and visual-semantic embedding and outperforms several state-of-the-art techniques in predicting Subject-Verb-Object (SVO) triplets.

CNN architectures for large-scale audio classification

This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on the authors' audio classification task, and larger training and label sets help up to a point.