Corpus ID: 221879035

Effects of Word-frequency based Pre- and Post- Processings for Audio Captioning

@inproceedings{Takeuchi2020EffectsOW,
  title={Effects of Word-frequency based Pre- and Post- Processings for Audio Captioning},
  author={Daiki Takeuchi and Yuma Koizumi and Yasunori Ohishi and Noboru Harada and Kunio Kashino},
  booktitle={DCASE},
  year={2020}
}
The system we used for Task 6 (Automated Audio Captioning) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge combines three elements, namely, data augmentation, multi-task learning, and post-processing, for audio captioning. The system received the highest evaluation scores, but which of the individual elements contributed most to its performance has not yet been clarified. Here, to assess their contributions, we first conducted an element-wise ablation… 
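As background to the title's word-frequency based pre-processing, the snippet below is a minimal sketch of one common form of the idea: caption words rarer than a frequency threshold are mapped to an <unk> token before training the caption decoder. The threshold, function names, and toy data are illustrative assumptions, not the paper's actual settings.

```python
from collections import Counter

def build_vocab(captions, min_freq=2):
    """Keep only words that occur at least `min_freq` times across the
    tokenized training captions (the threshold is a hypothetical value)."""
    counts = Counter(word for caption in captions for word in caption)
    return {word for word, count in counts.items() if count >= min_freq}

def preprocess(caption, vocab, unk="<unk>"):
    """Map out-of-vocabulary (rare) words to the <unk> token so the
    decoder is not trained to emit labels it has almost never seen."""
    return [word if word in vocab else unk for word in caption]

# Toy usage with tokenized captions; a real system would use a dataset
# such as Clotho.
captions = [
    ["a", "dog", "barks", "loudly"],
    ["a", "dog", "barks"],
    ["rain", "falls", "on", "a", "tin", "roof"],
]
vocab = build_vocab(captions, min_freq=2)
print(preprocess(["a", "cat", "barks"], vocab))  # ['a', '<unk>', 'barks']
```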

Citations

Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning
TLDR
This paper utilizes the freely available Clotho dataset to compare four different pre-trained machine listening models, four word embedding models, and their combinations in many different settings; the results suggest that YAMNet combined with BERT embeddings produces the best captions.
Automated Audio Captioning with Weakly Supervised Pre-Training and Word Selection Methods
TLDR
This paper proposes an automated audio captioning solution based on weakly supervised pre-training and word selection methods that achieves the best SPIDEr score of 0.310 in DCASE 2021 Challenge Task 6 (the SPIDEr metric is defined in the note after this list).
WaveTransformer: An Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information
TLDR
This work presents a novel AAC method, explicitly focused on exploiting the temporal and time-frequency patterns in audio, using the widely adopted Transformer decoder and the freely available splits of the Clotho dataset.
An Encoder-Decoder Based Audio Captioning System With Transfer and Reinforcement Learning
TLDR
The results show that the proposed techniques significantly improve the evaluation metric scores; however, reinforcement learning may adversely affect the quality of the generated captions.
Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval
TLDR
Experimental results show that the proposed method succeeds in using a pre-trained language model for audio captioning, and the oracle performance of the pre-trained model-based caption generator was clearly better than that of the conventional method trained from scratch.
Leveraging State-of-the-art ASR Techniques to Audio Captioning
TLDR
Experimental results indicate that the trained models significantly outperform the baseline system of DCASE 2021 Challenge Task 6.
Continual Learning for Automated Audio Captioning Using The Learning Without Forgetting Approach
TLDR
This paper presents a first approach for continuously adapting an AAC method to new information using a continual learning method, achieving a good balance between distilling new knowledge and not forgetting previously learned knowledge.
Audio Captioning Transformer
TLDR
An Audio Captioning Transformer (ACT) is proposed, a full Transformer network based on an encoder-decoder architecture that is totally convolution-free; it better models the global information within an audio signal and captures temporal relationships between audio events.
A Comprehensive Survey of Automated Audio Captioning
TLDR
This paper provides a comprehensive review covering the benchmark datasets, existing deep learning techniques, and the evaluation metrics in automated audio captioning.
Automated Audio Captioning: an Overview of Recent Progress and New Challenges
TLDR
This paper presents a comprehensive review of the published contributions in automated audio captioning, from a variety of existing approaches to evaluation metrics and datasets, and discusses open challenges and envisages possible future research directions.
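For readers unfamiliar with the metric named in the entries above: SPIDEr (Liu et al., 2017) is the arithmetic mean of the SPICE and CIDEr scores, as in this one-liner (the score values are placeholders, not results from any paper listed here).

```python
def spider(spice: float, cider: float) -> float:
    """SPIDEr is defined as the average of the SPICE and CIDEr scores."""
    return 0.5 * (spice + cider)

print(spider(spice=0.10, cider=0.40))  # 0.25 (placeholder values)
```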

References

Showing 1-10 of 31 references
The NTT DCASE2020 Challenge Task 6 system: Automated Audio Captioning with Keywords and Sentence Length Estimation
TLDR
This technical report describes the system participating in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge, Task 6: Automated Audio Captioning, and tests a simplified version of the system on the development-testing dataset.
Pre-training via Paraphrasing
TLDR
It is shown that fine-tuning gives strong performance on a range of discriminative and generative tasks in many languages, making MARGE the most generally applicable pre-training method to date.
Neural Audio Captioning Based on Conditional Sequence-to-Sequence Model
TLDR
An audio captioning system is proposed that describes non-speech audio signals in natural language, generating a sentence describing sounds rather than an object label or onomatopoeia.
Audio Caption: Listen and Tell (Mengyue Wu, Heinrich Dinkel, Kai Yu; ICASSP 2019)
TLDR
A manually annotated dataset for audio captioning is introduced to automatically generate natural sentences for audio scene description and to bridge the gap between machine perception of audio and images.
Automated audio captioning with recurrent neural networks
TLDR
Results from the metrics show that the proposed method can predict words appearing in the original caption, though not always in the correct order.
Clotho: an Audio Captioning Dataset
TLDR
Clotho, an audio captioning dataset consisting of 4981 audio samples of 15 to 30 seconds in duration and 24 905 captions of eight to 20 words in length, is presented, along with a baseline method providing initial results.
AudioCaps: Generating Captions for Audios in The Wild
TLDR
A large-scale dataset of 46K audio clips paired with human-written text, collected via crowdsourcing on the AudioSet dataset, is contributed, and two novel components that help improve audio captioning performance are proposed: the top-down multi-scale encoder and aligned semantic attention.
Augmenting Data with Mixup for Sentence Classification: An Empirical Study
TLDR
Two strategies for adapting Mixup to sentence classification are proposed, one performing interpolation on word embeddings and the other on sentence embeddings; both serve as an effective, domain-independent data augmentation approach for sentence classification (a sketch of the embedding interpolation appears after this reference list).
Crowdsourcing a Dataset of Audio Captions
TLDR
A three-step framework for crowdsourcing an audio captioning dataset is presented, based on concepts and practices followed for the creation of widely used image captioning and machine translation datasets, and results show that the resulting dataset has fewer typographical errors than the initial captions.
Guided Open Vocabulary Image Captioning with Constrained Beam Search
TLDR
This work uses constrained beam search to force the inclusion of selected tag words in the output, and fixed, pretrained word embeddings to facilitate vocabulary expansion to previously unseen tag words, achieving state-of-the-art results for out-of-domain captioning on MSCOCO (and improved results for in-domain captioning).
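Mixup, the data-augmentation technique adapted in the sentence-classification entry above, interpolates pairs of training examples and their labels. Below is a minimal sketch of the word-embedding variant described there; the shapes, alpha value, and variable names are illustrative assumptions.

```python
import numpy as np

def mixup_word_embeddings(emb_a, emb_b, label_a, label_b, alpha=0.2):
    """Interpolate two padded word-embedding sequences and their one-hot
    labels with a Beta-distributed mixing coefficient, as in mixup."""
    lam = np.random.beta(alpha, alpha)
    mixed_emb = lam * emb_a + (1.0 - lam) * emb_b
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed_emb, mixed_label

# Toy usage: two sentences padded to length 8 with 300-dim embeddings.
rng = np.random.default_rng(seed=0)
emb_a, emb_b = rng.normal(size=(8, 300)), rng.normal(size=(8, 300))
label_a, label_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mixed_emb, mixed_label = mixup_word_embeddings(emb_a, emb_b, label_a, label_b)
```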