• Corpus ID: 212670663

General-purpose audio tagging by ensembling convolutional neural networks based on multiple features

Kevin Wilkinghoff
This paper describes an audio tagging system that participated in Task 2 “General-purpose audio tagging of Freesound content with AudioSet labels” of the “Detection and Classification of Acoustic Scenes and Events (DCASE)” Challenge 2018. The system is an ensemble consisting of five convolutional neural networks based on Mel-frequency Cepstral Coefficients, Perceptual Linear Prediction features, Mel-spectrograms and the raw audio data. For ensembling all models, score-based fusion via Logistic… 

Audio Tagging by Cross Filtering Noisy Labels
This article presents a novel framework, named CrossFilter, to combat the noisy-label problem in audio tagging; it achieves state-of-the-art performance and even surpasses ensemble models on the FSDKaggle2018 dataset.
On the performance of different excitation-residual blocks for Acoustic Scene Classification
Two novel squeeze-excitation blocks are proposed that improve the accuracy of an ASC framework by modifying the architecture of the residual block in a CNN, and several state-of-the-art blocks are analyzed alongside them.
On Open-Set Classification with L3-Net Embeddings for Machine Listening Applications
Kevin Wilkinghoff · Computer Science · 2020 28th European Signal Processing Conference (EUSIPCO) · 2021
This paper presents a neural network that combines all L3-Net embeddings belonging to one recording into a single vector using an x-vector mechanism, as well as an open-set classification system built on it.
Audio-Based Epileptic Seizure Detection
This paper investigates automatic epileptic seizure detection from audio recordings using convolutional neural networks. It treats all seizure vocalizations as a single target event class and models seizure detection as distinguishing the target class from non-target classes.


CNN architectures for large-scale audio classification
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on the authors' audio classification task, and larger training and label sets help up to a point.
Deep Convolutional Neural Networks and Data Augmentation for Acoustic Event Detection
This work introduces a convolutional neural network (CNN) with a large input field for AED that significantly outperforms state of the art methods including Bag of Audio Words (BoAW) and classical CNNs, achieving a 16% absolute improvement.
Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification
It is shown that the improved performance stems from the combination of a deep, high-capacity model and an augmented training set: this combination outperforms both the proposed CNN without augmentation and a “shallow” dictionary learning model with augmentation.
General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline
The goal of the task is to build an audio tagging system that can recognize the category of an audio clip from a subset of 41 diverse categories drawn from the AudioSet Ontology.
Recognition of acoustic events using deep neural networks
For an acoustic event classification task containing 61 distinct classes, the classification accuracy of the neural network classifier exceeds that of the conventional Gaussian mixture model based hidden Markov model classifier.
Improved Regularization of Convolutional Neural Networks with Cutout
This paper shows that the simple regularization technique of randomly masking out square regions of input during training, which is called cutout, can be used to improve the robustness and overall performance of convolutional neural networks.
Audio Set: An ontology and human-labeled dataset for audio events
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
Rethinking the Inception Architecture for Computer Vision
This work explores ways to scale up networks that use the added computation as efficiently as possible, through suitably factorized convolutions and aggressive regularization.
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Dropout: a simple way to prevent neural networks from overfitting
It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
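
The cutout entry in the reference list above describes randomly masking out square regions of the input during training. A minimal sketch of that idea follows, assuming a single-channel input such as a spectrogram stored as a nested list. The function name `cutout`, the `size` parameter, the zero fill value, and the choice to clip the square at the border are all assumptions of this sketch, not details taken from the listed papers.

```python
import random


def cutout(image, size, rng):
    """Return a copy of a 2-D list `image` with one square region set to 0.0.

    Hypothetical helper for illustration. The square is centered on a
    uniformly chosen pixel and clipped where it crosses the border, so the
    masked area can be smaller than size * size.
    """
    h, w = len(image), len(image[0])
    cy, cx = rng.randrange(h), rng.randrange(w)
    half = size // 2
    y0, y1 = max(cy - half, 0), min(cy + half, h)
    x0, x1 = max(cx - half, 0), min(cx + half, w)
    out = [row[:] for row in image]  # copy rows so the input stays untouched
    for y in range(y0, y1):
        for x in range(x0, x1):
            out[y][x] = 0.0
    return out


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
spec = [[1.0] * 64 for _ in range(64)]  # stand-in for a 64 x 64 spectrogram
masked = cutout(spec, size=16, rng=rng)
zeros = sum(v == 0.0 for row in masked for v in row)
```

In training, a mask like this would typically be redrawn for every example in every epoch, so the network rarely sees the same occluded region twice.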