Convolutional Recurrent Neural Network and Data Augmentation for Audio Tagging with Noisy Labels and Minimal Supervision

Janek Ebbers and Reinhold Häb-Umbach
This report presents our audio tagging system for the DCASE 2019 Challenge Task 2. [...] Due to the limited amount of available data, we use various data augmentation techniques to prevent overfitting and improve generalization. Our best system achieves a label-weighted label-ranking average precision (lwlrap) of 73.0% on the public test set, which is an absolute improvement of 19.3% over the baseline.

Multiple Neural Networks with Ensemble Method for Audio Tagging with Noisy Labels and Minimal Supervision
This system uses a sigmoid-softmax activation to deal with so-called sparse multi-label classification and an ensemble method that averages models learned with multiple neural networks and various acoustic features to achieve label-weighted label-ranking average precision scores.
Forward-Backward Convolutional Recurrent Neural Networks and Tag-Conditioned Convolutional Neural Networks for Weakly Labeled Semi-supervised Sound Event Detection
This paper presents a system for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge and proposes a tag-conditioned CNN to complement SED, trained to predict strong labels while using weak labels as additional input.
Staged Training Strategy and Multi-Activation for Audio Tagging with Noisy and Sparse Multi-Label Data
This paper proposes a staged training strategy to deal with noisy labels, and adopts a sigmoid-sparsemax multi-activation structure to deal with the sparse multi-label classification of audio tagging.
Self-Trained Audio Tagging and Sound Event Detection in Domestic Environments
This paper uses a forward-backward convolutional recurrent neural network (FBCRNN) for tagging and pseudo labeling, followed by tag-conditioned sound event detection (SED) models that are trained using strong pseudo labels provided by the FBCRNN. It also introduces a strong label loss in the objective of the FBCRNN to take advantage of the strongly labeled synthetic data during training.
Comparative Assessment of Data Augmentation for Semi-Supervised Polyphonic Sound Event Detection
This work proposes a CRNN system that exploits unlabeled data with semi-supervised learning based on the “Mean teacher” method, in combination with data augmentation, to overcome the limited size of the training dataset and to further improve performance.
Comparison of Artificial Neural Network Types for Infant Vocalization Classification
A unified neural network architecture scheme for audio classification is defined, from which various network types are derived; the most influential architectural hyperparameters for all types were the integration operations for reducing tensor dimensionality between network stages.
Adapting Sound Recognition to A New Environment Via Self-Training
This paper proposes a self-training based domain adaptation approach, which only requires unlabeled data from the target environment on which a student network is trained, and shows that the student significantly improves recognition performance over the pre-trained teacher without relying on labeled data from the environment the system is deployed in.
FSD50K: An Open Dataset of Human-Labeled Sound Events
FSD50K is introduced, an open dataset containing over 51k audio clips totalling over 100 h of audio manually labeled using 200 classes drawn from the AudioSet Ontology, to provide an alternative benchmark dataset and thus foster SER research.
GISE-51: A scalable isolated sound events dataset
This work introduces GISE-51, a dataset spanning 51 isolated sound events belonging to a broad domain of event types, providing an open, reproducible benchmark for future research along with the freedom to adapt the included isolated sound events for domain-specific applications.


Audio tagging with noisy labels and minimal supervision
This paper presents the task setup, the FSDKaggle2019 dataset prepared for this scientific evaluation, and a baseline system consisting of a convolutional neural network.
Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification
It is shown that the improved performance stems from the combination of a deep, high-capacity model and an augmented training set: this combination outperforms both the proposed CNN without augmentation and a “shallow” dictionary learning model with augmentation.
Learning Sound Event Classifiers from Web Audio with Noisy Labels
Experiments suggest that training with large amounts of noisy data can outperform training with smaller amounts of carefully labeled data, and it is shown that noise-robust loss functions can be effective in improving performance in the presence of corrupted labels.
This technical report describes the proposed design and implementation of the system used for the DCASE 2018 Challenge submission, and proposes data augmentation techniques that shuffle and mix two sounds of the same class to mitigate the unbalanced training dataset.
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
Vocal Tract Length Perturbation (VTLP) improves speech recognition
Improvements in speech recognition are suggested without increasing the number of training epochs, and it is suggested that data transformations should be an important component of training neural networks for speech, especially for data-limited projects.
Audio Set: An ontology and human-labeled dataset for audio events
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
Sound Event Detection in Domestic Environments with Weakly Labeled Data and Soundscape Synthesis
The paper introduces the Domestic Environment Sound Event Detection (DESED) dataset, which mixes part of last year's dataset with an additional synthetic, strongly labeled dataset provided this year, which is described in more detail.
Averaging Weights Leads to Wider Optima and Better Generalization
It is shown that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training, and Stochastic Weight Averaging (SWA) is extremely easy to implement, improves generalization, and has almost no computational overhead.
Generation of Large-Scale Simulated Utterances in Virtual Rooms to Train Deep-Neural Networks for Far-Field Speech Recognition in Google Home
The structure and application of an acoustic room simulator to generate large-scale simulated data for training deep neural networks for far-field speech recognition is described, and performance is evaluated using a factored complex Fast Fourier Transform (CFFT) acoustic model introduced in earlier work.