Training Neural Audio Classifiers with Few Data

@inproceedings{Pons2019TrainingNA,
  title={Training Neural Audio Classifiers with Few Data},
  author={Jordi Pons and Joan Serr{\`a} and Xavier Serra},
  booktitle={ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2019},
  pages={16--20}
}
  • Published 24 October 2018
  • Computer Science
We investigate supervised learning strategies that improve the training of neural network audio classifiers on small annotated collections. In particular, we study whether (i) a naive regularization of the solution space, (ii) prototypical networks, (iii) transfer learning, or (iv) their combination, can foster deep learning models to better leverage a small amount of training examples. To this end, we evaluate (i–iv) for the tasks of acoustic event recognition and acoustic scene classification… 
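One of the strategies the abstract names, prototypical networks, classifies a query by comparing its embedding against one prototype (mean embedding) per class. The sketch below is an illustrative reconstruction with numpy, not the paper's implementation: the 2-D "embeddings" stand in for the output of a trained audio encoder.

```python
import numpy as np

def prototypes(embeddings, labels):
    """Compute one prototype (mean embedding) per class."""
    classes = np.unique(labels)
    return classes, np.stack([embeddings[labels == c].mean(axis=0) for c in classes])

def classify(queries, classes, protos):
    """Assign each query embedding to its nearest prototype (Euclidean distance)."""
    d = np.linalg.norm(queries[:, None, :] - protos[None, :, :], axis=-1)
    return classes[d.argmin(axis=1)]

# Toy 2-D "embeddings" for two acoustic classes (support set)
support = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
labels = np.array([0, 0, 1, 1])

classes, protos = prototypes(support, labels)
queries = np.array([[0.1, 0.0], [4.8, 5.2]])
print(classify(queries, classes, protos))  # → [0 1]
```

Because the classifier is just a nearest-mean rule over embeddings, it needs only a handful of labeled examples per class, which is what makes it attractive in the few-data regime the paper studies.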

A Study of Few-Shot Audio Classification
TLDR
This research addresses two audio classification tasks with the Prototypical Network few-shot learning algorithm, and assesses the performance of various encoder architectures, including recurrent neural networks as well as one- and two-dimensional convolutional neural networks.
Few-Shot Continual Learning for Audio Classification
TLDR
This work introduces a few-shot continual learning framework for audio classification, in which a trained base classifier is continuously expanded to recognize novel classes from only a few labeled examples at inference time, enabling fast and interactive model updates by end-users with minimal effort.
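Expanding a trained classifier with a novel class at inference time is straightforward when classes are represented by prototypes: the new class is just one more mean embedding, and nothing is retrained. The class names and 2-D embeddings below are hypothetical placeholders for a real audio encoder's output.

```python
import numpy as np

class NearestMeanClassifier:
    """Nearest-class-mean classifier that absorbs novel classes from few examples."""

    def __init__(self):
        self.protos = {}  # class label -> prototype embedding

    def add_class(self, label, few_shot_embeddings):
        # A novel class is added as the mean of its few labeled examples;
        # existing class prototypes are left untouched (no retraining).
        self.protos[label] = np.mean(few_shot_embeddings, axis=0)

    def predict(self, x):
        return min(self.protos, key=lambda c: np.linalg.norm(x - self.protos[c]))

clf = NearestMeanClassifier()
clf.add_class("dog_bark", np.array([[1.0, 0.0], [1.2, 0.1]]))
clf.add_class("siren",    np.array([[0.0, 3.0], [0.1, 2.8]]))
print(clf.predict(np.array([1.1, 0.0])))   # → dog_bark

# Later, an end-user provides two examples of a brand-new class:
clf.add_class("rain", np.array([[-2.0, -2.0], [-2.1, -1.9]]))
print(clf.predict(np.array([-2.0, -1.9]))) # → rain
```

The base classes keep working after the expansion because their prototypes never change; the cost of adding a class is one mean over its support examples.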
Variational Information Bottleneck for Effective Low-resource Audio Classification
TLDR
Evaluation on a few audio datasets shows that the VIB framework is ready to use, can easily be combined with many other state-of-the-art network architectures, and outperforms baseline methods.
Learning Hierarchy Aware Embedding From Raw Audio for Acoustic Scene Classification
  • V. Abrol, Pulkit Sharma
  • Computer Science
    IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2020
TLDR
This work proposes a raw-waveform-based end-to-end ASC system using a convolutional neural network that leverages the hierarchical relations between acoustic categories to improve classification performance, and uses a prototypical model.
Prototypical Networks for Domain Adaptation in Acoustic Scene Classification
TLDR
This work explores a metric learning approach called prototypical networks using the TUT Urban Acoustic Scenes dataset, which consists of 10 different acoustic scenes recorded across 10 cities, and concludes that metric learning is a promising approach towards addressing the domain adaptation problem in ASC.
Improving Semi-Supervised Learning for Audio Classification with FixMatch
Including unlabeled data in the training process of neural networks using Semi-Supervised Learning (SSL) has shown impressive results in the image domain, where state-of-the-art results were obtained…
Who Calls The Shots? Rethinking Few-Shot Learning for Audio
TLDR
A series of experiments leads to audio-specific insights on few-shot learning, some of which are at odds with recent findings in the image domain: there is no best one-size-fits-all model, method, or support set selection criterion; the choice depends on the expected application scenario.
On Improved Training of CNN for Acoustic Source Localisation
TLDR
It is found that training with speech or music signals produces a relative improvement in DoA accuracy for a variety of audio classes across 16 acoustic conditions and 9 DoAs, amounting to an average improvement of around 17% and 19% respectively when compared to training with spectrally flat random signals.
Urban Sound Tagging using Convolutional Neural Networks
TLDR
It is shown that using pre-trained image classification models along with the usage of data augmentation techniques results in higher performance over alternative approaches.
Learning from Very Few Samples: A Survey
TLDR
This survey extensively reviews 300+ FSL papers spanning from the 2000s to 2019, providing a timely and comprehensive overview; it categorizes FSL approaches into generative-model-based and discriminative-model-based kinds in principle, with particular emphasis on meta-learning-based FSL approaches.

References

Showing 1–10 of 36 references
CNN architectures for large-scale audio classification
TLDR
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, investigates varying the size of both the training set and the label vocabulary, and finds that analogs of the CNNs used in image classification do well on this audio classification task, with larger training and label sets helping up to a point.
Unsupervised feature learning for audio classification using convolutional deep belief networks
In recent years, deep learning approaches have gained significant interest as a way of building hierarchical representations from unlabeled data. However, to our knowledge, these deep learning…
Deep Learning for Audio Transcription on Low-Resource Datasets
TLDR
This paper proposes factorising the final task of audio transcription into multiple intermediate tasks in order to improve the training performance when dealing with this kind of low-resource datasets.
Data-efficient weakly supervised learning for low-resource audio event detection using deep learning
TLDR
A data-efficient training of a stacked convolutional and recurrent neural network is proposed in a multi-instance learning setting, for which a new loss function is introduced that leads to improved training compared to the usual approaches for weakly supervised learning.
Unsupervised Feature Learning Based on Deep Models for Environmental Audio Tagging
TLDR
A shrinking deep neural network (DNN) framework incorporating unsupervised feature learning is proposed to handle the multilabel classification task, together with a symmetric or asymmetric deep denoising auto-encoder (syDAE or asyDAE) to generate new data-driven features from the logarithmic Mel-filter bank features.
Generative Adversarial Network Based Acoustic Scene Training Set Augmentation and Selection Using SVM Hyperplane
TLDR
This paper proposes to use a Support Vector Machine (SVM) hyperplane for each class as a reference for selecting samples that carry class-discriminative information, and shows that using the generated features can improve ASC performance.
Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification
TLDR
It is shown that the improved performance stems from the combination of a deep, high-capacity model and an augmented training set: this combination outperforms both the proposed CNN without augmentation and a “shallow” dictionary learning model with augmentation.
Optimization as a Model for Few-Shot Learning
Prototypical Networks for Few-shot Learning
TLDR
This work proposes Prototypical Networks for few-shot classification, and provides an analysis showing that some simple design decisions can yield substantial improvements over recent approaches involving complicated architectural choices and meta-learning.
Transfer Learning for Speech Recognition on a Budget
TLDR
This work conducts several systematic experiments adapting a Wav2Letter convolutional neural network originally trained for English ASR to the German language, showing that this technique allows faster training on consumer-grade resources while requiring less training data in order to achieve the same accuracy.
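Transfer learning of the kind this reference describes, and that the main paper evaluates as strategy (iii), reuses a network trained on a large source task and adapts only a small part of it on the scarce target data. A minimal sketch of that idea, with a fixed random projection standing in for a frozen pretrained encoder and a closed-form ridge-regression head as the only trained component (all names and sizes here are illustrative, not from either paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def pretrained_features(x, W):
    """Stand-in for a frozen pretrained encoder: fixed projection + ReLU."""
    return np.maximum(x @ W, 0.0)

# "Pretrained" weights stay frozen; only the linear head below is fit.
W = rng.normal(size=(4, 16))

# Tiny labeled target dataset: 20 examples, 2 classes, one-hot targets.
x = rng.normal(size=(20, 4))
y = np.eye(2)[(x[:, 0] > 0).astype(int)]

feats = pretrained_features(x, W)

# Ridge-regression head: closed-form fit on the frozen features only.
head = np.linalg.solve(feats.T @ feats + 1e-3 * np.eye(16), feats.T @ y)

preds = (feats @ head).argmax(axis=1)
accuracy = (preds == y.argmax(axis=1)).mean()
print(f"training accuracy with frozen encoder: {accuracy:.2f}")
```

Because only the small head is fit, the number of trainable parameters is decoupled from the encoder's size, which is why this setup needs far fewer labeled target examples than training the whole network from scratch.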