Corpus ID: 235694579

Improving Sound Event Classification by Increasing Shift Invariance in Convolutional Neural Networks

@article{Fonseca2021ImprovingSE,
  title={Improving Sound Event Classification by Increasing Shift Invariance in Convolutional Neural Networks},
  author={Eduardo Fonseca and Andr{\'e}s Ferraro and Xavier Serra},
  journal={ArXiv},
  year={2021},
  volume={abs/2107.00623}
}
Recent studies have put into question the commonly assumed shift invariance property of convolutional networks, showing that small shifts in the input can affect the output predictions substantially. In this paper, we ask whether lack of shift invariance is a problem in sound event classification, and whether there are benefits in addressing it. Specifically, we evaluate two pooling methods to improve shift invariance in CNNs, based on low-pass filtering and adaptive sampling of incoming… 


Learning strides in convolutional neural networks

The first downsampling layer with learnable strides, DiffStride, which learns the size of a cropping mask in the Fourier domain, that effectively performs resizing in a differentiable way and allows trading off accuracy for efficiency on ImageNet.
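The Fourier-domain cropping idea can be illustrated with a minimal 1-D sketch (an assumption-laden simplification: the crop size here is fixed, whereas DiffStride learns it through a differentiable mask, and real layers operate on 2-D feature maps):

```python
import numpy as np

def fourier_resize(x, out_len):
    """Downsample a 1-D signal by cropping its spectrum.

    Sketch of the idea behind DiffStride's Fourier-domain cropping:
    keeping only the lowest frequencies and inverting the transform
    performs an ideal low-pass resize of the signal.
    """
    X = np.fft.rfft(x)
    X_cropped = X[: out_len // 2 + 1]  # keep only the lowest frequencies
    # rescale so amplitudes are preserved after the inverse transform
    return np.fft.irfft(X_cropped, n=out_len) * (out_len / len(x))
```

A constant signal survives the resize unchanged, since all its energy sits in the DC bin, which the crop always keeps.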

CMKD: CNN/Transformer-Based Cross-Model Knowledge Distillation for Audio Classification

An intriguing interaction is found between the two very different models: CNN and AST models are good teachers for each other, and when either is used as the teacher while the other is trained as the student via knowledge distillation, the performance of the student model noticeably improves and, in many cases, surpasses that of the teacher.

BYOL for Audio: Exploring Pre-trained General-purpose Audio Representations

It is hypothesized that representations effective for general audio tasks should provide multiple aspects of robust features of the input sound, and a self-supervised learning method is proposed: Bootstrap Your Own Latent for Audio (BYOL-A, pronounced "viola").

FSD50K: An Open Dataset of Human-Labeled Sound Events

FSD50K is introduced, an open dataset containing over 51k audio clips totalling over 100 h of audio manually labeled using 200 classes drawn from the AudioSet Ontology, to provide an alternative benchmark dataset and thus foster SER research.

HEAR 2021: Holistic Evaluation of Audio Representations

Open evaluation code, submitted models and datasets are key contributions, enabling comprehensive and reproducible evaluation, as well as previously impossible longitudinal studies.

HEAR: Holistic Evaluation of Audio Representations

The aim of the HEAR benchmark is to develop a general-purpose audio representation that provides a strong basis for learning in a wide variety of tasks and scenarios, including speech, environmental sound, and music.

References

Showing 1–10 of 33 references

Truly shift-invariant convolutional neural networks

Adaptive polyphase sampling (APS) is proposed, a simple sub-sampling scheme that allows convolutional neural networks to achieve 100% consistency in classification performance under shifts, without any loss in accuracy.
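The core of APS can be sketched in a few lines (a 1-D simplification under assumed circular shifts; the actual method applies this per channel inside a CNN):

```python
import numpy as np

def adaptive_polyphase_downsample(x, stride=2):
    """Shift-consistent 1-D downsampling, a sketch of adaptive polyphase
    sampling (APS).

    Instead of always keeping samples at a fixed phase, split the signal
    into `stride` polyphase components and keep the one with the largest
    l2 norm; a shift of the input then selects the same samples, so the
    output is consistent under shifts.
    """
    phases = [x[p::stride] for p in range(stride)]
    norms = [np.linalg.norm(p) for p in phases]
    return phases[int(np.argmax(norms))]
```

Because the selection criterion (energy) is itself shift-invariant, shifting the input by one sample changes which phase wins but not which values are kept.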

Making Convolutional Networks Shift-Invariant Again

This work demonstrates that anti-aliasing by low-pass filtering before downsampling, a classical signal-processing technique that has been undeservedly overlooked in modern deep networks, is compatible with existing architectural components such as max-pooling and strided convolution.
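The blur-before-subsample operation can be sketched as follows (a 1-D numpy simplification with an assumed 3-tap binomial kernel; the paper's BlurPool applies a 2-D blur per channel inside the network):

```python
import numpy as np

def blur_pool1d(x, stride=2):
    """Anti-aliased downsampling: low-pass filter with a small binomial
    kernel, then subsample with the given stride (sketch of BlurPool)."""
    kernel = np.array([1., 2., 1.]) / 4.0   # binomial low-pass filter
    blurred = np.convolve(x, kernel, mode="same")
    return blurred[::stride]
```

Filtering first removes the high frequencies that plain strided subsampling would otherwise alias, which is what makes the downsampled features less sensitive to small input shifts.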

Evaluation of CNN-based Automatic Music Tagging Models

A consistent evaluation of different music tagging models on three datasets is conducted and reference results using common evaluation metrics are provided and all the models are evaluated with perturbed inputs to investigate the generalization capabilities concerning time stretch, pitch shift, dynamic range compression, and addition of white noise.

CNN architectures for large-scale audio classification

This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both the training set and the label vocabulary, finding that analogs of the CNNs used in image classification do well on this audio classification task, and that larger training and label sets help up to a point.

Model-Agnostic Approaches To Handling Noisy Labels When Training Sound Event Classifiers

This work evaluates simple and efficient model-agnostic approaches to handling noisy labels when training sound event classifiers, namely label smoothing regularization, mixup, and noise-robust loss functions, which can be easily incorporated into existing deep learning pipelines without the need for network modifications or extra resources.
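One of the listed strategies, label smoothing, is simple enough to sketch directly (a minimal illustration with an assumed smoothing factor `eps`; the paper evaluates this alongside mixup and noise-robust losses):

```python
import numpy as np

def smooth_labels(y_onehot, eps=0.1):
    """Label smoothing regularization: mix the one-hot targets with a
    uniform distribution over the k classes, so the model is never
    pushed toward fully confident (and possibly wrong) labels."""
    k = y_onehot.shape[-1]
    return (1.0 - eps) * y_onehot + eps / k
```

The smoothed targets still sum to one, but the loss no longer rewards extreme confidence, which softens the impact of mislabeled examples.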

Why do deep convolutional networks generalize so poorly to small image transformations?

The results indicate that the problem of insuring invariance to small image transformations in neural networks while preserving high accuracy remains unsolved.

PSLA: Improving Audio Event Classification with Pretraining, Sampling, Labeling, and Aggregation

PSLA is presented, a collection of training techniques that can noticeably boost model accuracy, including ImageNet pretraining, balanced sampling, data augmentation, label enhancement, and model aggregation, together with their design choices; the resulting model achieves a new state-of-the-art mean average precision on AudioSet.

Learning Sound Event Classifiers from Web Audio with Noisy Labels

Experiments suggest that training with large amounts of noisy data can outperform training with smaller amounts of carefully-labeled data, and it is shown that noise-robust loss functions can be effective in improving performance in presence of corrupted labels.

End-To-End Auditory Object Recognition Via Inception Nucleus

A novel end-to-end deep neural network maps raw waveform inputs to sound class labels and includes an "inception nucleus" that optimizes the size of convolutional filters on the fly, dramatically reducing engineering effort.

Harmonic Networks: Deep Translation and Rotation Equivariance

H-Nets are presented, a CNN exhibiting equivariance to patch-wise translation and 360° rotation, and it is demonstrated that their layers are general enough to be used in conjunction with the latest architectures and techniques, such as deep supervision and batch normalization.