Corpus ID: 235720897

Improving Sound Event Classification by Increasing Shift Invariance in Convolutional Neural Networks

@inproceedings{Fonseca2021Improving,
  title={Improving Sound Event Classification by Increasing Shift Invariance in Convolutional Neural Networks},
  author={Eduardo Fonseca and Andr{\'e}s Ferraro and Xavier Serra},
  year={2021}
}
Recent studies have put into question the commonly assumed shift invariance property of convolutional networks, showing that small shifts in the input can affect the output predictions substantially. In this paper, we ask whether lack of shift invariance is a problem in sound event classification, and whether there are benefits in addressing it. Specifically, we evaluate two pooling methods to improve shift invariance in CNNs, based on low-pass filtering and adaptive sampling of incoming… 
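The paper's question can be made concrete with a small probe: shift the input along time and check whether the prediction moves. Below is a minimal sketch in PyTorch, assuming a generic `model` that maps a log-mel spectrogram batch of shape (1, 1, mels, frames) to class logits; `model` and the input shape are illustrative assumptions, not the paper's setup.

import torch

def shift_consistency(model, spec, max_shift=8):
    """Fraction of time shifts that leave the predicted class unchanged.

    spec: tensor of shape (1, 1, mels, frames), e.g. a log-mel spectrogram.
    """
    model.eval()
    with torch.no_grad():
        base = model(spec).argmax(dim=1)
        agree = [
            (model(torch.roll(spec, s, dims=-1)).argmax(dim=1) == base).item()
            for s in range(1, max_shift + 1)
        ]
    return sum(agree) / len(agree)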

References

Showing 1-10 of 44 references

Truly shift-invariant convolutional neural networks

Adaptive polyphase sampling (APS) is proposed: a simple sub-sampling scheme that allows convolutional neural networks to achieve 100% consistency in classification performance under shifts, without any loss in accuracy.
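A minimal sketch of the APS idea for stride-2 downsampling of 2-D feature maps: among the four polyphase components, keep the one with the largest norm, so the selection moves with the input instead of being fixed to the even grid. This is an illustration under those assumptions, not the authors' implementation.

import torch

def aps_downsample(x: torch.Tensor, p: float = 2.0) -> torch.Tensor:
    """Stride-2 downsampling keeping the strongest polyphase component.

    x: (B, C, H, W) with even H and W -> (B, C, H/2, W/2)
    """
    # The four stride-2 polyphase components of the feature map.
    comps = torch.stack(
        [x[:, :, i::2, j::2] for i in (0, 1) for j in (0, 1)], dim=0
    )  # (4, B, C, H/2, W/2)
    # p-th power of the l_p norm per sample; monotone, so fine for argmax.
    norms = comps.abs().pow(p).sum(dim=(2, 3, 4))          # (4, B)
    idx = norms.argmax(dim=0)                              # (B,)
    return comps[idx, torch.arange(x.shape[0])]            # (B, C, H/2, W/2)

Because the chosen component shifts together with the input, the downsampled output is the same up to a shift, which is what restores consistency under input translations.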

Making Convolutional Networks Shift-Invariant Again

This work demonstrates that anti-aliasing by low-pass filtering before downsampling, a classical signal processing technique that has been undeservedly overlooked in modern deep networks, is compatible with existing architectural components such as max-pooling and strided convolution.
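A minimal sketch of the anti-aliased pooling this describes, for a 2x max-pool: evaluate the max densely (stride 1), low-pass with a small binomial kernel, then subsample. The 3x3 kernel and padding are illustrative choices, not a definitive implementation.

import torch
import torch.nn.functional as F

def blur_max_pool(x: torch.Tensor) -> torch.Tensor:
    """Anti-aliased 2x max pooling: x (B, C, H, W) -> (B, C, H/2, W/2)."""
    x = F.max_pool2d(x, kernel_size=2, stride=1)   # dense max, no subsampling
    # 3x3 binomial (Pascal's triangle) low-pass filter, applied per channel.
    k1 = torch.tensor([1.0, 2.0, 1.0])
    k2 = torch.outer(k1, k1)
    k2 = (k2 / k2.sum()).to(x)                     # match dtype and device
    weight = k2.repeat(x.shape[1], 1, 1, 1)        # (C, 1, 3, 3)
    return F.conv2d(x, weight, stride=2, padding=1, groups=x.shape[1])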

Evaluation of CNN-based Automatic Music Tagging Models

A consistent evaluation of different music tagging models is conducted on three datasets, providing reference results under common evaluation metrics; all models are also evaluated with perturbed inputs to investigate their generalization capabilities with respect to time stretch, pitch shift, dynamic range compression, and addition of white noise.
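A minimal sketch of such a perturbation study, assuming a `predict` callable that maps a waveform to tag probabilities (an illustrative name, not from the paper) and using standard librosa effects; the perturbation strengths are arbitrary examples.

import numpy as np
import librosa

def perturbed_predictions(predict, y, sr):
    """Run the model on a clean waveform and several perturbed variants."""
    variants = {
        "clean": y,
        "time_stretch": librosa.effects.time_stretch(y, rate=1.1),
        "pitch_shift": librosa.effects.pitch_shift(y, sr=sr, n_steps=2),
        "white_noise": y + 0.005 * np.random.randn(len(y)).astype(y.dtype),
    }
    return {name: predict(x) for name, x in variants.items()}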

CNN architectures for large-scale audio classification

This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both the training set and the label vocabulary, finding that analogs of the CNNs used in image classification do well on this audio classification task, and that larger training and label sets help up to a point.

Model-Agnostic Approaches To Handling Noisy Labels When Training Sound Event Classifiers

This work evaluates simple and efficient model-agnostic approaches to handling noisy labels when training sound event classifiers, namely label smoothing regularization, mixup, and noise-robust loss functions, all of which can be easily incorporated into existing deep learning pipelines without the need for network modifications or extra resources.
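Two of the named techniques are small enough to sketch directly; the smoothing strength and mixup alpha below are illustrative values, not the paper's settings.

import torch

def smooth_labels(targets: torch.Tensor, eps: float = 0.1) -> torch.Tensor:
    """Blend one-hot targets toward the uniform distribution."""
    n_classes = targets.shape[-1]
    return targets * (1.0 - eps) + eps / n_classes

def mixup(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.2):
    """Convex combination of random example pairs and their targets."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.shape[0])
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]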

PSLA: Improving Audio Event Classification with Pretraining, Sampling, Labeling, and Aggregation

PSLA is presented: a collection of training techniques that can noticeably boost model accuracy, including ImageNet pretraining, balanced sampling, data augmentation, label enhancement, and model aggregation. Together with their design choices, these achieve a new state-of-the-art mean average precision on AudioSet.
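One common form of the model aggregation listed here is weight averaging of checkpoints saved along a training run; whether this matches PSLA's exact procedure is an assumption, so treat the sketch below as a generic illustration.

import torch

def average_state_dicts(state_dicts):
    """Element-wise mean of a list of compatible model state dicts.

    Note: integer buffers (e.g. BatchNorm counters) are cast to float here,
    which is acceptable for a sketch but not a production implementation.
    """
    keys = state_dicts[0].keys()
    return {k: torch.stack([sd[k].float() for sd in state_dicts]).mean(dim=0)
            for k in keys}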

Anti-Aliasing Regularization in Stacking Layers

The impact of the commonly used stacking layer in LSTM-based ASR models is studied; it is shown that aliasing is likely to occur, and the proposed anti-aliasing regularization reduces the relative word error rate by up to 5%.
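For context, the stacking layer in question concatenates every n consecutive feature frames and advances by n, i.e. a downsample-by-n step that can alias without a preceding low-pass filter. A minimal sketch of that operation, under the usual (batch, time, features) layout assumption:

import torch

def stack_frames(x: torch.Tensor, n: int = 2) -> torch.Tensor:
    """x: (batch, time, feat) -> (batch, time // n, n * feat)."""
    b, t, f = x.shape
    t = (t // n) * n                       # drop trailing frames
    return x[:, :t].reshape(b, t // n, n * f)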

Why do deep convolutional networks generalize so poorly to small image transformations?

The results indicate that the problem of ensuring invariance to small image transformations in neural networks while preserving high accuracy remains unsolved.

Addressing Missing Labels in Large-Scale Sound Event Recognition Using a Teacher-Student Framework With Loss Masking

This work proposes a simple and model-agnostic method based on a teacher-student framework with loss masking: the most critical missing-label candidates are first identified, and their contribution is then ignored during the learning process. The authors find that this simple optimization of the training label set improves recognition performance without additional computation.
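A minimal sketch of the loss-masking step, assuming a boolean `mask` (an illustrative name) produced upstream by the teacher, with True marking label entries to trust; how the mask is derived is the paper's contribution and is not reproduced here.

import torch
import torch.nn.functional as F

def masked_bce(logits, targets, mask):
    """logits, targets, mask: (batch, n_classes); mask True where trusted."""
    loss = F.binary_cross_entropy_with_logits(logits, targets,
                                              reduction="none")
    mask = mask.float()
    # Average only over the trusted label entries.
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)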

Training general-purpose audio tagging networks with noisy labels and iterative self-verification

This paper describes our submission to the first Freesound general-purpose audio tagging challenge, carried out within the DCASE 2018 challenge. Our proposal is based on a fully convolutional neural network…