ESResNet: Environmental Sound Classification Based on Visual Domain Models

@inproceedings{Guzhov2021ESResNetES,
  title={{ESResNet}: Environmental Sound Classification Based on Visual Domain Models},
  author={Andrey Guzhov and Federico Raue and J{\"o}rn Hees and Andreas R. Dengel},
  booktitle={2020 25th International Conference on Pattern Recognition (ICPR)},
  year={2021},
  pages={4933--4940}
}
Environmental Sound Classification (ESC) is an active research area in the audio domain and has seen considerable progress in recent years. However, many existing approaches achieve high accuracy by relying on domain-specific features and architectures, making it harder to benefit from advances in other fields (e.g., the image domain). Additionally, some past successes have been attributed to a discrepancy in how results are evaluated (i.e., on unofficial splits of the UrbanSound8K…
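The core idea the abstract describes, converting audio into a spectrogram "image" so that visual-domain architectures can classify it, can be sketched as follows. This is an illustrative sketch with hypothetical window and hop parameters, not the paper's exact pipeline (ESResNet feeds STFT-based representations to a ResNet-derived model):

```python
import numpy as np

def log_spectrogram(signal, n_fft=512, hop=128):
    """Compute a log-magnitude STFT spectrogram.

    Parameters here (n_fft=512, hop=128, Hann window) are illustrative
    choices, not the configuration used in the paper.
    """
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    # Shape: (freq_bins, time_frames) -- a 2-D "image" that a visual
    # model such as a ResNet can consume as input.
    spec = np.stack(frames, axis=1)
    return np.log(spec + 1e-10)

# 1 s of synthetic audio at 16 kHz: a pure 440 Hz tone.
sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t).astype(np.float32)
spec = log_spectrogram(audio)
print(spec.shape)  # (257, 122): 257 frequency bins x 122 time frames
```

With 512-point FFTs the frequency resolution is 31.25 Hz per bin, so the 440 Hz tone peaks near bin 14; a real pipeline would typically resize or tile such a spectrogram to match the input shape expected by the image model.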

Figures and Tables from this paper

Citations

ESResNe(X)t-fbsp: Learning Robust Time-Frequency Transformation of Audio
TLDR: A new time-frequency transformation layer based on complex frequency B-spline (fbsp) wavelets is combined with a high-performance audio classification model, providing an accuracy improvement over the previously used Short-Time Fourier Transform (STFT) on standard datasets.
Urban Sound Classification: Striving Towards a Fair Comparison
TLDR: This paper presents the DCASE 2020 Task 5 winning solution, which aims to help monitor urban noise pollution, and provides a fair comparison by using the same input representation, metrics, and optimizer to assess performance.
AudioCLIP: Extending CLIP to Image, Text and Audio
TLDR: The proposed model incorporates the ESResNeXt audio model into the CLIP framework using the AudioSet dataset, enabling it to perform bimodal and unimodal classification and querying while keeping CLIP's ability to generalize to unseen datasets in a zero-shot inference fashion.
PSLA: Improving Audio Event Classification with Pretraining, Sampling, Labeling, and Aggregation
TLDR: PSLA is presented, a collection of training techniques that noticeably boost model accuracy, including ImageNet pretraining, balanced sampling, data augmentation, label enhancement, and model aggregation; together, these design choices achieve a new state-of-the-art mean average precision on AudioSet.
CMKD: CNN/Transformer-Based Cross-Model Knowledge Distillation for Audio Classification
TLDR: An intriguing interaction is found between the two very different models: CNN and AST models are good teachers for each other, and when either is used as the teacher and the other is trained as the student via knowledge distillation, the student's performance noticeably improves, in many cases surpassing the teacher.
ERANNs: Efficient Residual Audio Neural Networks for Audio Pattern Recognition
TLDR: A new convolutional neural network architecture and a method for improving the inference speed of CNN-based systems for audio pattern recognition (APR) tasks are proposed, with the improvements confirmed in experiments on four audio datasets.
CLAR: Contrastive Learning of Auditory Representations
TLDR: By combining all these methods, and with substantially less labeled data, the CLAR framework achieves a significant improvement in prediction performance over the supervised approach and converges faster with significantly better representations.
A Sequential Self Teaching Approach for Improving Generalization in Sound Event Recognition
TLDR: It is harder to learn sounds in adverse situations, such as from weakly labeled and/or noisily labeled data, and in these situations a single stage of learning is not sufficient; a sequential, stage-wise learning process is therefore proposed that improves the generalization capabilities of a given modeling system.
Combination of Time-domain, Frequency-domain, and Cepstral-domain Acoustic Features for Speech Commands Classification
TLDR: A novel improved BSR feature called BSR-float16 is proposed to represent floating-point values more precisely and improve the final classification accuracy; the fusion results also showed better noise robustness.
Unsupervised Discriminative Learning of Sounds for Audio Event Classification
TLDR: On several audio event classification benchmarks, a fast and effective alternative is shown that pre-trains the model unsupervised, only on audio data, and yet delivers performance on par with ImageNet pre-training.

References

Showing 1–10 of 42 references
Learning Attentive Representations for Environmental Sound Classification
TLDR: This work investigates the role of convolution filters in detecting energy modulation patterns and proposes a channel attention mechanism that focuses on the semantically relevant channels generated by the corresponding filters, achieving state-of-the-art or competitive classification accuracy.
Deep Convolutional Neural Network with Mixup for Environmental Sound Classification
TLDR: A novel deep convolutional neural network is proposed for environmental sound classification (ESC) tasks that uses stacked convolutional and pooling layers to extract high-level feature representations from spectrogram-like features.
Environment Sound Classification Using a Two-Stream CNN Based on Decision-Level Fusion
TLDR: The proposed TSCNN-DS model achieves a classification accuracy of 97.2%, the highest classification accuracy on the UrbanSound8K dataset compared to existing models.
Learning environmental sounds with end-to-end convolutional neural network
  • Yuji Tokozume, T. Harada
  • 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017
TLDR: This paper proposes a novel end-to-end ESC system using a convolutional neural network (CNN) and achieves a 6.5% improvement in classification accuracy over the state-of-the-art logmel-CNN with static and delta log-mel features, simply by combining the proposed system with logmel-CNN.
Learning discriminative and robust time-frequency representations for environmental sound classification
TLDR: A new method called the time-frequency enhancement block (TFBlock) is proposed, in which temporal attention and frequency attention are employed to enhance the features from relevant frames and frequency bands, improving classification performance and exhibiting robustness to noise.
Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification
TLDR: It is shown that the improved performance stems from the combination of a deep, high-capacity model and an augmented training set: this combination outperforms both the proposed CNN without augmentation and a "shallow" dictionary learning model with augmentation.
Classifying environmental sounds using image recognition networks
ESC: Dataset for Environmental Sound Classification
TLDR: A new annotated collection of 2,000 short clips comprising 50 classes of common sound events is presented, along with an abundant unified compilation of 250,000 unlabeled auditory excerpts extracted from recordings available through the Freesound project.
Novel Phase Encoded Mel Filterbank Energies for Environmental Sound Classification
TLDR: This paper proposes phase encoded filterbank energies (PEFBEs) for the environmental sound classification task, using a convolutional neural network (CNN) as the pattern classifier for the feature set.